Message ID: 8cf1036167ec5fb58c1d2f70bbb0b678@mail.gmail.com (mailing list archive)
State: New, archived
Hi Kashyap, On Tue, Feb 06, 2018 at 07:57:35PM +0530, Kashyap Desai wrote: > > -----Original Message----- > > From: Ming Lei [mailto:ming.lei@redhat.com] > > Sent: Tuesday, February 6, 2018 6:02 PM > > To: Kashyap Desai > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > Sandoval; > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > Peter > > Rivera; Paolo Bonzini; Laurence Oberman > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > > force_blk_mq > > > > Hi Kashyap, > > > > On Tue, Feb 06, 2018 at 04:59:51PM +0530, Kashyap Desai wrote: > > > > -----Original Message----- > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > Sent: Tuesday, February 6, 2018 1:35 PM > > > > To: Kashyap Desai > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > > > Easi; Omar > > > Sandoval; > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > > Peter > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > introduce force_blk_mq > > > > > > > > Hi Kashyap, > > > > > > > > On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote: > > > > > > > We still have more than one reply queue ending up completion > > > > > > > one > > > CPU. > > > > > > > > > > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that > > > > > > means smp_affinity_enable has to be set as 1, but seems it is > > > > > > the default > > > > > setting. > > > > > > > > > > > > Please see kernel/irq/affinity.c, especially > > > > > > irq_calc_affinity_vectors() > > > > > which > > > > > > figures out an optimal number of vectors, and the computation is > > > > > > based > > > > > on > > > > > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are > > > > > > mapped to some of reply queues, these queues won't be active(no > > > > > > request submitted > > > > > to > > > > > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes > > > > > > sure > > > > > that > > > > > > more than one irq vector won't be handled by one same CPU, and > > > > > > the irq vector spread is done in irq_create_affinity_masks(). > > > > > > > > > > > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver > > > > > > > via module parameter to simulate the issue. We need more > > > > > > > number of Online CPU than reply-queue. > > > > > > > > > > > > IMO, you don't need to simulate the issue, > > > > > > pci_alloc_irq_vectors( > > > > > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the > > > > > > returned > > > > > irq > > > > > > vector number, num_possible_cpus()/num_online_cpus() and each > > > > > > irq vector's affinity assignment. > > > > > > > > > > > > > We may see completion redirected to original CPU because of > > > > > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep > > > > > > > one CPU busy in local ISR routine. > > > > > > > > > > > > Could you dump each irq vector's affinity assignment of your > > > > > > megaraid in > > > > > your > > > > > > test? > > > > > > > > > > To quickly reproduce, I restricted to single MSI-x vector on > > > > > megaraid_sas driver. System has total 16 online CPUs. 
> > > > > > > > I suggest you don't do the restriction of single MSI-x vector, and > > > > just > > > use the > > > > device supported number of msi-x vector. > > > > > > Hi Ming, CPU lock up is seen even though it is not single msi-x > vector. > > > Actual scenario need some specific topology and server for overnight > test. > > > Issue can be seen on servers which has more than 16 logical CPUs and > > > Thunderbolt series MR controller which supports at max 16 MSIx > vectors. > > > > > > > > > > > > > > Output of affinity hints. > > > > > kernel version: > > > > > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 > > > > > x86_64 > > > > > x86_64 > > > > > x86_64 GNU/Linux > > > > > PCI name is 83:00.0, dump its irq affinity: > > > > > irq 105, cpu list 0-3,8-11 > > > > > > > > In this case, which CPU is selected for handling the interrupt is > > > decided by > > > > interrupt controller, and it is easy to cause CPU overload if > > > > interrupt > > > controller > > > > always selects one same CPU to handle the irq. > > > > > > > > > > > > > > Affinity mask is created properly, but only CPU-0 is overloaded > > > > > with interrupt processing. > > > > > > > > > > # numactl --hardware > > > > > available: 2 nodes (0-1) > > > > > node 0 cpus: 0 1 2 3 8 9 10 11 > > > > > node 0 size: 47861 MB > > > > > node 0 free: 46516 MB > > > > > node 1 cpus: 4 5 6 7 12 13 14 15 > > > > > node 1 size: 64491 MB > > > > > node 1 free: 62805 MB > > > > > node distances: > > > > > node 0 1 > > > > > 0: 10 21 > > > > > 1: 21 10 > > > > > > > > > > Output of system activities (sar). (gnice is 100% and it is > > > > > consumed in megaraid_sas ISR routine.) > > > > > > > > > > > > > > > 12:44:40 PM CPU %usr %nice %sys %iowait > %steal > > > > > %irq %soft %guest %gnice %idle > > > > > 12:44:41 PM all 6.03 0.00 29.98 0.00 > > > > > 0.00 0.00 0.00 0.00 0.00 > 63.99 > > > > > 12:44:41 PM 0 0.00 0.00 0.00 > 0.00 > > > > > 0.00 0.00 0.00 0.00 100.00 0 > > > > > > > > > > > > > > > In my test, I used rq_affinity is set to 2. > > > > > (QUEUE_FLAG_SAME_FORCE). I also used " host_tagset" V2 patch set > for > > megaraid_sas. > > > > > > > > > > Using RFC requested in - > > > > > "https://marc.info/?l=linux-scsi&m=151601833418346&w=2 " lockup is > > > > > avoided (you can noticed that gnice is shifted to softirq. Even > > > > > though it is 100% consumed, There is always exit for existing > > > > > completion loop due to irqpoll_weight @irq_poll_init(). > > > > > > > > > > Average: CPU %usr %nice %sys %iowait > %steal > > > > > %irq %soft %guest %gnice %idle > > > > > Average: all 4.25 0.00 21.61 0.00 > > > > > 0.00 0.00 6.61 0.00 0.00 67.54 > > > > > Average: 0 0.00 0.00 0.00 > 0.00 > > > > > 0.00 0.00 100.00 0.00 0.00 0.00 > > > > > > > > > > > > > > > Hope this clarifies. We need different fix to avoid lockups. Can > > > > > we consider using irq poll interface if #CPU is more than Reply > > > queue/MSI-x. > > > > > ? > > > > > > > > Please use the device's supported msi-x vectors number, and see if > > > > there > > > is this > > > > issue. If there is, you can use irq poll too, which isn't > > > > contradictory > > > with the > > > > blk-mq approach taken by this patchset. > > > > > > Device supported scenario need more time to reproduce, but it is more > > > quick method is to just use single MSI-x vector and try to create > > > worst case IO completion loop. > > > Using irq poll, my test run without any CPU lockup. I tried your > > > latest V2 series as well and that is also behaving the same. 
> > > > Again, you can use irq poll, which isn't contradictory with blk-mq. > > Just wanted to explained that issue of CPU lock up is different. Thanks > for clarification. > > > > > > > > BTW - I am seeing drastically performance drop using V2 series of > > > patch on megaraid_sas. Those who is testing HPSA, can also verify if > > > that is a generic behavior. > > > > OK, I will see if I can find a megaraid_sas to see the performance drop > issue. If I > > can't, I will try to run performance test on HPSA. > > Patch is appended. > > > > > Could you share us your patch for enabling global_tags/MQ on > megaraid_sas > > so that I can reproduce your test? > > > > > See below perf top data. "bt_iter" is consuming 4 times more CPU. > > > > Could you share us what the IOPS/CPU utilization effect is after > applying the > > patch V2? And your test script? > Regarding CPU utilization, I need to test one more time. Currently system > is in used. > > I run below fio test on total 24 SSDs expander attached. > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > --ioengine=libaio --rw=randread > > Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > In theory, it shouldn't, because the HBA only supports HBA wide tags, > that > > means the allocation has to share a HBA wide sbitmap no matter if global > tags > > is used or not. > > > > Anyway, I will take a look at the performance test and data. > > > > > > Thanks, > > Ming > > > Megaraid_sas version of shared tag set. Thanks for providing the patch. > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > index 0f1d88f..75ea86b 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -50,6 +50,7 @@ > #include <linux/mutex.h> > #include <linux/poll.h> > #include <linux/vmalloc.h> > +#include <linux/blk-mq-pci.h> > > #include <scsi/scsi.h> > #include <scsi/scsi_cmnd.h> > @@ -220,6 +221,15 @@ static int megasas_get_ld_vf_affiliation(struct > megasas_instance *instance, > static inline void > megasas_init_ctrl_params(struct megasas_instance *instance); > > + > +static int megaraid_sas_map_queues(struct Scsi_Host *shost) > +{ > + struct megasas_instance *instance; > + instance = (struct megasas_instance *)shost->hostdata; > + > + return blk_mq_pci_map_queues(&shost->tag_set, instance->pdev); > +} > + > /** > * megasas_set_dma_settings - Populate DMA address, length and flags for > DCMDs > * @instance: Adapter soft state > @@ -3177,6 +3187,8 @@ struct device_attribute *megaraid_host_attrs[] = { > .use_clustering = ENABLE_CLUSTERING, > .change_queue_depth = scsi_change_queue_depth, > .no_write_same = 1, > + .map_queues = megaraid_sas_map_queues, > + .host_tagset = 1, > }; > > /** > @@ -5965,6 +5977,9 @@ static int megasas_io_attach(struct megasas_instance > *instance) > host->max_lun = MEGASAS_MAX_LUN; > host->max_cmd_len = 16; > > + /* map reply queue to blk_mq hw queue */ > + host->nr_hw_queues = instance->msix_vectors; > + > /* > * Notify the mid-layer about the new controller > */ > diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c > b/drivers/scsi/megaraid/megaraid_sas_fusion.c > index 073ced0..034d976 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > @@ -2655,11 +2655,15 @@ static void megasas_stream_detect(struct > megasas_instance *instance, > fp_possible = (io_info.fpOkForIo > 0) ? 
true : > false; > } > > +#if 0 > /* Use raw_smp_processor_id() for now until cmd->request->cpu is > CPU > id by default, not CPU group id, otherwise all MSI-X queues > won't > be utilized */ > cmd->request_desc->SCSIIO.MSIxIndex = instance->msix_vectors ? > raw_smp_processor_id() % instance->msix_vectors : 0; > +#endif > + > + cmd->request_desc->SCSIIO.MSIxIndex = > blk_mq_unique_tag_to_hwq(scp->request->tag); Looks the above line is wrong, and you just use reply queue 0 in this way to complete rq. Follows the correct usage: cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(blk_mq_unique_tag(scp->request)); > > praid_context = &io_request->RaidContext; > > @@ -2985,9 +2989,13 @@ static void megasas_build_ld_nonrw_fusion(struct > megasas_instance *instance, > } > > cmd->request_desc->SCSIIO.DevHandle = io_request->DevHandle; > + > +#if 0 > cmd->request_desc->SCSIIO.MSIxIndex = > instance->msix_vectors ? > (raw_smp_processor_id() % instance->msix_vectors) : 0; > +#endif > + cmd->request_desc->SCSIIO.MSIxIndex = > blk_mq_unique_tag_to_hwq(scmd->request->tag); Same with above, could you fix the patch and run your performance test again? Thanks Ming
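To make Ming's correction at the end of the message above easier to follow: blk_mq_unique_tag() packs the hardware queue index into the upper bits of the value it returns (roughly (hwq << BLK_MQ_UNIQUE_TAG_BITS) | tag), and blk_mq_unique_tag_to_hwq() simply shifts those bits back out. The bare scp->request->tag never comes close to 2^16, so feeding it in directly always decodes to hardware queue 0, which is exactly why only reply queue 0 was being used. A hedged sketch of the corrected hunk in context (the local variable name is illustrative):

	u32 unique_tag = blk_mq_unique_tag(scp->request);

	/* the upper bits carry the hw queue index chosen by blk-mq */
	cmd->request_desc->SCSIIO.MSIxIndex =
			blk_mq_unique_tag_to_hwq(unique_tag);

Separately, since the irq poll approach discussed in this message is only referenced by a link, below is a hedged sketch of the generic irq_poll pattern (the structure and function names are illustrative, not the actual megaraid_sas RFC): the hard interrupt handler only schedules the poller, and completions are then reaped in budget-sized batches, so a continuous stream of completions cannot pin one CPU in hard-IRQ context.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

struct example_reply_queue {
	struct irq_poll iop;
	/* ... per-queue completion ring state ... */
};

/* hypothetical helper: reap up to 'budget' completions, return how many */
static int example_process_completions(struct example_reply_queue *q, int budget);

static int example_irq_poll(struct irq_poll *iop, int budget)
{
	struct example_reply_queue *q =
			container_of(iop, struct example_reply_queue, iop);
	int done = example_process_completions(q, budget);

	if (done < budget)
		irq_poll_complete(iop);	/* ring drained, re-arm the interrupt */
	return done;
}

static irqreturn_t example_isr(int irq, void *data)
{
	struct example_reply_queue *q = data;

	irq_poll_sched(&q->iop);	/* defer completion processing */
	return IRQ_HANDLED;
}

/* at init time: irq_poll_init(&q->iop, weight, example_irq_poll); */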
Hi all,

[ .. ]
>>
>> Could you share us your patch for enabling global_tags/MQ on
> megaraid_sas
>> so that I can reproduce your test?
>>
>>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
>>
>> Could you share us what the IOPS/CPU utilization effect is after
> applying the
>> patch V2? And your test script?
> Regarding CPU utilization, I need to test one more time. Currently system
> is in used.
>
> I run below fio test on total 24 SSDs expander attached.
>
> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> --ioengine=libaio --rw=randread
>
> Performance dropped from 1.6 M IOPs to 770K IOPs.
>
This is basically what we've seen with earlier iterations.
>>
>> In theory, it shouldn't, because the HBA only supports HBA wide tags,
>> that means the allocation has to share a HBA wide sbitmap no matter
>> if global tags is used or not.
>>
>> Anyway, I will take a look at the performance test and data.
>>
>>
>> Thanks,
>> Ming
>
>
> Megaraid_sas version of shared tag set.
>
Whee; thanks for that.
I've just finished a patchset moving megaraid_sas_fusion to embedded
commands (and cutting down the size of 'struct megasas_cmd_fusion' by
half :-), so that will come in just handy.
Will give it a spin.

Cheers,

Hannes
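For readers unfamiliar with the term, "embedded commands" here refers to letting the SCSI midlayer allocate the driver's per-command context together with each request, instead of the driver maintaining its own command pool. A hedged sketch of the general pattern (illustrative names; this is not Hannes' actual megaraid_sas patchset):

#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

/* driver-private state that used to live in a separately allocated pool */
struct example_cmd {
	u16	msix_index;
	/* ... request frame pointers, SGL state, etc. ... */
};

static int example_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
{
	/* midlayer-allocated area, sized by .cmd_size below; no pool lookup */
	struct example_cmd *cmd = scsi_cmd_priv(scmd);

	cmd->msix_index = 0;	/* filled in properly by the real driver */
	/* ... build and fire the command ... */
	return 0;
}

static struct scsi_host_template example_template = {
	.queuecommand	= example_queuecommand,
	.cmd_size	= sizeof(struct example_cmd),	/* embed example_cmd in each request */
};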
On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote: > Hi all, > > [ .. ] > >> > >> Could you share us your patch for enabling global_tags/MQ on > > megaraid_sas > >> so that I can reproduce your test? > >> > >>> See below perf top data. "bt_iter" is consuming 4 times more CPU. > >> > >> Could you share us what the IOPS/CPU utilization effect is after > > applying the > >> patch V2? And your test script? > > Regarding CPU utilization, I need to test one more time. Currently system > > is in used. > > > > I run below fio test on total 24 SSDs expander attached. > > > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > --ioengine=libaio --rw=randread > > > > Performance dropped from 1.6 M IOPs to 770K IOPs. > > > This is basically what we've seen with earlier iterations. Hi Hannes, As I mentioned in another mail[1], Kashyap's patch has a big issue, which causes only reply queue 0 used. [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 So could you guys run your performance test again after fixing the patch? Thanks, Ming
> -----Original Message----- > From: Ming Lei [mailto:ming.lei@redhat.com] > Sent: Wednesday, February 7, 2018 5:53 PM > To: Hannes Reinecke > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter > Rivera; Paolo Bonzini; Laurence Oberman > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > force_blk_mq > > On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote: > > Hi all, > > > > [ .. ] > > >> > > >> Could you share us your patch for enabling global_tags/MQ on > > > megaraid_sas > > >> so that I can reproduce your test? > > >> > > >>> See below perf top data. "bt_iter" is consuming 4 times more CPU. > > >> > > >> Could you share us what the IOPS/CPU utilization effect is after > > > applying the > > >> patch V2? And your test script? > > > Regarding CPU utilization, I need to test one more time. Currently > > > system is in used. > > > > > > I run below fio test on total 24 SSDs expander attached. > > > > > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > > --ioengine=libaio --rw=randread > > > > > > Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > This is basically what we've seen with earlier iterations. > > Hi Hannes, > > As I mentioned in another mail[1], Kashyap's patch has a big issue, which > causes only reply queue 0 used. > > [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > So could you guys run your performance test again after fixing the patch? Ming - I tried after change you requested. Performance drop is still unresolved. From 1.6 M IOPS to 770K IOPS. See below data. All 24 reply queue is in used correctly. 
IRQs / 1 second(s)
IRQ#   TOTAL  NODE0  NODE1  NAME
 360   16422      0  16422  IR-PCI-MSI 70254653-edge megasas
 364   15980      0  15980  IR-PCI-MSI 70254657-edge megasas
 362   15979      0  15979  IR-PCI-MSI 70254655-edge megasas
 345   15696      0  15696  IR-PCI-MSI 70254638-edge megasas
 341   15659      0  15659  IR-PCI-MSI 70254634-edge megasas
 369   15656      0  15656  IR-PCI-MSI 70254662-edge megasas
 359   15650      0  15650  IR-PCI-MSI 70254652-edge megasas
 358   15596      0  15596  IR-PCI-MSI 70254651-edge megasas
 350   15574      0  15574  IR-PCI-MSI 70254643-edge megasas
 342   15532      0  15532  IR-PCI-MSI 70254635-edge megasas
 344   15527      0  15527  IR-PCI-MSI 70254637-edge megasas
 346   15485      0  15485  IR-PCI-MSI 70254639-edge megasas
 361   15482      0  15482  IR-PCI-MSI 70254654-edge megasas
 348   15467      0  15467  IR-PCI-MSI 70254641-edge megasas
 368   15463      0  15463  IR-PCI-MSI 70254661-edge megasas
 354   15420      0  15420  IR-PCI-MSI 70254647-edge megasas
 351   15378      0  15378  IR-PCI-MSI 70254644-edge megasas
 352   15377      0  15377  IR-PCI-MSI 70254645-edge megasas
 356   15348      0  15348  IR-PCI-MSI 70254649-edge megasas
 337   15344      0  15344  IR-PCI-MSI 70254630-edge megasas
 343   15320      0  15320  IR-PCI-MSI 70254636-edge megasas
 355   15266      0  15266  IR-PCI-MSI 70254648-edge megasas
 335   15247      0  15247  IR-PCI-MSI 70254628-edge megasas
 363   15233      0  15233  IR-PCI-MSI 70254656-edge megasas

Average:  CPU   %usr  %nice   %sys  %iowait  %steal  %irq  %soft  %guest  %gnice  %idle
Average:   18   3.80   0.00  14.78    10.08    0.00  0.00   4.01    0.00    0.00  67.33
Average:   19   3.26   0.00  15.35    10.62    0.00  0.00   4.03    0.00    0.00  66.74
Average:   20   3.42   0.00  14.57    10.67    0.00  0.00   3.84    0.00    0.00  67.50
Average:   21   3.19   0.00  15.60    10.75    0.00  0.00   4.16    0.00    0.00  66.30
Average:   22   3.58   0.00  15.15    10.66    0.00  0.00   3.51    0.00    0.00  67.11
Average:   23   3.34   0.00  15.36    10.63    0.00  0.00   4.17    0.00    0.00  66.50
Average:   24   3.50   0.00  14.58    10.93    0.00  0.00   3.85    0.00    0.00  67.13
Average:   25   3.20   0.00  14.68    10.86    0.00  0.00   4.31    0.00    0.00  66.95
Average:   26   3.27   0.00  14.80    10.70    0.00  0.00   3.68    0.00    0.00  67.55
Average:   27   3.58   0.00  15.36    10.80    0.00  0.00   3.79    0.00    0.00  66.48
Average:   28   3.46   0.00  15.17    10.46    0.00  0.00   3.32    0.00    0.00  67.59
Average:   29   3.34   0.00  14.42    10.72    0.00  0.00   3.34    0.00    0.00  68.18
Average:   30   3.34   0.00  15.08    10.70    0.00  0.00   3.89    0.00    0.00  66.99
Average:   31   3.26   0.00  15.33    10.47    0.00  0.00   3.33    0.00    0.00  67.61
Average:   32   3.21   0.00  14.80    10.61    0.00  0.00   3.70    0.00    0.00  67.67
Average:   33   3.40   0.00  13.88    10.55    0.00  0.00   4.02    0.00    0.00  68.15
Average:   34   3.74   0.00  17.41    10.61    0.00  0.00   4.51    0.00    0.00  63.73
Average:   35   3.35   0.00  14.37    10.74    0.00  0.00   3.84    0.00    0.00  67.71
Average:   36   0.54   0.00   1.77     0.00    0.00  0.00   0.00    0.00    0.00  97.69
..
Average:   54   3.60   0.00  15.17    10.39    0.00  0.00   4.22    0.00    0.00  66.62
Average:   55   3.33   0.00  14.85    10.55    0.00  0.00   3.96    0.00    0.00  67.31
Average:   56   3.40   0.00  15.19    10.54    0.00  0.00   3.74    0.00    0.00  67.13
Average:   57   3.41   0.00  13.98    10.78    0.00  0.00   4.10    0.00    0.00  67.73
Average:   58   3.32   0.00  15.16    10.52    0.00  0.00   4.01    0.00    0.00  66.99
Average:   59   3.17   0.00  15.80    10.35    0.00  0.00   3.86    0.00    0.00  66.80
Average:   60   3.00   0.00  14.63    10.59    0.00  0.00   3.97    0.00    0.00  67.80
Average:   61   3.34   0.00  14.70    10.66    0.00  0.00   4.32    0.00    0.00  66.97
Average:   62   3.34   0.00  15.29    10.56    0.00  0.00   3.89    0.00    0.00  66.92
Average:   63   3.29   0.00  14.51    10.72    0.00  0.00   3.85    0.00    0.00  67.62
Average:   64   3.48   0.00  15.31    10.65    0.00  0.00   3.97    0.00    0.00  66.60
Average:   65   3.34   0.00  14.36    10.80    0.00  0.00   4.11    0.00    0.00  67.39
Average:   66   3.13   0.00  14.94    10.70    0.00  0.00   4.10    0.00    0.00  67.13
Average:   67   3.06   0.00  15.56    10.69    0.00  0.00   3.82    0.00    0.00  66.88
Average:   68   3.33   0.00  14.98    10.61    0.00  0.00   3.81    0.00    0.00  67.27
Average:   69   3.20   0.00  15.43    10.70    0.00  0.00   3.82    0.00    0.00  66.85
Average:   70   3.34   0.00  17.14    10.59    0.00  0.00   3.00    0.00    0.00  65.92
Average:   71   3.41   0.00  14.94    10.56    0.00  0.00   3.41    0.00    0.00  67.69

Perf top -

 64.33%  [kernel]  [k] bt_iter
  4.86%  [kernel]  [k] blk_mq_queue_tag_busy_iter
  4.23%  [kernel]  [k] _find_next_bit
  2.40%  [kernel]  [k] native_queued_spin_lock_slowpath
  1.09%  [kernel]  [k] sbitmap_any_bit_set
  0.71%  [kernel]  [k] sbitmap_queue_clear
  0.63%  [kernel]  [k] find_next_bit
  0.54%  [kernel]  [k] _raw_spin_lock_irqsave

>
>
> Thanks,
> Ming
Hi Kashyap, On Wed, Feb 07, 2018 at 07:44:04PM +0530, Kashyap Desai wrote: > > -----Original Message----- > > From: Ming Lei [mailto:ming.lei@redhat.com] > > Sent: Wednesday, February 7, 2018 5:53 PM > > To: Hannes Reinecke > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph > > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > Sandoval; > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > Peter > > Rivera; Paolo Bonzini; Laurence Oberman > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > > force_blk_mq > > > > On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote: > > > Hi all, > > > > > > [ .. ] > > > >> > > > >> Could you share us your patch for enabling global_tags/MQ on > > > > megaraid_sas > > > >> so that I can reproduce your test? > > > >> > > > >>> See below perf top data. "bt_iter" is consuming 4 times more CPU. > > > >> > > > >> Could you share us what the IOPS/CPU utilization effect is after > > > > applying the > > > >> patch V2? And your test script? > > > > Regarding CPU utilization, I need to test one more time. Currently > > > > system is in used. > > > > > > > > I run below fio test on total 24 SSDs expander attached. > > > > > > > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > > > --ioengine=libaio --rw=randread > > > > > > > > Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > > > This is basically what we've seen with earlier iterations. > > > > Hi Hannes, > > > > As I mentioned in another mail[1], Kashyap's patch has a big issue, > which > > causes only reply queue 0 used. > > > > [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > > > So could you guys run your performance test again after fixing the > patch? > > Ming - > > I tried after change you requested. Performance drop is still unresolved. > From 1.6 M IOPS to 770K IOPS. > > See below data. All 24 reply queue is in used correctly. 
> > IRQs / 1 second(s) > IRQ# TOTAL NODE0 NODE1 NAME > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > Average: CPU %usr %nice %sys %iowait %steal > %irq %soft %guest %gnice %idle > Average: 18 3.80 0.00 14.78 10.08 0.00 > 0.00 4.01 0.00 0.00 67.33 > Average: 19 3.26 0.00 15.35 10.62 0.00 > 0.00 4.03 0.00 0.00 66.74 > Average: 20 3.42 0.00 14.57 10.67 0.00 > 0.00 3.84 0.00 0.00 67.50 > Average: 21 3.19 0.00 15.60 10.75 0.00 > 0.00 4.16 0.00 0.00 66.30 > Average: 22 3.58 0.00 15.15 10.66 0.00 > 0.00 3.51 0.00 0.00 67.11 > Average: 23 3.34 0.00 15.36 10.63 0.00 > 0.00 4.17 0.00 0.00 66.50 > Average: 24 3.50 0.00 14.58 10.93 0.00 > 0.00 3.85 0.00 0.00 67.13 > Average: 25 3.20 0.00 14.68 10.86 0.00 > 0.00 4.31 0.00 0.00 66.95 > Average: 26 3.27 0.00 14.80 10.70 0.00 > 0.00 3.68 0.00 0.00 67.55 > Average: 27 3.58 0.00 15.36 10.80 0.00 > 0.00 3.79 0.00 0.00 66.48 > Average: 28 3.46 0.00 15.17 10.46 0.00 > 0.00 3.32 0.00 0.00 67.59 > Average: 29 3.34 0.00 14.42 10.72 0.00 > 0.00 3.34 0.00 0.00 68.18 > Average: 30 3.34 0.00 15.08 10.70 0.00 > 0.00 3.89 0.00 0.00 66.99 > Average: 31 3.26 0.00 15.33 10.47 0.00 > 0.00 3.33 0.00 0.00 67.61 > Average: 32 3.21 0.00 14.80 10.61 0.00 > 0.00 3.70 0.00 0.00 67.67 > Average: 33 3.40 0.00 13.88 10.55 0.00 > 0.00 4.02 0.00 0.00 68.15 > Average: 34 3.74 0.00 17.41 10.61 0.00 > 0.00 4.51 0.00 0.00 63.73 > Average: 35 3.35 0.00 14.37 10.74 0.00 > 0.00 3.84 0.00 0.00 67.71 > Average: 36 0.54 0.00 1.77 0.00 0.00 > 0.00 0.00 0.00 0.00 97.69 > .. 
> Average: 54 3.60 0.00 15.17 10.39 0.00 > 0.00 4.22 0.00 0.00 66.62 > Average: 55 3.33 0.00 14.85 10.55 0.00 > 0.00 3.96 0.00 0.00 67.31 > Average: 56 3.40 0.00 15.19 10.54 0.00 > 0.00 3.74 0.00 0.00 67.13 > Average: 57 3.41 0.00 13.98 10.78 0.00 > 0.00 4.10 0.00 0.00 67.73 > Average: 58 3.32 0.00 15.16 10.52 0.00 > 0.00 4.01 0.00 0.00 66.99 > Average: 59 3.17 0.00 15.80 10.35 0.00 > 0.00 3.86 0.00 0.00 66.80 > Average: 60 3.00 0.00 14.63 10.59 0.00 > 0.00 3.97 0.00 0.00 67.80 > Average: 61 3.34 0.00 14.70 10.66 0.00 > 0.00 4.32 0.00 0.00 66.97 > Average: 62 3.34 0.00 15.29 10.56 0.00 > 0.00 3.89 0.00 0.00 66.92 > Average: 63 3.29 0.00 14.51 10.72 0.00 > 0.00 3.85 0.00 0.00 67.62 > Average: 64 3.48 0.00 15.31 10.65 0.00 > 0.00 3.97 0.00 0.00 66.60 > Average: 65 3.34 0.00 14.36 10.80 0.00 > 0.00 4.11 0.00 0.00 67.39 > Average: 66 3.13 0.00 14.94 10.70 0.00 > 0.00 4.10 0.00 0.00 67.13 > Average: 67 3.06 0.00 15.56 10.69 0.00 > 0.00 3.82 0.00 0.00 66.88 > Average: 68 3.33 0.00 14.98 10.61 0.00 > 0.00 3.81 0.00 0.00 67.27 > Average: 69 3.20 0.00 15.43 10.70 0.00 > 0.00 3.82 0.00 0.00 66.85 > Average: 70 3.34 0.00 17.14 10.59 0.00 > 0.00 3.00 0.00 0.00 65.92 > Average: 71 3.41 0.00 14.94 10.56 0.00 > 0.00 3.41 0.00 0.00 67.69 > > Perf top - > > 64.33% [kernel] [k] bt_iter > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > 4.23% [kernel] [k] _find_next_bit > 2.40% [kernel] [k] native_queued_spin_lock_slowpath > 1.09% [kernel] [k] sbitmap_any_bit_set > 0.71% [kernel] [k] sbitmap_queue_clear > 0.63% [kernel] [k] find_next_bit > 0.54% [kernel] [k] _raw_spin_lock_irqsave The above trace says nothing about the performance drop, and it just means some disk stat utilities are crazy reading /proc/diskstats or /sys/block/sda/stat, see below, and the performance drop might be related with this crazy reading too. bt_iter <-bt_for_each <-blk_mq_queue_tag_busy_iter <-blk_mq_in_flight <-part_in_flight <-part_stat_show <-diskstats_show <-part_round_stats <-blk_mq_timeout_work If you are using fio to run the test, could you show us the fio log(with and without the patchset) and don't start any disk stat utilities meantime? Also seems none is the default scheduler after this patchset is applied, could you run same test with mq-deadline? Thanks, Ming
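Ming's call chain above is the crux: with a host-wide (global) tag set, every hardware queue shares the same tags, and blk_mq_queue_tag_busy_iter() walks the tag bitmap once per hardware queue, so each read of /proc/diskstats or /sys/block/sdX/stat now rescans the same large bitmap ~24 times instead of once. The code below is a deliberately simplified model of that shape, using hypothetical types and names rather than the kernel source:

struct example_rq {
	int		busy;		/* request currently in flight */
	const void	*disk;		/* which gendisk it targets */
};

struct example_tagset {
	unsigned int	nr_hw_queues;	/* 24 reply queues in the test above */
	unsigned int	depth;		/* host-wide tag depth shared by all hctxs */
	struct example_rq **rqs;	/* tag -> request lookup, shared */
};

/* crude model of blk_mq_in_flight() on top of a shared (global) tag set */
static unsigned int example_in_flight(const struct example_tagset *ts,
				      const void *disk)
{
	unsigned int hctx, tag, inflight = 0;

	for (hctx = 0; hctx < ts->nr_hw_queues; hctx++) {
		/* a bt_iter()-style walk; the same bitmap is rescanned per hctx */
		for (tag = 0; tag < ts->depth; tag++) {
			const struct example_rq *rq = ts->rqs[tag];

			if (rq && rq->busy && rq->disk == disk)
				inflight++;
		}
	}
	return inflight;
}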
On 02/07/2018 03:14 PM, Kashyap Desai wrote: >> -----Original Message----- >> From: Ming Lei [mailto:ming.lei@redhat.com] >> Sent: Wednesday, February 7, 2018 5:53 PM >> To: Hannes Reinecke >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph >> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > Sandoval; >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > Peter >> Rivera; Paolo Bonzini; Laurence Oberman >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce >> force_blk_mq >> >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote: >>> Hi all, >>> >>> [ .. ] >>>>> >>>>> Could you share us your patch for enabling global_tags/MQ on >>>> megaraid_sas >>>>> so that I can reproduce your test? >>>>> >>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU. >>>>> >>>>> Could you share us what the IOPS/CPU utilization effect is after >>>> applying the >>>>> patch V2? And your test script? >>>> Regarding CPU utilization, I need to test one more time. Currently >>>> system is in used. >>>> >>>> I run below fio test on total 24 SSDs expander attached. >>>> >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k >>>> --ioengine=libaio --rw=randread >>>> >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. >>>> >>> This is basically what we've seen with earlier iterations. >> >> Hi Hannes, >> >> As I mentioned in another mail[1], Kashyap's patch has a big issue, > which >> causes only reply queue 0 used. >> >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 >> >> So could you guys run your performance test again after fixing the > patch? > > Ming - > > I tried after change you requested. Performance drop is still unresolved. > From 1.6 M IOPS to 770K IOPS. > > See below data. All 24 reply queue is in used correctly. 
> > IRQs / 1 second(s) > IRQ# TOTAL NODE0 NODE1 NAME > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > Average: CPU %usr %nice %sys %iowait %steal > %irq %soft %guest %gnice %idle > Average: 18 3.80 0.00 14.78 10.08 0.00 > 0.00 4.01 0.00 0.00 67.33 > Average: 19 3.26 0.00 15.35 10.62 0.00 > 0.00 4.03 0.00 0.00 66.74 > Average: 20 3.42 0.00 14.57 10.67 0.00 > 0.00 3.84 0.00 0.00 67.50 > Average: 21 3.19 0.00 15.60 10.75 0.00 > 0.00 4.16 0.00 0.00 66.30 > Average: 22 3.58 0.00 15.15 10.66 0.00 > 0.00 3.51 0.00 0.00 67.11 > Average: 23 3.34 0.00 15.36 10.63 0.00 > 0.00 4.17 0.00 0.00 66.50 > Average: 24 3.50 0.00 14.58 10.93 0.00 > 0.00 3.85 0.00 0.00 67.13 > Average: 25 3.20 0.00 14.68 10.86 0.00 > 0.00 4.31 0.00 0.00 66.95 > Average: 26 3.27 0.00 14.80 10.70 0.00 > 0.00 3.68 0.00 0.00 67.55 > Average: 27 3.58 0.00 15.36 10.80 0.00 > 0.00 3.79 0.00 0.00 66.48 > Average: 28 3.46 0.00 15.17 10.46 0.00 > 0.00 3.32 0.00 0.00 67.59 > Average: 29 3.34 0.00 14.42 10.72 0.00 > 0.00 3.34 0.00 0.00 68.18 > Average: 30 3.34 0.00 15.08 10.70 0.00 > 0.00 3.89 0.00 0.00 66.99 > Average: 31 3.26 0.00 15.33 10.47 0.00 > 0.00 3.33 0.00 0.00 67.61 > Average: 32 3.21 0.00 14.80 10.61 0.00 > 0.00 3.70 0.00 0.00 67.67 > Average: 33 3.40 0.00 13.88 10.55 0.00 > 0.00 4.02 0.00 0.00 68.15 > Average: 34 3.74 0.00 17.41 10.61 0.00 > 0.00 4.51 0.00 0.00 63.73 > Average: 35 3.35 0.00 14.37 10.74 0.00 > 0.00 3.84 0.00 0.00 67.71 > Average: 36 0.54 0.00 1.77 0.00 0.00 > 0.00 0.00 0.00 0.00 97.69 > .. 
> Average: 54 3.60 0.00 15.17 10.39 0.00 > 0.00 4.22 0.00 0.00 66.62 > Average: 55 3.33 0.00 14.85 10.55 0.00 > 0.00 3.96 0.00 0.00 67.31 > Average: 56 3.40 0.00 15.19 10.54 0.00 > 0.00 3.74 0.00 0.00 67.13 > Average: 57 3.41 0.00 13.98 10.78 0.00 > 0.00 4.10 0.00 0.00 67.73 > Average: 58 3.32 0.00 15.16 10.52 0.00 > 0.00 4.01 0.00 0.00 66.99 > Average: 59 3.17 0.00 15.80 10.35 0.00 > 0.00 3.86 0.00 0.00 66.80 > Average: 60 3.00 0.00 14.63 10.59 0.00 > 0.00 3.97 0.00 0.00 67.80 > Average: 61 3.34 0.00 14.70 10.66 0.00 > 0.00 4.32 0.00 0.00 66.97 > Average: 62 3.34 0.00 15.29 10.56 0.00 > 0.00 3.89 0.00 0.00 66.92 > Average: 63 3.29 0.00 14.51 10.72 0.00 > 0.00 3.85 0.00 0.00 67.62 > Average: 64 3.48 0.00 15.31 10.65 0.00 > 0.00 3.97 0.00 0.00 66.60 > Average: 65 3.34 0.00 14.36 10.80 0.00 > 0.00 4.11 0.00 0.00 67.39 > Average: 66 3.13 0.00 14.94 10.70 0.00 > 0.00 4.10 0.00 0.00 67.13 > Average: 67 3.06 0.00 15.56 10.69 0.00 > 0.00 3.82 0.00 0.00 66.88 > Average: 68 3.33 0.00 14.98 10.61 0.00 > 0.00 3.81 0.00 0.00 67.27 > Average: 69 3.20 0.00 15.43 10.70 0.00 > 0.00 3.82 0.00 0.00 66.85 > Average: 70 3.34 0.00 17.14 10.59 0.00 > 0.00 3.00 0.00 0.00 65.92 > Average: 71 3.41 0.00 14.94 10.56 0.00 > 0.00 3.41 0.00 0.00 67.69 > > Perf top - > > 64.33% [kernel] [k] bt_iter > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > 4.23% [kernel] [k] _find_next_bit > 2.40% [kernel] [k] native_queued_spin_lock_slowpath > 1.09% [kernel] [k] sbitmap_any_bit_set > 0.71% [kernel] [k] sbitmap_queue_clear > 0.63% [kernel] [k] find_next_bit > 0.54% [kernel] [k] _raw_spin_lock_irqsave > Ah. So we're spending quite some time in trying to find a free tag. I guess this is due to every queue starting at the same position trying to find a free tag, which inevitably leads to a contention. Can't we lay out the pointers so that each queue starts looking for free bits at a _different_ location? IE if we evenly spread the initial position for each queue and use a round-robin algorithm we should be getting better results, methinks. I'll give it a go once the hickups with converting megaraid_sas to embedded commands are done with :-( Cheers, Hannes
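For illustration only, one way to picture the spreading Hannes suggests (a hedged sketch, not an actual patch, and whether it helps depends on whether tag allocation really is the contended path, which Ming questions in the next message): give each hardware queue a different starting offset into the shared tag bitmap, so allocators stop piling onto the same leading words.

/* hypothetical helper: evenly spaced starting offset per hw queue */
static unsigned int example_tag_search_start(unsigned int hctx_idx,
					     unsigned int nr_hw_queues,
					     unsigned int depth)
{
	return (hctx_idx * depth) / nr_hw_queues;
}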
On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote: > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > >> -----Original Message----- > >> From: Ming Lei [mailto:ming.lei@redhat.com] > >> Sent: Wednesday, February 7, 2018 5:53 PM > >> To: Hannes Reinecke > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph > >> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > > Sandoval; > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > Peter > >> Rivera; Paolo Bonzini; Laurence Oberman > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > >> force_blk_mq > >> > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote: > >>> Hi all, > >>> > >>> [ .. ] > >>>>> > >>>>> Could you share us your patch for enabling global_tags/MQ on > >>>> megaraid_sas > >>>>> so that I can reproduce your test? > >>>>> > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU. > >>>>> > >>>>> Could you share us what the IOPS/CPU utilization effect is after > >>>> applying the > >>>>> patch V2? And your test script? > >>>> Regarding CPU utilization, I need to test one more time. Currently > >>>> system is in used. > >>>> > >>>> I run below fio test on total 24 SSDs expander attached. > >>>> > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > >>>> --ioengine=libaio --rw=randread > >>>> > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > >>>> > >>> This is basically what we've seen with earlier iterations. > >> > >> Hi Hannes, > >> > >> As I mentioned in another mail[1], Kashyap's patch has a big issue, > > which > >> causes only reply queue 0 used. > >> > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > >> > >> So could you guys run your performance test again after fixing the > > patch? > > > > Ming - > > > > I tried after change you requested. Performance drop is still unresolved. > > From 1.6 M IOPS to 770K IOPS. > > > > See below data. All 24 reply queue is in used correctly. 
> > > > IRQs / 1 second(s) > > IRQ# TOTAL NODE0 NODE1 NAME > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > > > > Average: CPU %usr %nice %sys %iowait %steal > > %irq %soft %guest %gnice %idle > > Average: 18 3.80 0.00 14.78 10.08 0.00 > > 0.00 4.01 0.00 0.00 67.33 > > Average: 19 3.26 0.00 15.35 10.62 0.00 > > 0.00 4.03 0.00 0.00 66.74 > > Average: 20 3.42 0.00 14.57 10.67 0.00 > > 0.00 3.84 0.00 0.00 67.50 > > Average: 21 3.19 0.00 15.60 10.75 0.00 > > 0.00 4.16 0.00 0.00 66.30 > > Average: 22 3.58 0.00 15.15 10.66 0.00 > > 0.00 3.51 0.00 0.00 67.11 > > Average: 23 3.34 0.00 15.36 10.63 0.00 > > 0.00 4.17 0.00 0.00 66.50 > > Average: 24 3.50 0.00 14.58 10.93 0.00 > > 0.00 3.85 0.00 0.00 67.13 > > Average: 25 3.20 0.00 14.68 10.86 0.00 > > 0.00 4.31 0.00 0.00 66.95 > > Average: 26 3.27 0.00 14.80 10.70 0.00 > > 0.00 3.68 0.00 0.00 67.55 > > Average: 27 3.58 0.00 15.36 10.80 0.00 > > 0.00 3.79 0.00 0.00 66.48 > > Average: 28 3.46 0.00 15.17 10.46 0.00 > > 0.00 3.32 0.00 0.00 67.59 > > Average: 29 3.34 0.00 14.42 10.72 0.00 > > 0.00 3.34 0.00 0.00 68.18 > > Average: 30 3.34 0.00 15.08 10.70 0.00 > > 0.00 3.89 0.00 0.00 66.99 > > Average: 31 3.26 0.00 15.33 10.47 0.00 > > 0.00 3.33 0.00 0.00 67.61 > > Average: 32 3.21 0.00 14.80 10.61 0.00 > > 0.00 3.70 0.00 0.00 67.67 > > Average: 33 3.40 0.00 13.88 10.55 0.00 > > 0.00 4.02 0.00 0.00 68.15 > > Average: 34 3.74 0.00 17.41 10.61 0.00 > > 0.00 4.51 0.00 0.00 63.73 > > Average: 35 3.35 0.00 14.37 10.74 0.00 > > 0.00 3.84 0.00 0.00 67.71 > > Average: 36 0.54 0.00 1.77 0.00 0.00 > > 0.00 0.00 0.00 0.00 97.69 > > .. 
> > Average: 54 3.60 0.00 15.17 10.39 0.00 > > 0.00 4.22 0.00 0.00 66.62 > > Average: 55 3.33 0.00 14.85 10.55 0.00 > > 0.00 3.96 0.00 0.00 67.31 > > Average: 56 3.40 0.00 15.19 10.54 0.00 > > 0.00 3.74 0.00 0.00 67.13 > > Average: 57 3.41 0.00 13.98 10.78 0.00 > > 0.00 4.10 0.00 0.00 67.73 > > Average: 58 3.32 0.00 15.16 10.52 0.00 > > 0.00 4.01 0.00 0.00 66.99 > > Average: 59 3.17 0.00 15.80 10.35 0.00 > > 0.00 3.86 0.00 0.00 66.80 > > Average: 60 3.00 0.00 14.63 10.59 0.00 > > 0.00 3.97 0.00 0.00 67.80 > > Average: 61 3.34 0.00 14.70 10.66 0.00 > > 0.00 4.32 0.00 0.00 66.97 > > Average: 62 3.34 0.00 15.29 10.56 0.00 > > 0.00 3.89 0.00 0.00 66.92 > > Average: 63 3.29 0.00 14.51 10.72 0.00 > > 0.00 3.85 0.00 0.00 67.62 > > Average: 64 3.48 0.00 15.31 10.65 0.00 > > 0.00 3.97 0.00 0.00 66.60 > > Average: 65 3.34 0.00 14.36 10.80 0.00 > > 0.00 4.11 0.00 0.00 67.39 > > Average: 66 3.13 0.00 14.94 10.70 0.00 > > 0.00 4.10 0.00 0.00 67.13 > > Average: 67 3.06 0.00 15.56 10.69 0.00 > > 0.00 3.82 0.00 0.00 66.88 > > Average: 68 3.33 0.00 14.98 10.61 0.00 > > 0.00 3.81 0.00 0.00 67.27 > > Average: 69 3.20 0.00 15.43 10.70 0.00 > > 0.00 3.82 0.00 0.00 66.85 > > Average: 70 3.34 0.00 17.14 10.59 0.00 > > 0.00 3.00 0.00 0.00 65.92 > > Average: 71 3.41 0.00 14.94 10.56 0.00 > > 0.00 3.41 0.00 0.00 67.69 > > > > Perf top - > > > > 64.33% [kernel] [k] bt_iter > > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > > 4.23% [kernel] [k] _find_next_bit > > 2.40% [kernel] [k] native_queued_spin_lock_slowpath > > 1.09% [kernel] [k] sbitmap_any_bit_set > > 0.71% [kernel] [k] sbitmap_queue_clear > > 0.63% [kernel] [k] find_next_bit > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > Ah. So we're spending quite some time in trying to find a free tag. > I guess this is due to every queue starting at the same position trying > to find a free tag, which inevitably leads to a contention. IMO, the above trace means that blk_mq_in_flight() may be the bottleneck, and looks not related with tag allocation. Kashyap, could you run your performance test again after disabling iostat by the following command on all test devices and killing all utilities which may read iostat(/proc/diskstats, ...)? echo 0 > /sys/block/sdN/queue/iostat Thanks, Ming
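A small practical note for anyone reproducing this: on the kernels I am aware of, the queue attribute is spelled "iostats", so disabling per-device accounting on all test disks would look something like:

	for d in /sys/block/sd*/queue/iostats; do echo 0 > "$d"; done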
> -----Original Message----- > From: Ming Lei [mailto:ming.lei@redhat.com] > Sent: Thursday, February 8, 2018 10:23 PM > To: Hannes Reinecke > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter > Rivera; Paolo Bonzini; Laurence Oberman > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > force_blk_mq > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote: > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > >> -----Original Message----- > > >> From: Ming Lei [mailto:ming.lei@redhat.com] > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > >> To: Hannes Reinecke > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > >> Easi; Omar > > > Sandoval; > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > > Peter > > >> Rivera; Paolo Bonzini; Laurence Oberman > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > >> introduce force_blk_mq > > >> > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote: > > >>> Hi all, > > >>> > > >>> [ .. ] > > >>>>> > > >>>>> Could you share us your patch for enabling global_tags/MQ on > > >>>> megaraid_sas > > >>>>> so that I can reproduce your test? > > >>>>> > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU. > > >>>>> > > >>>>> Could you share us what the IOPS/CPU utilization effect is after > > >>>> applying the > > >>>>> patch V2? And your test script? > > >>>> Regarding CPU utilization, I need to test one more time. > > >>>> Currently system is in used. > > >>>> > > >>>> I run below fio test on total 24 SSDs expander attached. > > >>>> > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > >>>> --ioengine=libaio --rw=randread > > >>>> > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > > >>>> > > >>> This is basically what we've seen with earlier iterations. > > >> > > >> Hi Hannes, > > >> > > >> As I mentioned in another mail[1], Kashyap's patch has a big issue, > > > which > > >> causes only reply queue 0 used. > > >> > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > >> > > >> So could you guys run your performance test again after fixing the > > > patch? > > > > > > Ming - > > > > > > I tried after change you requested. Performance drop is still unresolved. > > > From 1.6 M IOPS to 770K IOPS. > > > > > > See below data. All 24 reply queue is in used correctly. 
> > > > > > IRQs / 1 second(s) > > > IRQ# TOTAL NODE0 NODE1 NAME > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > > > > > > > Average: CPU %usr %nice %sys %iowait %steal > > > %irq %soft %guest %gnice %idle > > > Average: 18 3.80 0.00 14.78 10.08 0.00 > > > 0.00 4.01 0.00 0.00 67.33 > > > Average: 19 3.26 0.00 15.35 10.62 0.00 > > > 0.00 4.03 0.00 0.00 66.74 > > > Average: 20 3.42 0.00 14.57 10.67 0.00 > > > 0.00 3.84 0.00 0.00 67.50 > > > Average: 21 3.19 0.00 15.60 10.75 0.00 > > > 0.00 4.16 0.00 0.00 66.30 > > > Average: 22 3.58 0.00 15.15 10.66 0.00 > > > 0.00 3.51 0.00 0.00 67.11 > > > Average: 23 3.34 0.00 15.36 10.63 0.00 > > > 0.00 4.17 0.00 0.00 66.50 > > > Average: 24 3.50 0.00 14.58 10.93 0.00 > > > 0.00 3.85 0.00 0.00 67.13 > > > Average: 25 3.20 0.00 14.68 10.86 0.00 > > > 0.00 4.31 0.00 0.00 66.95 > > > Average: 26 3.27 0.00 14.80 10.70 0.00 > > > 0.00 3.68 0.00 0.00 67.55 > > > Average: 27 3.58 0.00 15.36 10.80 0.00 > > > 0.00 3.79 0.00 0.00 66.48 > > > Average: 28 3.46 0.00 15.17 10.46 0.00 > > > 0.00 3.32 0.00 0.00 67.59 > > > Average: 29 3.34 0.00 14.42 10.72 0.00 > > > 0.00 3.34 0.00 0.00 68.18 > > > Average: 30 3.34 0.00 15.08 10.70 0.00 > > > 0.00 3.89 0.00 0.00 66.99 > > > Average: 31 3.26 0.00 15.33 10.47 0.00 > > > 0.00 3.33 0.00 0.00 67.61 > > > Average: 32 3.21 0.00 14.80 10.61 0.00 > > > 0.00 3.70 0.00 0.00 67.67 > > > Average: 33 3.40 0.00 13.88 10.55 0.00 > > > 0.00 4.02 0.00 0.00 68.15 > > > Average: 34 3.74 0.00 17.41 10.61 0.00 > > > 0.00 4.51 0.00 0.00 63.73 > > > Average: 35 3.35 0.00 14.37 10.74 0.00 > > > 0.00 3.84 0.00 0.00 67.71 > > > Average: 36 0.54 0.00 1.77 0.00 0.00 > > > 0.00 0.00 0.00 0.00 97.69 > > > .. 
> > > Average: 54 3.60 0.00 15.17 10.39 0.00 > > > 0.00 4.22 0.00 0.00 66.62 > > > Average: 55 3.33 0.00 14.85 10.55 0.00 > > > 0.00 3.96 0.00 0.00 67.31 > > > Average: 56 3.40 0.00 15.19 10.54 0.00 > > > 0.00 3.74 0.00 0.00 67.13 > > > Average: 57 3.41 0.00 13.98 10.78 0.00 > > > 0.00 4.10 0.00 0.00 67.73 > > > Average: 58 3.32 0.00 15.16 10.52 0.00 > > > 0.00 4.01 0.00 0.00 66.99 > > > Average: 59 3.17 0.00 15.80 10.35 0.00 > > > 0.00 3.86 0.00 0.00 66.80 > > > Average: 60 3.00 0.00 14.63 10.59 0.00 > > > 0.00 3.97 0.00 0.00 67.80 > > > Average: 61 3.34 0.00 14.70 10.66 0.00 > > > 0.00 4.32 0.00 0.00 66.97 > > > Average: 62 3.34 0.00 15.29 10.56 0.00 > > > 0.00 3.89 0.00 0.00 66.92 > > > Average: 63 3.29 0.00 14.51 10.72 0.00 > > > 0.00 3.85 0.00 0.00 67.62 > > > Average: 64 3.48 0.00 15.31 10.65 0.00 > > > 0.00 3.97 0.00 0.00 66.60 > > > Average: 65 3.34 0.00 14.36 10.80 0.00 > > > 0.00 4.11 0.00 0.00 67.39 > > > Average: 66 3.13 0.00 14.94 10.70 0.00 > > > 0.00 4.10 0.00 0.00 67.13 > > > Average: 67 3.06 0.00 15.56 10.69 0.00 > > > 0.00 3.82 0.00 0.00 66.88 > > > Average: 68 3.33 0.00 14.98 10.61 0.00 > > > 0.00 3.81 0.00 0.00 67.27 > > > Average: 69 3.20 0.00 15.43 10.70 0.00 > > > 0.00 3.82 0.00 0.00 66.85 > > > Average: 70 3.34 0.00 17.14 10.59 0.00 > > > 0.00 3.00 0.00 0.00 65.92 > > > Average: 71 3.41 0.00 14.94 10.56 0.00 > > > 0.00 3.41 0.00 0.00 67.69 > > > > > > Perf top - > > > > > > 64.33% [kernel] [k] bt_iter > > > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > > > 4.23% [kernel] [k] _find_next_bit > > > 2.40% [kernel] [k] native_queued_spin_lock_slowpath > > > 1.09% [kernel] [k] sbitmap_any_bit_set > > > 0.71% [kernel] [k] sbitmap_queue_clear > > > 0.63% [kernel] [k] find_next_bit > > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > > > Ah. So we're spending quite some time in trying to find a free tag. > > I guess this is due to every queue starting at the same position > > trying to find a free tag, which inevitably leads to a contention. > > IMO, the above trace means that blk_mq_in_flight() may be the bottleneck, > and looks not related with tag allocation. > > Kashyap, could you run your performance test again after disabling iostat by > the following command on all test devices and killing all utilities which may > read iostat(/proc/diskstats, ...)? > > echo 0 > /sys/block/sdN/queue/iostat Ming - After changing iostat = 0 , I see performance issue is resolved. Below is perf top output after iostats = 0 23.45% [kernel] [k] bt_iter 2.27% [kernel] [k] blk_mq_queue_tag_busy_iter 2.18% [kernel] [k] _find_next_bit 2.06% [megaraid_sas] [k] complete_cmd_fusion 1.87% [kernel] [k] clflush_cache_range 1.70% [kernel] [k] dma_pte_clear_level 1.56% [kernel] [k] __domain_mapping 1.55% [kernel] [k] sbitmap_queue_clear 1.30% [kernel] [k] gup_pgd_range > > Thanks, > Ming
On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > -----Original Message----- > > From: Ming Lei [mailto:ming.lei@redhat.com] > > Sent: Thursday, February 8, 2018 10:23 PM > > To: Hannes Reinecke > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph > > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > Sandoval; > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > Peter > > Rivera; Paolo Bonzini; Laurence Oberman > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > > force_blk_mq > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote: > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > >> -----Original Message----- > > > >> From: Ming Lei [mailto:ming.lei@redhat.com] > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > >> To: Hannes Reinecke > > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > > >> Easi; Omar > > > > Sandoval; > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > > > Peter > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > >> introduce force_blk_mq > > > >> > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote: > > > >>> Hi all, > > > >>> > > > >>> [ .. ] > > > >>>>> > > > >>>>> Could you share us your patch for enabling global_tags/MQ on > > > >>>> megaraid_sas > > > >>>>> so that I can reproduce your test? > > > >>>>> > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more > CPU. > > > >>>>> > > > >>>>> Could you share us what the IOPS/CPU utilization effect is after > > > >>>> applying the > > > >>>>> patch V2? And your test script? > > > >>>> Regarding CPU utilization, I need to test one more time. > > > >>>> Currently system is in used. > > > >>>> > > > >>>> I run below fio test on total 24 SSDs expander attached. > > > >>>> > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > > >>>> --ioengine=libaio --rw=randread > > > >>>> > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > > > >>>> > > > >>> This is basically what we've seen with earlier iterations. > > > >> > > > >> Hi Hannes, > > > >> > > > >> As I mentioned in another mail[1], Kashyap's patch has a big issue, > > > > which > > > >> causes only reply queue 0 used. > > > >> > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > > >> > > > >> So could you guys run your performance test again after fixing the > > > > patch? > > > > > > > > Ming - > > > > > > > > I tried after change you requested. Performance drop is still > unresolved. > > > > From 1.6 M IOPS to 770K IOPS. > > > > > > > > See below data. All 24 reply queue is in used correctly. 
> > > > > > > > IRQs / 1 second(s) > > > > IRQ# TOTAL NODE0 NODE1 NAME > > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > > > > > > > > > > Average: CPU %usr %nice %sys %iowait > %steal > > > > %irq %soft %guest %gnice %idle > > > > Average: 18 3.80 0.00 14.78 10.08 > 0.00 > > > > 0.00 4.01 0.00 0.00 67.33 > > > > Average: 19 3.26 0.00 15.35 10.62 > 0.00 > > > > 0.00 4.03 0.00 0.00 66.74 > > > > Average: 20 3.42 0.00 14.57 10.67 > 0.00 > > > > 0.00 3.84 0.00 0.00 67.50 > > > > Average: 21 3.19 0.00 15.60 10.75 > 0.00 > > > > 0.00 4.16 0.00 0.00 66.30 > > > > Average: 22 3.58 0.00 15.15 10.66 > 0.00 > > > > 0.00 3.51 0.00 0.00 67.11 > > > > Average: 23 3.34 0.00 15.36 10.63 > 0.00 > > > > 0.00 4.17 0.00 0.00 66.50 > > > > Average: 24 3.50 0.00 14.58 10.93 > 0.00 > > > > 0.00 3.85 0.00 0.00 67.13 > > > > Average: 25 3.20 0.00 14.68 10.86 > 0.00 > > > > 0.00 4.31 0.00 0.00 66.95 > > > > Average: 26 3.27 0.00 14.80 10.70 > 0.00 > > > > 0.00 3.68 0.00 0.00 67.55 > > > > Average: 27 3.58 0.00 15.36 10.80 > 0.00 > > > > 0.00 3.79 0.00 0.00 66.48 > > > > Average: 28 3.46 0.00 15.17 10.46 > 0.00 > > > > 0.00 3.32 0.00 0.00 67.59 > > > > Average: 29 3.34 0.00 14.42 10.72 > 0.00 > > > > 0.00 3.34 0.00 0.00 68.18 > > > > Average: 30 3.34 0.00 15.08 10.70 > 0.00 > > > > 0.00 3.89 0.00 0.00 66.99 > > > > Average: 31 3.26 0.00 15.33 10.47 > 0.00 > > > > 0.00 3.33 0.00 0.00 67.61 > > > > Average: 32 3.21 0.00 14.80 10.61 > 0.00 > > > > 0.00 3.70 0.00 0.00 67.67 > > > > Average: 33 3.40 0.00 13.88 10.55 > 0.00 > > > > 0.00 4.02 0.00 0.00 68.15 > > > > Average: 34 3.74 0.00 17.41 10.61 > 0.00 > > > > 0.00 4.51 0.00 0.00 63.73 > > > > Average: 35 3.35 0.00 14.37 10.74 > 0.00 > > > > 0.00 3.84 0.00 0.00 67.71 > > > > Average: 36 0.54 0.00 1.77 0.00 > 0.00 > > > > 0.00 0.00 0.00 0.00 97.69 > > > > .. 
> > > > Average: 54 3.60 0.00 15.17 10.39 > 0.00 > > > > 0.00 4.22 0.00 0.00 66.62 > > > > Average: 55 3.33 0.00 14.85 10.55 > 0.00 > > > > 0.00 3.96 0.00 0.00 67.31 > > > > Average: 56 3.40 0.00 15.19 10.54 > 0.00 > > > > 0.00 3.74 0.00 0.00 67.13 > > > > Average: 57 3.41 0.00 13.98 10.78 > 0.00 > > > > 0.00 4.10 0.00 0.00 67.73 > > > > Average: 58 3.32 0.00 15.16 10.52 > 0.00 > > > > 0.00 4.01 0.00 0.00 66.99 > > > > Average: 59 3.17 0.00 15.80 10.35 > 0.00 > > > > 0.00 3.86 0.00 0.00 66.80 > > > > Average: 60 3.00 0.00 14.63 10.59 > 0.00 > > > > 0.00 3.97 0.00 0.00 67.80 > > > > Average: 61 3.34 0.00 14.70 10.66 > 0.00 > > > > 0.00 4.32 0.00 0.00 66.97 > > > > Average: 62 3.34 0.00 15.29 10.56 > 0.00 > > > > 0.00 3.89 0.00 0.00 66.92 > > > > Average: 63 3.29 0.00 14.51 10.72 > 0.00 > > > > 0.00 3.85 0.00 0.00 67.62 > > > > Average: 64 3.48 0.00 15.31 10.65 > 0.00 > > > > 0.00 3.97 0.00 0.00 66.60 > > > > Average: 65 3.34 0.00 14.36 10.80 > 0.00 > > > > 0.00 4.11 0.00 0.00 67.39 > > > > Average: 66 3.13 0.00 14.94 10.70 > 0.00 > > > > 0.00 4.10 0.00 0.00 67.13 > > > > Average: 67 3.06 0.00 15.56 10.69 > 0.00 > > > > 0.00 3.82 0.00 0.00 66.88 > > > > Average: 68 3.33 0.00 14.98 10.61 > 0.00 > > > > 0.00 3.81 0.00 0.00 67.27 > > > > Average: 69 3.20 0.00 15.43 10.70 > 0.00 > > > > 0.00 3.82 0.00 0.00 66.85 > > > > Average: 70 3.34 0.00 17.14 10.59 > 0.00 > > > > 0.00 3.00 0.00 0.00 65.92 > > > > Average: 71 3.41 0.00 14.94 10.56 > 0.00 > > > > 0.00 3.41 0.00 0.00 67.69 > > > > > > > > Perf top - > > > > > > > > 64.33% [kernel] [k] bt_iter > > > > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > 4.23% [kernel] [k] _find_next_bit > > > > 2.40% [kernel] [k] native_queued_spin_lock_slowpath > > > > 1.09% [kernel] [k] sbitmap_any_bit_set > > > > 0.71% [kernel] [k] sbitmap_queue_clear > > > > 0.63% [kernel] [k] find_next_bit > > > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > > > > > Ah. So we're spending quite some time in trying to find a free tag. > > > I guess this is due to every queue starting at the same position > > > trying to find a free tag, which inevitably leads to a contention. > > > > IMO, the above trace means that blk_mq_in_flight() may be the > bottleneck, > > and looks not related with tag allocation. > > > > Kashyap, could you run your performance test again after disabling > iostat by > > the following command on all test devices and killing all utilities > which may > > read iostat(/proc/diskstats, ...)? > > > > echo 0 > /sys/block/sdN/queue/iostat > > Ming - After changing iostat = 0 , I see performance issue is resolved. > > Below is perf top output after iostats = 0 > > > 23.45% [kernel] [k] bt_iter > 2.27% [kernel] [k] blk_mq_queue_tag_busy_iter > 2.18% [kernel] [k] _find_next_bit > 2.06% [megaraid_sas] [k] complete_cmd_fusion > 1.87% [kernel] [k] clflush_cache_range > 1.70% [kernel] [k] dma_pte_clear_level > 1.56% [kernel] [k] __domain_mapping > 1.55% [kernel] [k] sbitmap_queue_clear > 1.30% [kernel] [k] gup_pgd_range Hi Kashyap, Thanks for your test and update. Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even though iostats is disabled, and I guess there may be utilities which are reading iostats a bit frequently. Either there is issue introduced in part_round_stats() recently since I remember that this counter should have been read at most one time during one jiffies in IO path, or the implementation of blk_mq_in_flight() can become a bit heavy in your environment. Jens may have idea about this issue. 
And I guess the lockup issue may be avoided by this approach now?

Thanks,
Ming
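To spell out why bt_iter() is so hot in the profiles above: with iostats
enabled (the /sys/block/<dev>/queue/iostats flag), the per-I/O accounting
ends up counting in-flight requests, and blk-mq answers that question by
walking the tag space of every hardware queue (the
blk_mq_queue_tag_busy_iter()/bt_iter() path that also drags in
_find_next_bit()). Below is only a rough kernel-style model of the shape
of that per-sample cost; the structure and helper names are invented for
illustration and are not the real blk-mq code:

#include <linux/bitops.h>
#include <linux/types.h>

/* Invented model of a hw queue's tag state; not struct blk_mq_hw_ctx. */
struct model_hctx {
	unsigned long	*tag_bitmap;	/* one bit per driver tag */
	unsigned int	nr_tags;
};

/* Work done per accounting sample: O(nr_hctx * nr_tags). */
static unsigned int model_in_flight(struct model_hctx *hctx,
				    unsigned int nr_hctx)
{
	unsigned int i, busy = 0;
	unsigned long tag;

	for (i = 0; i < nr_hctx; i++)			/* every hw queue */
		for_each_set_bit(tag, hctx[i].tag_bitmap,
				 hctx[i].nr_tags)	/* every busy tag */
			busy++;				/* bt_iter() analogue */

	return busy;
}

With a shared (host-wide) tag set and IOPS in the million range this walk
is repeated constantly, which fits both the bt_iter/_find_next_bit share
in the perf output and the improvement reported once iostats is disabled.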
> -----Original Message----- > From: Ming Lei [mailto:ming.lei@redhat.com] > Sent: Friday, February 9, 2018 11:01 AM > To: Kashyap Desai > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter > Rivera; Paolo Bonzini; Laurence Oberman > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > force_blk_mq > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > > -----Original Message----- > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > Sent: Thursday, February 8, 2018 10:23 PM > > > To: Hannes Reinecke > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > > Easi; Omar > > Sandoval; > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > Peter > > > Rivera; Paolo Bonzini; Laurence Oberman > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > introduce force_blk_mq > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote: > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > > >> -----Original Message----- > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com] > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > > >> To: Hannes Reinecke > > > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; > > > > >> Arun Easi; Omar > > > > > Sandoval; > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don > > > > >> Brace; > > > > > Peter > > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > >> introduce force_blk_mq > > > > >> > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote: > > > > >>> Hi all, > > > > >>> > > > > >>> [ .. ] > > > > >>>>> > > > > >>>>> Could you share us your patch for enabling global_tags/MQ on > > > > >>>> megaraid_sas > > > > >>>>> so that I can reproduce your test? > > > > >>>>> > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times > > > > >>>>>> more > > CPU. > > > > >>>>> > > > > >>>>> Could you share us what the IOPS/CPU utilization effect is > > > > >>>>> after > > > > >>>> applying the > > > > >>>>> patch V2? And your test script? > > > > >>>> Regarding CPU utilization, I need to test one more time. > > > > >>>> Currently system is in used. > > > > >>>> > > > > >>>> I run below fio test on total 24 SSDs expander attached. > > > > >>>> > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > > > >>>> --ioengine=libaio --rw=randread > > > > >>>> > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > >>>> > > > > >>> This is basically what we've seen with earlier iterations. > > > > >> > > > > >> Hi Hannes, > > > > >> > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big > > > > >> issue, > > > > > which > > > > >> causes only reply queue 0 used. > > > > >> > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > > > >> > > > > >> So could you guys run your performance test again after fixing > > > > >> the > > > > > patch? > > > > > > > > > > Ming - > > > > > > > > > > I tried after change you requested. Performance drop is still > > unresolved. > > > > > From 1.6 M IOPS to 770K IOPS. > > > > > > > > > > See below data. 
All 24 reply queue is in used correctly. > > > > > > > > > > IRQs / 1 second(s) > > > > > IRQ# TOTAL NODE0 NODE1 NAME > > > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > > > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > > > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > > > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > > > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > > > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > > > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > > > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > > > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > > > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > > > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > > > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > > > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > > > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > > > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > > > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > > > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > > > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > > > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > > > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > > > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > > > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > > > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > > > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > > > > > > > > > > > > > Average: CPU %usr %nice %sys %iowait > > %steal > > > > > %irq %soft %guest %gnice %idle > > > > > Average: 18 3.80 0.00 14.78 10.08 > > 0.00 > > > > > 0.00 4.01 0.00 0.00 67.33 > > > > > Average: 19 3.26 0.00 15.35 10.62 > > 0.00 > > > > > 0.00 4.03 0.00 0.00 66.74 > > > > > Average: 20 3.42 0.00 14.57 10.67 > > 0.00 > > > > > 0.00 3.84 0.00 0.00 67.50 > > > > > Average: 21 3.19 0.00 15.60 10.75 > > 0.00 > > > > > 0.00 4.16 0.00 0.00 66.30 > > > > > Average: 22 3.58 0.00 15.15 10.66 > > 0.00 > > > > > 0.00 3.51 0.00 0.00 67.11 > > > > > Average: 23 3.34 0.00 15.36 10.63 > > 0.00 > > > > > 0.00 4.17 0.00 0.00 66.50 > > > > > Average: 24 3.50 0.00 14.58 10.93 > > 0.00 > > > > > 0.00 3.85 0.00 0.00 67.13 > > > > > Average: 25 3.20 0.00 14.68 10.86 > > 0.00 > > > > > 0.00 4.31 0.00 0.00 66.95 > > > > > Average: 26 3.27 0.00 14.80 10.70 > > 0.00 > > > > > 0.00 3.68 0.00 0.00 67.55 > > > > > Average: 27 3.58 0.00 15.36 10.80 > > 0.00 > > > > > 0.00 3.79 0.00 0.00 66.48 > > > > > Average: 28 3.46 0.00 15.17 10.46 > > 0.00 > > > > > 0.00 3.32 0.00 0.00 67.59 > > > > > Average: 29 3.34 0.00 14.42 10.72 > > 0.00 > > > > > 0.00 3.34 0.00 0.00 68.18 > > > > > Average: 30 3.34 0.00 15.08 10.70 > > 0.00 > > > > > 0.00 3.89 0.00 0.00 66.99 > > > > > Average: 31 3.26 0.00 15.33 10.47 > > 0.00 > > > > > 0.00 3.33 0.00 0.00 67.61 > > > > > Average: 32 3.21 0.00 14.80 10.61 > > 0.00 > > > > > 0.00 3.70 0.00 0.00 67.67 > > > > > Average: 33 3.40 0.00 13.88 10.55 > > 0.00 > > > > > 0.00 4.02 0.00 0.00 68.15 > > > > > Average: 34 3.74 0.00 17.41 10.61 > > 0.00 > > > > > 0.00 4.51 0.00 0.00 63.73 > > > > > Average: 35 3.35 0.00 14.37 10.74 > > 0.00 > > > > > 0.00 3.84 0.00 0.00 67.71 > > > > > Average: 36 0.54 0.00 1.77 0.00 > > 0.00 > > > > > 0.00 0.00 0.00 0.00 97.69 > > > > > .. 
> > > > > Average: 54 3.60 0.00 15.17 10.39 > > 0.00 > > > > > 0.00 4.22 0.00 0.00 66.62 > > > > > Average: 55 3.33 0.00 14.85 10.55 > > 0.00 > > > > > 0.00 3.96 0.00 0.00 67.31 > > > > > Average: 56 3.40 0.00 15.19 10.54 > > 0.00 > > > > > 0.00 3.74 0.00 0.00 67.13 > > > > > Average: 57 3.41 0.00 13.98 10.78 > > 0.00 > > > > > 0.00 4.10 0.00 0.00 67.73 > > > > > Average: 58 3.32 0.00 15.16 10.52 > > 0.00 > > > > > 0.00 4.01 0.00 0.00 66.99 > > > > > Average: 59 3.17 0.00 15.80 10.35 > > 0.00 > > > > > 0.00 3.86 0.00 0.00 66.80 > > > > > Average: 60 3.00 0.00 14.63 10.59 > > 0.00 > > > > > 0.00 3.97 0.00 0.00 67.80 > > > > > Average: 61 3.34 0.00 14.70 10.66 > > 0.00 > > > > > 0.00 4.32 0.00 0.00 66.97 > > > > > Average: 62 3.34 0.00 15.29 10.56 > > 0.00 > > > > > 0.00 3.89 0.00 0.00 66.92 > > > > > Average: 63 3.29 0.00 14.51 10.72 > > 0.00 > > > > > 0.00 3.85 0.00 0.00 67.62 > > > > > Average: 64 3.48 0.00 15.31 10.65 > > 0.00 > > > > > 0.00 3.97 0.00 0.00 66.60 > > > > > Average: 65 3.34 0.00 14.36 10.80 > > 0.00 > > > > > 0.00 4.11 0.00 0.00 67.39 > > > > > Average: 66 3.13 0.00 14.94 10.70 > > 0.00 > > > > > 0.00 4.10 0.00 0.00 67.13 > > > > > Average: 67 3.06 0.00 15.56 10.69 > > 0.00 > > > > > 0.00 3.82 0.00 0.00 66.88 > > > > > Average: 68 3.33 0.00 14.98 10.61 > > 0.00 > > > > > 0.00 3.81 0.00 0.00 67.27 > > > > > Average: 69 3.20 0.00 15.43 10.70 > > 0.00 > > > > > 0.00 3.82 0.00 0.00 66.85 > > > > > Average: 70 3.34 0.00 17.14 10.59 > > 0.00 > > > > > 0.00 3.00 0.00 0.00 65.92 > > > > > Average: 71 3.41 0.00 14.94 10.56 > > 0.00 > > > > > 0.00 3.41 0.00 0.00 67.69 > > > > > > > > > > Perf top - > > > > > > > > > > 64.33% [kernel] [k] bt_iter > > > > > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > > 4.23% [kernel] [k] _find_next_bit > > > > > 2.40% [kernel] [k] native_queued_spin_lock_slowpath > > > > > 1.09% [kernel] [k] sbitmap_any_bit_set > > > > > 0.71% [kernel] [k] sbitmap_queue_clear > > > > > 0.63% [kernel] [k] find_next_bit > > > > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > > > > > > > Ah. So we're spending quite some time in trying to find a free tag. > > > > I guess this is due to every queue starting at the same position > > > > trying to find a free tag, which inevitably leads to a contention. > > > > > > IMO, the above trace means that blk_mq_in_flight() may be the > > bottleneck, > > > and looks not related with tag allocation. > > > > > > Kashyap, could you run your performance test again after disabling > > iostat by > > > the following command on all test devices and killing all utilities > > which may > > > read iostat(/proc/diskstats, ...)? > > > > > > echo 0 > /sys/block/sdN/queue/iostat > > > > Ming - After changing iostat = 0 , I see performance issue is resolved. > > > > Below is perf top output after iostats = 0 > > > > > > 23.45% [kernel] [k] bt_iter > > 2.27% [kernel] [k] blk_mq_queue_tag_busy_iter > > 2.18% [kernel] [k] _find_next_bit > > 2.06% [megaraid_sas] [k] complete_cmd_fusion > > 1.87% [kernel] [k] clflush_cache_range > > 1.70% [kernel] [k] dma_pte_clear_level > > 1.56% [kernel] [k] __domain_mapping > > 1.55% [kernel] [k] sbitmap_queue_clear > > 1.30% [kernel] [k] gup_pgd_range > > Hi Kashyap, > > Thanks for your test and update. > > Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even though > iostats is disabled, and I guess there may be utilities which are reading iostats > a bit frequently. I will be doing some more testing and post you my findings. 
>
> Either there is issue introduced in part_round_stats() recently since I
> remember that this counter should have been read at most one time during
> one jiffies in IO path, or the implementation of blk_mq_in_flight() can
> become a bit heavy in your environment. Jens may have idea about this
> issue.
>
> And I guess the lockup issue may be avoided by this approach now?

NO. For CPU Lock up we need irq poll interface to quit from ISR loop of
the driver.

>
> Thanks,
> Ming
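To make the lockup mode concrete: when there are fewer reply queues than
online CPUs (or only a single MSI-x vector), several queues' completions
land on one CPU, and an unbounded completion loop in hard-IRQ context can
keep that CPU busy indefinitely. The following is a hypothetical sketch
of the problematic pattern, not actual megaraid_sas code;
my_reply_queue, my_reply_pending() and my_complete_one() are invented
helpers. A bounded, irq_poll-based version of the same handler is
sketched further below in the thread.

#include <linux/interrupt.h>
#include <linux/types.h>

struct my_reply_queue;					/* invented driver context */
bool my_reply_pending(struct my_reply_queue *rq);	/* invented */
void my_complete_one(struct my_reply_queue *rq);	/* invented */

/*
 * Problem pattern: all completion work runs in hard-IRQ context with no
 * bound, so a CPU that serves many reply queues may never leave this
 * handler under sustained load and eventually trips the lockup detector.
 */
static irqreturn_t naive_isr(int irq, void *data)
{
	struct my_reply_queue *rq = data;

	while (my_reply_pending(rq))	/* unbounded drain loop */
		my_complete_one(rq);

	return IRQ_HANDLED;
}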
Hi Kashyap, On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote: > > -----Original Message----- > > From: Ming Lei [mailto:ming.lei@redhat.com] > > Sent: Friday, February 9, 2018 11:01 AM > > To: Kashyap Desai > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > Sandoval; > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > Peter > > Rivera; Paolo Bonzini; Laurence Oberman > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > > force_blk_mq > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > > > -----Original Message----- > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > Sent: Thursday, February 8, 2018 10:23 PM > > > > To: Hannes Reinecke > > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > > > Easi; Omar > > > Sandoval; > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > > Peter > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > introduce force_blk_mq > > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote: > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > > > >> -----Original Message----- > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > > > >> To: Hannes Reinecke > > > > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; > > > > > >> Arun Easi; Omar > > > > > > Sandoval; > > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don > > > > > >> Brace; > > > > > > Peter > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > >> introduce force_blk_mq > > > > > >> > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke > wrote: > > > > > >>> Hi all, > > > > > >>> > > > > > >>> [ .. ] > > > > > >>>>> > > > > > >>>>> Could you share us your patch for enabling global_tags/MQ on > > > > > >>>> megaraid_sas > > > > > >>>>> so that I can reproduce your test? > > > > > >>>>> > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times > > > > > >>>>>> more > > > CPU. > > > > > >>>>> > > > > > >>>>> Could you share us what the IOPS/CPU utilization effect is > > > > > >>>>> after > > > > > >>>> applying the > > > > > >>>>> patch V2? And your test script? > > > > > >>>> Regarding CPU utilization, I need to test one more time. > > > > > >>>> Currently system is in used. > > > > > >>>> > > > > > >>>> I run below fio test on total 24 SSDs expander attached. > > > > > >>>> > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > > > > >>>> --ioengine=libaio --rw=randread > > > > > >>>> > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > >>>> > > > > > >>> This is basically what we've seen with earlier iterations. > > > > > >> > > > > > >> Hi Hannes, > > > > > >> > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big > > > > > >> issue, > > > > > > which > > > > > >> causes only reply queue 0 used. 
> > > > > >> > > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > > > > >> > > > > > >> So could you guys run your performance test again after fixing > > > > > >> the > > > > > > patch? > > > > > > > > > > > > Ming - > > > > > > > > > > > > I tried after change you requested. Performance drop is still > > > unresolved. > > > > > > From 1.6 M IOPS to 770K IOPS. > > > > > > > > > > > > See below data. All 24 reply queue is in used correctly. > > > > > > > > > > > > IRQs / 1 second(s) > > > > > > IRQ# TOTAL NODE0 NODE1 NAME > > > > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > > > > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > > > > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > > > > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > > > > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > > > > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > > > > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > > > > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > > > > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > > > > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > > > > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > > > > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > > > > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > > > > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > > > > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > > > > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > > > > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > > > > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > > > > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > > > > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > > > > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > > > > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > > > > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > > > > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > > > > > > > > > > > > > > > > Average: CPU %usr %nice %sys %iowait > > > %steal > > > > > > %irq %soft %guest %gnice %idle > > > > > > Average: 18 3.80 0.00 14.78 10.08 > > > 0.00 > > > > > > 0.00 4.01 0.00 0.00 67.33 > > > > > > Average: 19 3.26 0.00 15.35 10.62 > > > 0.00 > > > > > > 0.00 4.03 0.00 0.00 66.74 > > > > > > Average: 20 3.42 0.00 14.57 10.67 > > > 0.00 > > > > > > 0.00 3.84 0.00 0.00 67.50 > > > > > > Average: 21 3.19 0.00 15.60 10.75 > > > 0.00 > > > > > > 0.00 4.16 0.00 0.00 66.30 > > > > > > Average: 22 3.58 0.00 15.15 10.66 > > > 0.00 > > > > > > 0.00 3.51 0.00 0.00 67.11 > > > > > > Average: 23 3.34 0.00 15.36 10.63 > > > 0.00 > > > > > > 0.00 4.17 0.00 0.00 66.50 > > > > > > Average: 24 3.50 0.00 14.58 10.93 > > > 0.00 > > > > > > 0.00 3.85 0.00 0.00 67.13 > > > > > > Average: 25 3.20 0.00 14.68 10.86 > > > 0.00 > > > > > > 0.00 4.31 0.00 0.00 66.95 > > > > > > Average: 26 3.27 0.00 14.80 10.70 > > > 0.00 > > > > > > 0.00 3.68 0.00 0.00 67.55 > > > > > > Average: 27 3.58 0.00 15.36 10.80 > > > 0.00 > > > > > > 0.00 3.79 0.00 0.00 66.48 > > > > > > Average: 28 3.46 0.00 15.17 10.46 > > > 0.00 > > > > > > 0.00 3.32 0.00 0.00 67.59 > > > > > > Average: 29 3.34 0.00 14.42 10.72 > > > 0.00 > > > > > > 0.00 3.34 0.00 0.00 68.18 > > > > > > Average: 30 3.34 0.00 15.08 10.70 > > > 0.00 > > > > > > 0.00 3.89 0.00 0.00 66.99 > > > > > > Average: 31 3.26 0.00 15.33 10.47 > > > 0.00 > > > > > > 0.00 3.33 0.00 0.00 67.61 > > > > > > Average: 32 3.21 0.00 
14.80 10.61 > > > 0.00 > > > > > > 0.00 3.70 0.00 0.00 67.67 > > > > > > Average: 33 3.40 0.00 13.88 10.55 > > > 0.00 > > > > > > 0.00 4.02 0.00 0.00 68.15 > > > > > > Average: 34 3.74 0.00 17.41 10.61 > > > 0.00 > > > > > > 0.00 4.51 0.00 0.00 63.73 > > > > > > Average: 35 3.35 0.00 14.37 10.74 > > > 0.00 > > > > > > 0.00 3.84 0.00 0.00 67.71 > > > > > > Average: 36 0.54 0.00 1.77 0.00 > > > 0.00 > > > > > > 0.00 0.00 0.00 0.00 97.69 > > > > > > .. > > > > > > Average: 54 3.60 0.00 15.17 10.39 > > > 0.00 > > > > > > 0.00 4.22 0.00 0.00 66.62 > > > > > > Average: 55 3.33 0.00 14.85 10.55 > > > 0.00 > > > > > > 0.00 3.96 0.00 0.00 67.31 > > > > > > Average: 56 3.40 0.00 15.19 10.54 > > > 0.00 > > > > > > 0.00 3.74 0.00 0.00 67.13 > > > > > > Average: 57 3.41 0.00 13.98 10.78 > > > 0.00 > > > > > > 0.00 4.10 0.00 0.00 67.73 > > > > > > Average: 58 3.32 0.00 15.16 10.52 > > > 0.00 > > > > > > 0.00 4.01 0.00 0.00 66.99 > > > > > > Average: 59 3.17 0.00 15.80 10.35 > > > 0.00 > > > > > > 0.00 3.86 0.00 0.00 66.80 > > > > > > Average: 60 3.00 0.00 14.63 10.59 > > > 0.00 > > > > > > 0.00 3.97 0.00 0.00 67.80 > > > > > > Average: 61 3.34 0.00 14.70 10.66 > > > 0.00 > > > > > > 0.00 4.32 0.00 0.00 66.97 > > > > > > Average: 62 3.34 0.00 15.29 10.56 > > > 0.00 > > > > > > 0.00 3.89 0.00 0.00 66.92 > > > > > > Average: 63 3.29 0.00 14.51 10.72 > > > 0.00 > > > > > > 0.00 3.85 0.00 0.00 67.62 > > > > > > Average: 64 3.48 0.00 15.31 10.65 > > > 0.00 > > > > > > 0.00 3.97 0.00 0.00 66.60 > > > > > > Average: 65 3.34 0.00 14.36 10.80 > > > 0.00 > > > > > > 0.00 4.11 0.00 0.00 67.39 > > > > > > Average: 66 3.13 0.00 14.94 10.70 > > > 0.00 > > > > > > 0.00 4.10 0.00 0.00 67.13 > > > > > > Average: 67 3.06 0.00 15.56 10.69 > > > 0.00 > > > > > > 0.00 3.82 0.00 0.00 66.88 > > > > > > Average: 68 3.33 0.00 14.98 10.61 > > > 0.00 > > > > > > 0.00 3.81 0.00 0.00 67.27 > > > > > > Average: 69 3.20 0.00 15.43 10.70 > > > 0.00 > > > > > > 0.00 3.82 0.00 0.00 66.85 > > > > > > Average: 70 3.34 0.00 17.14 10.59 > > > 0.00 > > > > > > 0.00 3.00 0.00 0.00 65.92 > > > > > > Average: 71 3.41 0.00 14.94 10.56 > > > 0.00 > > > > > > 0.00 3.41 0.00 0.00 67.69 > > > > > > > > > > > > Perf top - > > > > > > > > > > > > 64.33% [kernel] [k] bt_iter > > > > > > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > > > 4.23% [kernel] [k] _find_next_bit > > > > > > 2.40% [kernel] [k] > native_queued_spin_lock_slowpath > > > > > > 1.09% [kernel] [k] sbitmap_any_bit_set > > > > > > 0.71% [kernel] [k] sbitmap_queue_clear > > > > > > 0.63% [kernel] [k] find_next_bit > > > > > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > > > > > > > > > Ah. So we're spending quite some time in trying to find a free > tag. > > > > > I guess this is due to every queue starting at the same position > > > > > trying to find a free tag, which inevitably leads to a contention. > > > > > > > > IMO, the above trace means that blk_mq_in_flight() may be the > > > bottleneck, > > > > and looks not related with tag allocation. > > > > > > > > Kashyap, could you run your performance test again after disabling > > > iostat by > > > > the following command on all test devices and killing all utilities > > > which may > > > > read iostat(/proc/diskstats, ...)? > > > > > > > > echo 0 > /sys/block/sdN/queue/iostat > > > > > > Ming - After changing iostat = 0 , I see performance issue is > resolved. 
> > >
> > > Below is perf top output after iostats = 0
> > >
> > >  23.45%  [kernel]        [k] bt_iter
> > >   2.27%  [kernel]        [k] blk_mq_queue_tag_busy_iter
> > >   2.18%  [kernel]        [k] _find_next_bit
> > >   2.06%  [megaraid_sas]  [k] complete_cmd_fusion
> > >   1.87%  [kernel]        [k] clflush_cache_range
> > >   1.70%  [kernel]        [k] dma_pte_clear_level
> > >   1.56%  [kernel]        [k] __domain_mapping
> > >   1.55%  [kernel]        [k] sbitmap_queue_clear
> > >   1.30%  [kernel]        [k] gup_pgd_range
> >
> > Hi Kashyap,
> >
> > Thanks for your test and update.
> >
> > Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even though
> > iostats is disabled, and I guess there may be utilities which are
> > reading iostats a bit frequently.
>
> I will be doing some more testing and post you my findings.

I will find sometime this weekend to see if I can cook a patch to
address this issue of io accounting.

> >
> > Either there is issue introduced in part_round_stats() recently since I
> > remember that this counter should have been read at most one time during
> > one jiffies in IO path, or the implementation of blk_mq_in_flight() can
> > become a bit heavy in your environment. Jens may have idea about this
> > issue.
> >
> > And I guess the lockup issue may be avoided by this approach now?
>
> NO. For CPU Lock up we need irq poll interface to quit from ISR loop of
> the driver.

Actually, once this patchset is working, a request is basically completed
on its submission CPU, so no single CPU should end up overloaded, given
that your system has so many MSI-x vectors and enough CPU cores.

I am interested in this problem too, but I think we have to fix the io
accounting issue first. Once the accounting issue (which may be what
consumes so much CPU in the interrupt handler) is fixed, let's see if
there is still a lockup. If there is, 'perf' may tell us something; from
your previous perf trace, only the accounting symbols show up in the hot
path.

Thanks,
Ming
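On the accounting side, one generic way to keep blk_mq_in_flight()-style
sampling cheap is to maintain per-CPU in-flight counters that are bumped
when a request starts and completes, and only summed when the statistics
are actually read. The sketch below shows just that generic idea with
invented names; it is not necessarily what Ming's accounting patches do:

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Invented per-device accounting state, not the real partition stats. */
struct my_disk_stats {
	unsigned int __percpu	*in_flight;	/* from alloc_percpu(unsigned int) */
};

static inline void my_io_start(struct my_disk_stats *s)
{
	this_cpu_inc(*s->in_flight);		/* submission path */
}

static inline void my_io_done(struct my_disk_stats *s)
{
	this_cpu_dec(*s->in_flight);		/* completion path */
}

/* Readers pay O(nr_cpus) per sample instead of walking every tag of every hctx. */
static unsigned int my_in_flight(struct my_disk_stats *s)
{
	unsigned int sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(s->in_flight, cpu);

	return sum;
}

Whether the counters are kept per-CPU or simply refreshed at most once per
jiffy as Ming describes, the point is the same: the read side stops
scaling with the size of the tag space.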
On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote: > Hi Kashyap, > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote: > > > -----Original Message----- > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > Sent: Friday, February 9, 2018 11:01 AM > > > To: Kashyap Desai > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > > > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > > Sandoval; > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > Peter > > > Rivera; Paolo Bonzini; Laurence Oberman > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > > > force_blk_mq > > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > > > > -----Original Message----- > > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > Sent: Thursday, February 8, 2018 10:23 PM > > > > > To: Hannes Reinecke > > > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > > > > Easi; Omar > > > > Sandoval; > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > > > Peter > > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > introduce force_blk_mq > > > > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote: > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > > > > >> -----Original Message----- > > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > > > > >> To: Hannes Reinecke > > > > > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; > > > > > > >> Arun Easi; Omar > > > > > > > Sandoval; > > > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don > > > > > > >> Brace; > > > > > > > Peter > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > > >> introduce force_blk_mq > > > > > > >> > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke > > wrote: > > > > > > >>> Hi all, > > > > > > >>> > > > > > > >>> [ .. ] > > > > > > >>>>> > > > > > > >>>>> Could you share us your patch for enabling global_tags/MQ on > > > > > > >>>> megaraid_sas > > > > > > >>>>> so that I can reproduce your test? > > > > > > >>>>> > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times > > > > > > >>>>>> more > > > > CPU. > > > > > > >>>>> > > > > > > >>>>> Could you share us what the IOPS/CPU utilization effect is > > > > > > >>>>> after > > > > > > >>>> applying the > > > > > > >>>>> patch V2? And your test script? > > > > > > >>>> Regarding CPU utilization, I need to test one more time. > > > > > > >>>> Currently system is in used. > > > > > > >>>> > > > > > > >>>> I run below fio test on total 24 SSDs expander attached. > > > > > > >>>> > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k > > > > > > >>>> --ioengine=libaio --rw=randread > > > > > > >>>> > > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > > >>>> > > > > > > >>> This is basically what we've seen with earlier iterations. 
> > > > > > >> > > > > > > >> Hi Hannes, > > > > > > >> > > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big > > > > > > >> issue, > > > > > > > which > > > > > > >> causes only reply queue 0 used. > > > > > > >> > > > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > > > > > >> > > > > > > >> So could you guys run your performance test again after fixing > > > > > > >> the > > > > > > > patch? > > > > > > > > > > > > > > Ming - > > > > > > > > > > > > > > I tried after change you requested. Performance drop is still > > > > unresolved. > > > > > > > From 1.6 M IOPS to 770K IOPS. > > > > > > > > > > > > > > See below data. All 24 reply queue is in used correctly. > > > > > > > > > > > > > > IRQs / 1 second(s) > > > > > > > IRQ# TOTAL NODE0 NODE1 NAME > > > > > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > > > > > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > > > > > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > > > > > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > > > > > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > > > > > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > > > > > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > > > > > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > > > > > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > > > > > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > > > > > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > > > > > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > > > > > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > > > > > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > > > > > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > > > > > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > > > > > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > > > > > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > > > > > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > > > > > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > > > > > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > > > > > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > > > > > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > > > > > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > > > > > > > > > > > > > > > > > > > Average: CPU %usr %nice %sys %iowait > > > > %steal > > > > > > > %irq %soft %guest %gnice %idle > > > > > > > Average: 18 3.80 0.00 14.78 10.08 > > > > 0.00 > > > > > > > 0.00 4.01 0.00 0.00 67.33 > > > > > > > Average: 19 3.26 0.00 15.35 10.62 > > > > 0.00 > > > > > > > 0.00 4.03 0.00 0.00 66.74 > > > > > > > Average: 20 3.42 0.00 14.57 10.67 > > > > 0.00 > > > > > > > 0.00 3.84 0.00 0.00 67.50 > > > > > > > Average: 21 3.19 0.00 15.60 10.75 > > > > 0.00 > > > > > > > 0.00 4.16 0.00 0.00 66.30 > > > > > > > Average: 22 3.58 0.00 15.15 10.66 > > > > 0.00 > > > > > > > 0.00 3.51 0.00 0.00 67.11 > > > > > > > Average: 23 3.34 0.00 15.36 10.63 > > > > 0.00 > > > > > > > 0.00 4.17 0.00 0.00 66.50 > > > > > > > Average: 24 3.50 0.00 14.58 10.93 > > > > 0.00 > > > > > > > 0.00 3.85 0.00 0.00 67.13 > > > > > > > Average: 25 3.20 0.00 14.68 10.86 > > > > 0.00 > > > > > > > 0.00 4.31 0.00 0.00 66.95 > > > > > > > Average: 26 3.27 0.00 14.80 10.70 > > > > 0.00 > > > > > > > 0.00 3.68 0.00 0.00 67.55 > > > > > > > Average: 27 3.58 0.00 15.36 10.80 > > > > 0.00 > > > > > > > 0.00 3.79 0.00 0.00 66.48 > > > > > > > Average: 28 3.46 0.00 
15.17 10.46 > > > > 0.00 > > > > > > > 0.00 3.32 0.00 0.00 67.59 > > > > > > > Average: 29 3.34 0.00 14.42 10.72 > > > > 0.00 > > > > > > > 0.00 3.34 0.00 0.00 68.18 > > > > > > > Average: 30 3.34 0.00 15.08 10.70 > > > > 0.00 > > > > > > > 0.00 3.89 0.00 0.00 66.99 > > > > > > > Average: 31 3.26 0.00 15.33 10.47 > > > > 0.00 > > > > > > > 0.00 3.33 0.00 0.00 67.61 > > > > > > > Average: 32 3.21 0.00 14.80 10.61 > > > > 0.00 > > > > > > > 0.00 3.70 0.00 0.00 67.67 > > > > > > > Average: 33 3.40 0.00 13.88 10.55 > > > > 0.00 > > > > > > > 0.00 4.02 0.00 0.00 68.15 > > > > > > > Average: 34 3.74 0.00 17.41 10.61 > > > > 0.00 > > > > > > > 0.00 4.51 0.00 0.00 63.73 > > > > > > > Average: 35 3.35 0.00 14.37 10.74 > > > > 0.00 > > > > > > > 0.00 3.84 0.00 0.00 67.71 > > > > > > > Average: 36 0.54 0.00 1.77 0.00 > > > > 0.00 > > > > > > > 0.00 0.00 0.00 0.00 97.69 > > > > > > > .. > > > > > > > Average: 54 3.60 0.00 15.17 10.39 > > > > 0.00 > > > > > > > 0.00 4.22 0.00 0.00 66.62 > > > > > > > Average: 55 3.33 0.00 14.85 10.55 > > > > 0.00 > > > > > > > 0.00 3.96 0.00 0.00 67.31 > > > > > > > Average: 56 3.40 0.00 15.19 10.54 > > > > 0.00 > > > > > > > 0.00 3.74 0.00 0.00 67.13 > > > > > > > Average: 57 3.41 0.00 13.98 10.78 > > > > 0.00 > > > > > > > 0.00 4.10 0.00 0.00 67.73 > > > > > > > Average: 58 3.32 0.00 15.16 10.52 > > > > 0.00 > > > > > > > 0.00 4.01 0.00 0.00 66.99 > > > > > > > Average: 59 3.17 0.00 15.80 10.35 > > > > 0.00 > > > > > > > 0.00 3.86 0.00 0.00 66.80 > > > > > > > Average: 60 3.00 0.00 14.63 10.59 > > > > 0.00 > > > > > > > 0.00 3.97 0.00 0.00 67.80 > > > > > > > Average: 61 3.34 0.00 14.70 10.66 > > > > 0.00 > > > > > > > 0.00 4.32 0.00 0.00 66.97 > > > > > > > Average: 62 3.34 0.00 15.29 10.56 > > > > 0.00 > > > > > > > 0.00 3.89 0.00 0.00 66.92 > > > > > > > Average: 63 3.29 0.00 14.51 10.72 > > > > 0.00 > > > > > > > 0.00 3.85 0.00 0.00 67.62 > > > > > > > Average: 64 3.48 0.00 15.31 10.65 > > > > 0.00 > > > > > > > 0.00 3.97 0.00 0.00 66.60 > > > > > > > Average: 65 3.34 0.00 14.36 10.80 > > > > 0.00 > > > > > > > 0.00 4.11 0.00 0.00 67.39 > > > > > > > Average: 66 3.13 0.00 14.94 10.70 > > > > 0.00 > > > > > > > 0.00 4.10 0.00 0.00 67.13 > > > > > > > Average: 67 3.06 0.00 15.56 10.69 > > > > 0.00 > > > > > > > 0.00 3.82 0.00 0.00 66.88 > > > > > > > Average: 68 3.33 0.00 14.98 10.61 > > > > 0.00 > > > > > > > 0.00 3.81 0.00 0.00 67.27 > > > > > > > Average: 69 3.20 0.00 15.43 10.70 > > > > 0.00 > > > > > > > 0.00 3.82 0.00 0.00 66.85 > > > > > > > Average: 70 3.34 0.00 17.14 10.59 > > > > 0.00 > > > > > > > 0.00 3.00 0.00 0.00 65.92 > > > > > > > Average: 71 3.41 0.00 14.94 10.56 > > > > 0.00 > > > > > > > 0.00 3.41 0.00 0.00 67.69 > > > > > > > > > > > > > > Perf top - > > > > > > > > > > > > > > 64.33% [kernel] [k] bt_iter > > > > > > > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > > > > 4.23% [kernel] [k] _find_next_bit > > > > > > > 2.40% [kernel] [k] > > native_queued_spin_lock_slowpath > > > > > > > 1.09% [kernel] [k] sbitmap_any_bit_set > > > > > > > 0.71% [kernel] [k] sbitmap_queue_clear > > > > > > > 0.63% [kernel] [k] find_next_bit > > > > > > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > > > > > > > > > > > Ah. So we're spending quite some time in trying to find a free > > tag. > > > > > > I guess this is due to every queue starting at the same position > > > > > > trying to find a free tag, which inevitably leads to a contention. 
> > > > > > > > > > IMO, the above trace means that blk_mq_in_flight() may be the > > > > bottleneck, > > > > > and looks not related with tag allocation. > > > > > > > > > > Kashyap, could you run your performance test again after disabling > > > > iostat by > > > > > the following command on all test devices and killing all utilities > > > > which may > > > > > read iostat(/proc/diskstats, ...)? > > > > > > > > > > echo 0 > /sys/block/sdN/queue/iostat > > > > > > > > Ming - After changing iostat = 0 , I see performance issue is > > resolved. > > > > > > > > Below is perf top output after iostats = 0 > > > > > > > > > > > > 23.45% [kernel] [k] bt_iter > > > > 2.27% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > 2.18% [kernel] [k] _find_next_bit > > > > 2.06% [megaraid_sas] [k] complete_cmd_fusion > > > > 1.87% [kernel] [k] clflush_cache_range > > > > 1.70% [kernel] [k] dma_pte_clear_level > > > > 1.56% [kernel] [k] __domain_mapping > > > > 1.55% [kernel] [k] sbitmap_queue_clear > > > > 1.30% [kernel] [k] gup_pgd_range > > > > > > Hi Kashyap, > > > > > > Thanks for your test and update. > > > > > > Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even though > > > iostats is disabled, and I guess there may be utilities which are > > reading iostats > > > a bit frequently. > > > > I will be doing some more testing and post you my findings. > > I will find sometime this weekend to see if I can cook a patch to > address this issue of io accounting. Hi Kashyap, Please test the top 5 patches in the following tree to see if megaraid_sas's performance is OK: https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2 This tree is made by adding these 5 patches against patchset V2. If possible, please provide us the performance data without these patches and with these patches, together with perf trace. The top 5 patches are for addressing the io accounting issue, and which should be the main reason for your performance drop, even lockup in megaraid_sas's ISR, IMO. Thanks, Ming
> -----Original Message----- > From: Ming Lei [mailto:ming.lei@redhat.com] > Sent: Sunday, February 11, 2018 11:01 AM > To: Kashyap Desai > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter > Rivera; Paolo Bonzini; Laurence Oberman > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > force_blk_mq > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote: > > Hi Kashyap, > > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote: > > > > -----Original Message----- > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > Sent: Friday, February 9, 2018 11:01 AM > > > > To: Kashyap Desai > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > > > Easi; Omar > > > Sandoval; > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don > > > > Brace; > > > Peter > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > introduce force_blk_mq > > > > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > > > > > -----Original Message----- > > > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > > Sent: Thursday, February 8, 2018 10:23 PM > > > > > > To: Hannes Reinecke > > > > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; > > > > > > Arun Easi; Omar > > > > > Sandoval; > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don > > > > > > Brace; > > > > > Peter > > > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > > introduce force_blk_mq > > > > > > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote: > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > > > > > >> -----Original Message----- > > > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > > > > > >> To: Hannes Reinecke > > > > > > > >> Cc: Kashyap Desai; Jens Axboe; > > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike > > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > > > > > > > > Sandoval; > > > > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; > > > > > > > >> Don Brace; > > > > > > > > Peter > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global > > > > > > > >> tags & introduce force_blk_mq > > > > > > > >> > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke > > > wrote: > > > > > > > >>> Hi all, > > > > > > > >>> > > > > > > > >>> [ .. ] > > > > > > > >>>>> > > > > > > > >>>>> Could you share us your patch for enabling > > > > > > > >>>>> global_tags/MQ on > > > > > > > >>>> megaraid_sas > > > > > > > >>>>> so that I can reproduce your test? > > > > > > > >>>>> > > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 > > > > > > > >>>>>> times more > > > > > CPU. > > > > > > > >>>>> > > > > > > > >>>>> Could you share us what the IOPS/CPU utilization > > > > > > > >>>>> effect is after > > > > > > > >>>> applying the > > > > > > > >>>>> patch V2? And your test script? 
> > > > > > > >>>> Regarding CPU utilization, I need to test one more time. > > > > > > > >>>> Currently system is in used. > > > > > > > >>>> > > > > > > > >>>> I run below fio test on total 24 SSDs expander attached. > > > > > > > >>>> > > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 > > > > > > > >>>> --bs=4k --ioengine=libaio --rw=randread > > > > > > > >>>> > > > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > > > >>>> > > > > > > > >>> This is basically what we've seen with earlier iterations. > > > > > > > >> > > > > > > > >> Hi Hannes, > > > > > > > >> > > > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a > > > > > > > >> big issue, > > > > > > > > which > > > > > > > >> causes only reply queue 0 used. > > > > > > > >> > > > > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > > > > > > >> > > > > > > > >> So could you guys run your performance test again after > > > > > > > >> fixing the > > > > > > > > patch? > > > > > > > > > > > > > > > > Ming - > > > > > > > > > > > > > > > > I tried after change you requested. Performance drop is > > > > > > > > still > > > > > unresolved. > > > > > > > > From 1.6 M IOPS to 770K IOPS. > > > > > > > > > > > > > > > > See below data. All 24 reply queue is in used correctly. > > > > > > > > > > > > > > > > IRQs / 1 second(s) > > > > > > > > IRQ# TOTAL NODE0 NODE1 NAME > > > > > > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas > > > > > > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas > > > > > > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas > > > > > > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas > > > > > > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas > > > > > > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas > > > > > > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas > > > > > > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas > > > > > > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas > > > > > > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas > > > > > > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas > > > > > > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas > > > > > > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas > > > > > > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas > > > > > > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas > > > > > > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas > > > > > > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas > > > > > > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas > > > > > > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas > > > > > > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas > > > > > > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas > > > > > > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas > > > > > > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas > > > > > > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas > > > > > > > > > > > > > > > > > > > > > > > > Average: CPU %usr %nice %sys %iowait > > > > > %steal > > > > > > > > %irq %soft %guest %gnice %idle > > > > > > > > Average: 18 3.80 0.00 14.78 10.08 > > > > > 0.00 > > > > > > > > 0.00 4.01 0.00 0.00 67.33 > > > > > > > > Average: 19 3.26 0.00 15.35 10.62 > > > > > 0.00 > > > > > > > > 0.00 4.03 0.00 0.00 66.74 > > > > > > > > Average: 20 3.42 0.00 14.57 10.67 > > > > > 0.00 > > > > > > > > 0.00 3.84 0.00 0.00 67.50 > > > > > > > > Average: 21 3.19 0.00 15.60 10.75 > > 
> > > 0.00 > > > > > > > > 0.00 4.16 0.00 0.00 66.30 > > > > > > > > Average: 22 3.58 0.00 15.15 10.66 > > > > > 0.00 > > > > > > > > 0.00 3.51 0.00 0.00 67.11 > > > > > > > > Average: 23 3.34 0.00 15.36 10.63 > > > > > 0.00 > > > > > > > > 0.00 4.17 0.00 0.00 66.50 > > > > > > > > Average: 24 3.50 0.00 14.58 10.93 > > > > > 0.00 > > > > > > > > 0.00 3.85 0.00 0.00 67.13 > > > > > > > > Average: 25 3.20 0.00 14.68 10.86 > > > > > 0.00 > > > > > > > > 0.00 4.31 0.00 0.00 66.95 > > > > > > > > Average: 26 3.27 0.00 14.80 10.70 > > > > > 0.00 > > > > > > > > 0.00 3.68 0.00 0.00 67.55 > > > > > > > > Average: 27 3.58 0.00 15.36 10.80 > > > > > 0.00 > > > > > > > > 0.00 3.79 0.00 0.00 66.48 > > > > > > > > Average: 28 3.46 0.00 15.17 10.46 > > > > > 0.00 > > > > > > > > 0.00 3.32 0.00 0.00 67.59 > > > > > > > > Average: 29 3.34 0.00 14.42 10.72 > > > > > 0.00 > > > > > > > > 0.00 3.34 0.00 0.00 68.18 > > > > > > > > Average: 30 3.34 0.00 15.08 10.70 > > > > > 0.00 > > > > > > > > 0.00 3.89 0.00 0.00 66.99 > > > > > > > > Average: 31 3.26 0.00 15.33 10.47 > > > > > 0.00 > > > > > > > > 0.00 3.33 0.00 0.00 67.61 > > > > > > > > Average: 32 3.21 0.00 14.80 10.61 > > > > > 0.00 > > > > > > > > 0.00 3.70 0.00 0.00 67.67 > > > > > > > > Average: 33 3.40 0.00 13.88 10.55 > > > > > 0.00 > > > > > > > > 0.00 4.02 0.00 0.00 68.15 > > > > > > > > Average: 34 3.74 0.00 17.41 10.61 > > > > > 0.00 > > > > > > > > 0.00 4.51 0.00 0.00 63.73 > > > > > > > > Average: 35 3.35 0.00 14.37 10.74 > > > > > 0.00 > > > > > > > > 0.00 3.84 0.00 0.00 67.71 > > > > > > > > Average: 36 0.54 0.00 1.77 0.00 > > > > > 0.00 > > > > > > > > 0.00 0.00 0.00 0.00 97.69 > > > > > > > > .. > > > > > > > > Average: 54 3.60 0.00 15.17 10.39 > > > > > 0.00 > > > > > > > > 0.00 4.22 0.00 0.00 66.62 > > > > > > > > Average: 55 3.33 0.00 14.85 10.55 > > > > > 0.00 > > > > > > > > 0.00 3.96 0.00 0.00 67.31 > > > > > > > > Average: 56 3.40 0.00 15.19 10.54 > > > > > 0.00 > > > > > > > > 0.00 3.74 0.00 0.00 67.13 > > > > > > > > Average: 57 3.41 0.00 13.98 10.78 > > > > > 0.00 > > > > > > > > 0.00 4.10 0.00 0.00 67.73 > > > > > > > > Average: 58 3.32 0.00 15.16 10.52 > > > > > 0.00 > > > > > > > > 0.00 4.01 0.00 0.00 66.99 > > > > > > > > Average: 59 3.17 0.00 15.80 10.35 > > > > > 0.00 > > > > > > > > 0.00 3.86 0.00 0.00 66.80 > > > > > > > > Average: 60 3.00 0.00 14.63 10.59 > > > > > 0.00 > > > > > > > > 0.00 3.97 0.00 0.00 67.80 > > > > > > > > Average: 61 3.34 0.00 14.70 10.66 > > > > > 0.00 > > > > > > > > 0.00 4.32 0.00 0.00 66.97 > > > > > > > > Average: 62 3.34 0.00 15.29 10.56 > > > > > 0.00 > > > > > > > > 0.00 3.89 0.00 0.00 66.92 > > > > > > > > Average: 63 3.29 0.00 14.51 10.72 > > > > > 0.00 > > > > > > > > 0.00 3.85 0.00 0.00 67.62 > > > > > > > > Average: 64 3.48 0.00 15.31 10.65 > > > > > 0.00 > > > > > > > > 0.00 3.97 0.00 0.00 66.60 > > > > > > > > Average: 65 3.34 0.00 14.36 10.80 > > > > > 0.00 > > > > > > > > 0.00 4.11 0.00 0.00 67.39 > > > > > > > > Average: 66 3.13 0.00 14.94 10.70 > > > > > 0.00 > > > > > > > > 0.00 4.10 0.00 0.00 67.13 > > > > > > > > Average: 67 3.06 0.00 15.56 10.69 > > > > > 0.00 > > > > > > > > 0.00 3.82 0.00 0.00 66.88 > > > > > > > > Average: 68 3.33 0.00 14.98 10.61 > > > > > 0.00 > > > > > > > > 0.00 3.81 0.00 0.00 67.27 > > > > > > > > Average: 69 3.20 0.00 15.43 10.70 > > > > > 0.00 > > > > > > > > 0.00 3.82 0.00 0.00 66.85 > > > > > > > > Average: 70 3.34 0.00 17.14 10.59 > > > > > 0.00 > > > > > > > > 0.00 3.00 0.00 0.00 65.92 > > > > > > > > Average: 71 3.41 0.00 14.94 10.56 > > > > > 
0.00 > > > > > > > > 0.00 3.41 0.00 0.00 67.69 > > > > > > > > > > > > > > > > Perf top - > > > > > > > > > > > > > > > > 64.33% [kernel] [k] bt_iter > > > > > > > > 4.86% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > > > > > 4.23% [kernel] [k] _find_next_bit > > > > > > > > 2.40% [kernel] [k] > > > native_queued_spin_lock_slowpath > > > > > > > > 1.09% [kernel] [k] sbitmap_any_bit_set > > > > > > > > 0.71% [kernel] [k] sbitmap_queue_clear > > > > > > > > 0.63% [kernel] [k] find_next_bit > > > > > > > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > > > > > > > > > > > > > Ah. So we're spending quite some time in trying to find a > > > > > > > free > > > tag. > > > > > > > I guess this is due to every queue starting at the same > > > > > > > position trying to find a free tag, which inevitably leads to a > contention. > > > > > > > > > > > > IMO, the above trace means that blk_mq_in_flight() may be the > > > > > bottleneck, > > > > > > and looks not related with tag allocation. > > > > > > > > > > > > Kashyap, could you run your performance test again after > > > > > > disabling > > > > > iostat by > > > > > > the following command on all test devices and killing all > > > > > > utilities > > > > > which may > > > > > > read iostat(/proc/diskstats, ...)? > > > > > > > > > > > > echo 0 > /sys/block/sdN/queue/iostat > > > > > > > > > > Ming - After changing iostat = 0 , I see performance issue is > > > resolved. > > > > > > > > > > Below is perf top output after iostats = 0 > > > > > > > > > > > > > > > 23.45% [kernel] [k] bt_iter > > > > > 2.27% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > > 2.18% [kernel] [k] _find_next_bit > > > > > 2.06% [megaraid_sas] [k] complete_cmd_fusion > > > > > 1.87% [kernel] [k] clflush_cache_range > > > > > 1.70% [kernel] [k] dma_pte_clear_level > > > > > 1.56% [kernel] [k] __domain_mapping > > > > > 1.55% [kernel] [k] sbitmap_queue_clear > > > > > 1.30% [kernel] [k] gup_pgd_range > > > > > > > > Hi Kashyap, > > > > > > > > Thanks for your test and update. > > > > > > > > Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even > > > > though iostats is disabled, and I guess there may be utilities > > > > which are > > > reading iostats > > > > a bit frequently. > > > > > > I will be doing some more testing and post you my findings. > > > > I will find sometime this weekend to see if I can cook a patch to > > address this issue of io accounting. > > Hi Kashyap, > > Please test the top 5 patches in the following tree to see if megaraid_sas's > performance is OK: > > https://github.com/ming1/linux/commits/v4.15-for-next-global-tags- > v2 > > This tree is made by adding these 5 patches against patchset V2. > Ming - I applied 5 patches on top of V2 and behavior is still unchanged. Below is perf top data. (1000K IOPS) 34.58% [kernel] [k] bt_iter 2.96% [kernel] [k] sbitmap_any_bit_set 2.77% [kernel] [k] bt_iter_global_tags 1.75% [megaraid_sas] [k] complete_cmd_fusion 1.62% [kernel] [k] sbitmap_queue_clear 1.62% [kernel] [k] _raw_spin_lock 1.51% [kernel] [k] blk_mq_run_hw_queue 1.45% [kernel] [k] gup_pgd_range 1.31% [kernel] [k] irq_entries_start 1.29% fio [.] 
__fio_gettime 1.13% [kernel] [k] _raw_spin_lock_irqsave 0.95% [kernel] [k] native_queued_spin_lock_slowpath 0.92% [kernel] [k] scsi_queue_rq 0.91% [kernel] [k] blk_mq_run_hw_queues 0.85% [kernel] [k] blk_mq_get_request 0.81% [kernel] [k] switch_mm_irqs_off 0.78% [megaraid_sas] [k] megasas_build_io_fusion 0.77% [kernel] [k] __schedule 0.73% [kernel] [k] update_load_avg 0.69% [kernel] [k] fput 0.65% [kernel] [k] scsi_dispatch_cmd 0.64% fio [.] fio_libaio_event 0.53% [kernel] [k] do_io_submit 0.52% [kernel] [k] read_tsc 0.51% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion 0.51% [kernel] [k] scsi_softirq_done 0.50% [kernel] [k] kobject_put 0.50% [kernel] [k] cpuidle_enter_state 0.49% [kernel] [k] native_write_msr 0.48% fio [.] io_completed Below is perf top data with iostat=0 (1400K IOPS) 4.87% [kernel] [k] sbitmap_any_bit_set 2.93% [kernel] [k] _raw_spin_lock 2.84% [megaraid_sas] [k] complete_cmd_fusion 2.38% [kernel] [k] irq_entries_start 2.36% [kernel] [k] gup_pgd_range 2.35% [kernel] [k] blk_mq_run_hw_queue 2.30% [kernel] [k] sbitmap_queue_clear 2.01% fio [.] __fio_gettime 1.78% [kernel] [k] _raw_spin_lock_irqsave 1.51% [kernel] [k] scsi_queue_rq 1.43% [kernel] [k] blk_mq_run_hw_queues 1.36% [kernel] [k] fput 1.32% [kernel] [k] __schedule 1.31% [kernel] [k] switch_mm_irqs_off 1.29% [kernel] [k] update_load_avg 1.25% [megaraid_sas] [k] megasas_build_io_fusion 1.22% [kernel] [k] native_queued_spin_lock_slowpath 1.03% [kernel] [k] scsi_dispatch_cmd 1.03% [kernel] [k] blk_mq_get_request 0.91% fio [.] fio_libaio_event 0.89% [kernel] [k] scsi_softirq_done 0.87% [kernel] [k] kobject_put 0.86% [kernel] [k] cpuidle_enter_state 0.84% fio [.] io_completed 0.83% [kernel] [k] do_io_submit 0.83% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion 0.83% [kernel] [k] __switch_to 0.82% [kernel] [k] read_tsc 0.80% [kernel] [k] native_write_msr 0.76% [kernel] [k] aio_comp Perf data without V2 patch applied. (1600K IOPS) 5.97% [megaraid_sas] [k] complete_cmd_fusion 5.24% [kernel] [k] bt_iter 3.28% [kernel] [k] _raw_spin_lock 2.98% [kernel] [k] irq_entries_start 2.29% fio [.] __fio_gettime 2.04% [kernel] [k] scsi_queue_rq 1.92% [megaraid_sas] [k] megasas_build_io_fusion 1.61% [kernel] [k] switch_mm_irqs_off 1.59% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion 1.41% [kernel] [k] scsi_dispatch_cmd 1.33% [kernel] [k] scsi_softirq_done 1.18% [kernel] [k] gup_pgd_range 1.18% [kernel] [k] blk_mq_complete_request 1.13% [kernel] [k] blk_mq_free_request 1.05% [kernel] [k] do_io_submit 1.04% [kernel] [k] _find_next_bit 1.02% [kernel] [k] blk_mq_get_request 0.95% [megaraid_sas] [k] megasas_build_ldio_fusion 0.95% [kernel] [k] scsi_dec_host_busy 0.89% fio [.] get_io_u 0.88% [kernel] [k] entry_SYSCALL_64 0.84% [megaraid_sas] [k] megasas_queue_command 0.79% [kernel] [k] native_write_msr 0.77% [kernel] [k] read_tsc 0.73% [kernel] [k] _raw_spin_lock_irqsave 0.73% fio [.] fio_libaio_commit 0.72% [kernel] [k] kmem_cache_alloc 0.72% [kernel] [k] blkdev_direct_IO 0.69% [megaraid_sas] [k] MR_GetPhyParams 0.68% [kernel] [k] blk_mq_dequeue_f > If possible, please provide us the performance data without these patches and > with these patches, together with perf trace. > > The top 5 patches are for addressing the io accounting issue, and which > should be the main reason for your performance drop, even lockup in > megaraid_sas's ISR, IMO. I think performance drop is different issue. May be a side effect of the patch set. Even though we fix this perf issue, cpu lock up is completely different issue. 
Regarding the CPU lockup, there was a similar discussion and folks found
irq_poll to be a good method to resolve lockups. I am not sure why the
NVMe driver did not opt for irq_poll, but there was extensive discussion,
and I am also seeing CPU lockups mainly because multiple completion/reply
queues are tied to a single CPU. irq_poll has a weighting method to quit
the ISR, and that is how we can avoid the lockup.

http://lists.infradead.org/pipermail/linux-nvme/2017-January/007724.html

>
> Thanks,
> Ming
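For reference, this is roughly what that irq_poll weighting looks like
from a driver's point of view: the hard-IRQ handler only masks the queue
and schedules a softirq-context poll callback, and the callback handles
at most 'budget' completions per pass, either finishing (and re-enabling
the interrupt) or getting rescheduled. This is a hypothetical sketch
against the lib/irq_poll.c interface (irq_poll_init / irq_poll_sched /
irq_poll_complete), not a proposed megaraid_sas change; the my_* context
and helpers are the same invented ones as in the earlier sketch, and the
doorbell masking is an assumption about the hardware.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/types.h>

#define MY_IRQ_POLL_WEIGHT	64	/* max completions per poll pass */

struct my_reply_queue {
	struct irq_poll	iop;
	void __iomem	*doorbell;	/* assumed to mask/unmask this queue */
	/* ... completion ring, indices, etc. ... */
};

bool my_reply_pending(struct my_reply_queue *rq);	/* invented, as before */
void my_complete_one(struct my_reply_queue *rq);	/* invented, as before */

/* Hard-IRQ handler: no completion work here, just hand off to irq_poll. */
static irqreturn_t my_isr(int irq, void *data)
{
	struct my_reply_queue *rq = data;

	writel(0, rq->doorbell);	/* assumption: 0 masks the queue */
	irq_poll_sched(&rq->iop);
	return IRQ_HANDLED;
}

/* Poll callback, softirq context, bounded by 'budget' (the weight). */
static int my_irqpoll(struct irq_poll *iop, int budget)
{
	struct my_reply_queue *rq = container_of(iop, struct my_reply_queue, iop);
	int done = 0;

	while (done < budget && my_reply_pending(rq)) {
		my_complete_one(rq);
		done++;
	}

	if (done < budget) {
		/* Ring drained: stop polling and unmask the interrupt. */
		irq_poll_complete(iop);
		writel(1, rq->doorbell);	/* assumption: 1 unmasks */
	}

	/*
	 * If done == budget, the irq_poll softirq reschedules this callback,
	 * so no CPU is ever trapped in an unbounded completion loop.
	 */
	return done;
}

/* Per-queue setup, e.g. during controller init: */
/* irq_poll_init(&rq->iop, MY_IRQ_POLL_WEIGHT, my_irqpoll); */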
Hi Kashyap, On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote: > > -----Original Message----- > > From: Ming Lei [mailto:ming.lei@redhat.com] > > Sent: Sunday, February 11, 2018 11:01 AM > > To: Kashyap Desai > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > Sandoval; > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > Peter > > Rivera; Paolo Bonzini; Laurence Oberman > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > > force_blk_mq > > > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote: > > > Hi Kashyap, > > > > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote: > > > > > -----Original Message----- > > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > Sent: Friday, February 9, 2018 11:01 AM > > > > > To: Kashyap Desai > > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > > > > Easi; Omar > > > > Sandoval; > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don > > > > > Brace; > > > > Peter > > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > introduce force_blk_mq > > > > > > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > > > > > > -----Original Message----- > > > > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > > > Sent: Thursday, February 8, 2018 10:23 PM > > > > > > > To: Hannes Reinecke > > > > > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; > > > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; > > > > > > > Arun Easi; Omar > > > > > > Sandoval; > > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don > > > > > > > Brace; > > > > > > Peter > > > > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > > > introduce force_blk_mq > > > > > > > > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke > wrote: > > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > > > > > > >> -----Original Message----- > > > > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > > > > > > >> To: Hannes Reinecke > > > > > > > > >> Cc: Kashyap Desai; Jens Axboe; > > > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike > > > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > > > > > > > > > Sandoval; > > > > > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; > > > > > > > > >> Don Brace; > > > > > > > > > Peter > > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global > > > > > > > > >> tags & introduce force_blk_mq > > > > > > > > >> > > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke > > > > wrote: > > > > > > > > >>> Hi all, > > > > > > > > >>> > > > > > > > > >>> [ .. ] > > > > > > > > >>>>> > > > > > > > > >>>>> Could you share us your patch for enabling > > > > > > > > >>>>> global_tags/MQ on > > > > > > > > >>>> megaraid_sas > > > > > > > > >>>>> so that I can reproduce your test? > > > > > > > > >>>>> > > > > > > > > >>>>>> See below perf top data. 
"bt_iter" is consuming 4 > > > > > > > > >>>>>> times more > > > > > > CPU. > > > > > > > > >>>>> > > > > > > > > >>>>> Could you share us what the IOPS/CPU utilization > > > > > > > > >>>>> effect is after > > > > > > > > >>>> applying the > > > > > > > > >>>>> patch V2? And your test script? > > > > > > > > >>>> Regarding CPU utilization, I need to test one more > time. > > > > > > > > >>>> Currently system is in used. > > > > > > > > >>>> > > > > > > > > >>>> I run below fio test on total 24 SSDs expander > attached. > > > > > > > > >>>> > > > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 > > > > > > > > >>>> --bs=4k --ioengine=libaio --rw=randread > > > > > > > > >>>> > > > > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > > > > >>>> > > > > > > > > >>> This is basically what we've seen with earlier > iterations. > > > > > > > > >> > > > > > > > > >> Hi Hannes, > > > > > > > > >> > > > > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a > > > > > > > > >> big issue, > > > > > > > > > which > > > > > > > > >> causes only reply queue 0 used. > > > > > > > > >> > > > > > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > > > > > > > >> > > > > > > > > >> So could you guys run your performance test again after > > > > > > > > >> fixing the > > > > > > > > > patch? > > > > > > > > > > > > > > > > > > Ming - > > > > > > > > > > > > > > > > > > I tried after change you requested. Performance drop is > > > > > > > > > still > > > > > > unresolved. > > > > > > > > > From 1.6 M IOPS to 770K IOPS. > > > > > > > > > > > > > > > > > > See below data. All 24 reply queue is in used correctly. > > > > > > > > > > > > > > > > > > IRQs / 1 second(s) > > > > > > > > > IRQ# TOTAL NODE0 NODE1 NAME > > > > > > > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge > megasas > > > > > > > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge > megasas > > > > > > > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge > megasas > > > > > > > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge > megasas > > > > > > > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge > megasas > > > > > > > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge > megasas > > > > > > > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge > megasas > > > > > > > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge > megasas > > > > > > > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge > megasas > > > > > > > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge > megasas > > > > > > > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge > megasas > > > > > > > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge > megasas > > > > > > > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge > megasas > > > > > > > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge > megasas > > > > > > > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge > megasas > > > > > > > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge > megasas > > > > > > > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge > megasas > > > > > > > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge > megasas > > > > > > > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge > megasas > > > > > > > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge > megasas > > > > > > > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge > megasas > > > > > > > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge > megasas > > > > > > > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge > megasas > > > > > > > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge > megasas > > > > > > > > > > > > > > > > > > > > > > > > > > > 
Average: CPU %usr %nice %sys > %iowait > > > > > > %steal > > > > > > > > > %irq %soft %guest %gnice %idle > > > > > > > > > Average: 18 3.80 0.00 14.78 > 10.08 > > > > > > 0.00 > > > > > > > > > 0.00 4.01 0.00 0.00 67.33 > > > > > > > > > Average: 19 3.26 0.00 15.35 > 10.62 > > > > > > 0.00 > > > > > > > > > 0.00 4.03 0.00 0.00 66.74 > > > > > > > > > Average: 20 3.42 0.00 14.57 > 10.67 > > > > > > 0.00 > > > > > > > > > 0.00 3.84 0.00 0.00 67.50 > > > > > > > > > Average: 21 3.19 0.00 15.60 > 10.75 > > > > > > 0.00 > > > > > > > > > 0.00 4.16 0.00 0.00 66.30 > > > > > > > > > Average: 22 3.58 0.00 15.15 > 10.66 > > > > > > 0.00 > > > > > > > > > 0.00 3.51 0.00 0.00 67.11 > > > > > > > > > Average: 23 3.34 0.00 15.36 > 10.63 > > > > > > 0.00 > > > > > > > > > 0.00 4.17 0.00 0.00 66.50 > > > > > > > > > Average: 24 3.50 0.00 14.58 > 10.93 > > > > > > 0.00 > > > > > > > > > 0.00 3.85 0.00 0.00 67.13 > > > > > > > > > Average: 25 3.20 0.00 14.68 > 10.86 > > > > > > 0.00 > > > > > > > > > 0.00 4.31 0.00 0.00 66.95 > > > > > > > > > Average: 26 3.27 0.00 14.80 > 10.70 > > > > > > 0.00 > > > > > > > > > 0.00 3.68 0.00 0.00 67.55 > > > > > > > > > Average: 27 3.58 0.00 15.36 > 10.80 > > > > > > 0.00 > > > > > > > > > 0.00 3.79 0.00 0.00 66.48 > > > > > > > > > Average: 28 3.46 0.00 15.17 > 10.46 > > > > > > 0.00 > > > > > > > > > 0.00 3.32 0.00 0.00 67.59 > > > > > > > > > Average: 29 3.34 0.00 14.42 > 10.72 > > > > > > 0.00 > > > > > > > > > 0.00 3.34 0.00 0.00 68.18 > > > > > > > > > Average: 30 3.34 0.00 15.08 > 10.70 > > > > > > 0.00 > > > > > > > > > 0.00 3.89 0.00 0.00 66.99 > > > > > > > > > Average: 31 3.26 0.00 15.33 > 10.47 > > > > > > 0.00 > > > > > > > > > 0.00 3.33 0.00 0.00 67.61 > > > > > > > > > Average: 32 3.21 0.00 14.80 > 10.61 > > > > > > 0.00 > > > > > > > > > 0.00 3.70 0.00 0.00 67.67 > > > > > > > > > Average: 33 3.40 0.00 13.88 > 10.55 > > > > > > 0.00 > > > > > > > > > 0.00 4.02 0.00 0.00 68.15 > > > > > > > > > Average: 34 3.74 0.00 17.41 > 10.61 > > > > > > 0.00 > > > > > > > > > 0.00 4.51 0.00 0.00 63.73 > > > > > > > > > Average: 35 3.35 0.00 14.37 > 10.74 > > > > > > 0.00 > > > > > > > > > 0.00 3.84 0.00 0.00 67.71 > > > > > > > > > Average: 36 0.54 0.00 1.77 > 0.00 > > > > > > 0.00 > > > > > > > > > 0.00 0.00 0.00 0.00 97.69 > > > > > > > > > .. 
> > > > > > > > > Average: 54 3.60 0.00 15.17 > 10.39 > > > > > > 0.00 > > > > > > > > > 0.00 4.22 0.00 0.00 66.62 > > > > > > > > > Average: 55 3.33 0.00 14.85 > 10.55 > > > > > > 0.00 > > > > > > > > > 0.00 3.96 0.00 0.00 67.31 > > > > > > > > > Average: 56 3.40 0.00 15.19 > 10.54 > > > > > > 0.00 > > > > > > > > > 0.00 3.74 0.00 0.00 67.13 > > > > > > > > > Average: 57 3.41 0.00 13.98 > 10.78 > > > > > > 0.00 > > > > > > > > > 0.00 4.10 0.00 0.00 67.73 > > > > > > > > > Average: 58 3.32 0.00 15.16 > 10.52 > > > > > > 0.00 > > > > > > > > > 0.00 4.01 0.00 0.00 66.99 > > > > > > > > > Average: 59 3.17 0.00 15.80 > 10.35 > > > > > > 0.00 > > > > > > > > > 0.00 3.86 0.00 0.00 66.80 > > > > > > > > > Average: 60 3.00 0.00 14.63 > 10.59 > > > > > > 0.00 > > > > > > > > > 0.00 3.97 0.00 0.00 67.80 > > > > > > > > > Average: 61 3.34 0.00 14.70 > 10.66 > > > > > > 0.00 > > > > > > > > > 0.00 4.32 0.00 0.00 66.97 > > > > > > > > > Average: 62 3.34 0.00 15.29 > 10.56 > > > > > > 0.00 > > > > > > > > > 0.00 3.89 0.00 0.00 66.92 > > > > > > > > > Average: 63 3.29 0.00 14.51 > 10.72 > > > > > > 0.00 > > > > > > > > > 0.00 3.85 0.00 0.00 67.62 > > > > > > > > > Average: 64 3.48 0.00 15.31 > 10.65 > > > > > > 0.00 > > > > > > > > > 0.00 3.97 0.00 0.00 66.60 > > > > > > > > > Average: 65 3.34 0.00 14.36 > 10.80 > > > > > > 0.00 > > > > > > > > > 0.00 4.11 0.00 0.00 67.39 > > > > > > > > > Average: 66 3.13 0.00 14.94 > 10.70 > > > > > > 0.00 > > > > > > > > > 0.00 4.10 0.00 0.00 67.13 > > > > > > > > > Average: 67 3.06 0.00 15.56 > 10.69 > > > > > > 0.00 > > > > > > > > > 0.00 3.82 0.00 0.00 66.88 > > > > > > > > > Average: 68 3.33 0.00 14.98 > 10.61 > > > > > > 0.00 > > > > > > > > > 0.00 3.81 0.00 0.00 67.27 > > > > > > > > > Average: 69 3.20 0.00 15.43 > 10.70 > > > > > > 0.00 > > > > > > > > > 0.00 3.82 0.00 0.00 66.85 > > > > > > > > > Average: 70 3.34 0.00 17.14 > 10.59 > > > > > > 0.00 > > > > > > > > > 0.00 3.00 0.00 0.00 65.92 > > > > > > > > > Average: 71 3.41 0.00 14.94 > 10.56 > > > > > > 0.00 > > > > > > > > > 0.00 3.41 0.00 0.00 67.69 > > > > > > > > > > > > > > > > > > Perf top - > > > > > > > > > > > > > > > > > > 64.33% [kernel] [k] bt_iter > > > > > > > > > 4.86% [kernel] [k] > blk_mq_queue_tag_busy_iter > > > > > > > > > 4.23% [kernel] [k] _find_next_bit > > > > > > > > > 2.40% [kernel] [k] > > > > native_queued_spin_lock_slowpath > > > > > > > > > 1.09% [kernel] [k] sbitmap_any_bit_set > > > > > > > > > 0.71% [kernel] [k] sbitmap_queue_clear > > > > > > > > > 0.63% [kernel] [k] find_next_bit > > > > > > > > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > > > > > > > > > > > > > > > Ah. So we're spending quite some time in trying to find a > > > > > > > > free > > > > tag. > > > > > > > > I guess this is due to every queue starting at the same > > > > > > > > position trying to find a free tag, which inevitably leads > to a > > contention. > > > > > > > > > > > > > > IMO, the above trace means that blk_mq_in_flight() may be the > > > > > > bottleneck, > > > > > > > and looks not related with tag allocation. > > > > > > > > > > > > > > Kashyap, could you run your performance test again after > > > > > > > disabling > > > > > > iostat by > > > > > > > the following command on all test devices and killing all > > > > > > > utilities > > > > > > which may > > > > > > > read iostat(/proc/diskstats, ...)? > > > > > > > > > > > > > > echo 0 > /sys/block/sdN/queue/iostat > > > > > > > > > > > > Ming - After changing iostat = 0 , I see performance issue is > > > > resolved. 
> > > > > > > > > > > > Below is perf top output after iostats = 0 > > > > > > > > > > > > > > > > > > 23.45% [kernel] [k] bt_iter > > > > > > 2.27% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > > > 2.18% [kernel] [k] _find_next_bit > > > > > > 2.06% [megaraid_sas] [k] complete_cmd_fusion > > > > > > 1.87% [kernel] [k] clflush_cache_range > > > > > > 1.70% [kernel] [k] dma_pte_clear_level > > > > > > 1.56% [kernel] [k] __domain_mapping > > > > > > 1.55% [kernel] [k] sbitmap_queue_clear > > > > > > 1.30% [kernel] [k] gup_pgd_range > > > > > > > > > > Hi Kashyap, > > > > > > > > > > Thanks for your test and update. > > > > > > > > > > Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even > > > > > though iostats is disabled, and I guess there may be utilities > > > > > which are > > > > reading iostats > > > > > a bit frequently. > > > > > > > > I will be doing some more testing and post you my findings. > > > > > > I will find sometime this weekend to see if I can cook a patch to > > > address this issue of io accounting. > > > > Hi Kashyap, > > > > Please test the top 5 patches in the following tree to see if > megaraid_sas's > > performance is OK: > > > > https://github.com/ming1/linux/commits/v4.15-for-next-global-tags- > > v2 > > > > This tree is made by adding these 5 patches against patchset V2. > > > > Ming - > I applied 5 patches on top of V2 and behavior is still unchanged. Below is > perf top data. (1000K IOPS) > > 34.58% [kernel] [k] bt_iter > 2.96% [kernel] [k] sbitmap_any_bit_set > 2.77% [kernel] [k] bt_iter_global_tags > 1.75% [megaraid_sas] [k] complete_cmd_fusion > 1.62% [kernel] [k] sbitmap_queue_clear > 1.62% [kernel] [k] _raw_spin_lock > 1.51% [kernel] [k] blk_mq_run_hw_queue > 1.45% [kernel] [k] gup_pgd_range > 1.31% [kernel] [k] irq_entries_start > 1.29% fio [.] __fio_gettime > 1.13% [kernel] [k] _raw_spin_lock_irqsave > 0.95% [kernel] [k] native_queued_spin_lock_slowpath > 0.92% [kernel] [k] scsi_queue_rq > 0.91% [kernel] [k] blk_mq_run_hw_queues > 0.85% [kernel] [k] blk_mq_get_request > 0.81% [kernel] [k] switch_mm_irqs_off > 0.78% [megaraid_sas] [k] megasas_build_io_fusion > 0.77% [kernel] [k] __schedule > 0.73% [kernel] [k] update_load_avg > 0.69% [kernel] [k] fput > 0.65% [kernel] [k] scsi_dispatch_cmd > 0.64% fio [.] fio_libaio_event > 0.53% [kernel] [k] do_io_submit > 0.52% [kernel] [k] read_tsc > 0.51% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion > 0.51% [kernel] [k] scsi_softirq_done > 0.50% [kernel] [k] kobject_put > 0.50% [kernel] [k] cpuidle_enter_state > 0.49% [kernel] [k] native_write_msr > 0.48% fio [.] io_completed > > Below is perf top data with iostat=0 (1400K IOPS) > > 4.87% [kernel] [k] sbitmap_any_bit_set > 2.93% [kernel] [k] _raw_spin_lock > 2.84% [megaraid_sas] [k] complete_cmd_fusion > 2.38% [kernel] [k] irq_entries_start > 2.36% [kernel] [k] gup_pgd_range > 2.35% [kernel] [k] blk_mq_run_hw_queue > 2.30% [kernel] [k] sbitmap_queue_clear > 2.01% fio [.] __fio_gettime > 1.78% [kernel] [k] _raw_spin_lock_irqsave > 1.51% [kernel] [k] scsi_queue_rq > 1.43% [kernel] [k] blk_mq_run_hw_queues > 1.36% [kernel] [k] fput > 1.32% [kernel] [k] __schedule > 1.31% [kernel] [k] switch_mm_irqs_off > 1.29% [kernel] [k] update_load_avg > 1.25% [megaraid_sas] [k] megasas_build_io_fusion > 1.22% [kernel] [k] > native_queued_spin_lock_slowpath > 1.03% [kernel] [k] scsi_dispatch_cmd > 1.03% [kernel] [k] blk_mq_get_request > 0.91% fio [.] 
fio_libaio_event > 0.89% [kernel] [k] scsi_softirq_done > 0.87% [kernel] [k] kobject_put > 0.86% [kernel] [k] cpuidle_enter_state > 0.84% fio [.] io_completed > 0.83% [kernel] [k] do_io_submit > 0.83% [megaraid_sas] [k] > megasas_build_and_issue_cmd_fusion > 0.83% [kernel] [k] __switch_to > 0.82% [kernel] [k] read_tsc > 0.80% [kernel] [k] native_write_msr > 0.76% [kernel] [k] aio_comp > > > Perf data without V2 patch applied. (1600K IOPS) > > 5.97% [megaraid_sas] [k] complete_cmd_fusion > 5.24% [kernel] [k] bt_iter > 3.28% [kernel] [k] _raw_spin_lock > 2.98% [kernel] [k] irq_entries_start > 2.29% fio [.] __fio_gettime > 2.04% [kernel] [k] scsi_queue_rq > 1.92% [megaraid_sas] [k] megasas_build_io_fusion > 1.61% [kernel] [k] switch_mm_irqs_off > 1.59% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion > 1.41% [kernel] [k] scsi_dispatch_cmd > 1.33% [kernel] [k] scsi_softirq_done > 1.18% [kernel] [k] gup_pgd_range > 1.18% [kernel] [k] blk_mq_complete_request > 1.13% [kernel] [k] blk_mq_free_request > 1.05% [kernel] [k] do_io_submit > 1.04% [kernel] [k] _find_next_bit > 1.02% [kernel] [k] blk_mq_get_request > 0.95% [megaraid_sas] [k] megasas_build_ldio_fusion > 0.95% [kernel] [k] scsi_dec_host_busy > 0.89% fio [.] get_io_u > 0.88% [kernel] [k] entry_SYSCALL_64 > 0.84% [megaraid_sas] [k] megasas_queue_command > 0.79% [kernel] [k] native_write_msr > 0.77% [kernel] [k] read_tsc > 0.73% [kernel] [k] _raw_spin_lock_irqsave > 0.73% fio [.] fio_libaio_commit > 0.72% [kernel] [k] kmem_cache_alloc > 0.72% [kernel] [k] blkdev_direct_IO > 0.69% [megaraid_sas] [k] MR_GetPhyParams > 0.68% [kernel] [k] blk_mq_dequeue_f The above data is very helpful to understand the issue, great thanks! With this patchset V2 and the 5 patches, if iostats is set as 0, IOPS is 1400K, but 1600K IOPS can be reached without all these patches with iostats as 1. BTW, could you share us what the machine is? ARM64? I saw ARM64's cache coherence performance is bad before. In the dual socket system(each socket has 8 X86 CPU cores) I tested, only ~0.5% IOPS drop can be observed after the 5 patches are applied on V2 in null_blk test, which is described in commit log. Looks it means single sbitmap can't perform well under MQ's case in which there will be much more concurrent submissions and completions. In case of single hw queue(current linus tree), one hctx->run_work only allows one __blk_mq_run_hw_queue() running at 'async' mode, and reply queues are used in round-robin way, which may cause contention on the single sbitmap too, especially io accounting may consume a bit much more CPU, I guess that may contribute some on the CPU lockup. Could you run your test without V2 patches by setting 'iostats' as 0? and could you share us what the .can_queue is in this HBA? > > > > If possible, please provide us the performance data without these > patches and > > with these patches, together with perf trace. > > > > The top 5 patches are for addressing the io accounting issue, and which > > should be the main reason for your performance drop, even lockup in > > megaraid_sas's ISR, IMO. > > I think performance drop is different issue. May be a side effect of the > patch set. Even though we fix this perf issue, cpu lock up is completely > different issue. The performance drop is caused by the global data structure of sbitmap which is accessed from all CPUs concurrently. > Regarding cpu lock up, there was similar discussion and folks are finding > irq poll is good method to resolve lockup. 
Not sure why NVME driver did > not opted irq_poll, but there was extensive discussion and I am also NVMe's hw queues do not use host-wide tags, so it does not have this issue. > seeing cpu lock up mainly due to multiple completion queue/reply queue is > tied to single CPU. We have weighing method in irq poll to quit ISR and > that is the way we can avoid lock-up. > http://lists.infradead.org/pipermail/linux-nvme/2017-January/007724.html This patch can make sure that a request is always completed on its submission CPU, but the contention on the global sbitmap is too high, and that is what causes the performance drop. This now looks like a really interesting topic for discussion. Thanks, Ming
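To make the contention point concrete, here is a small illustrative sketch using the sbitmap API directly (this is not blk-mq code, and names such as host_tags are made up for the example): with host-wide tags there is a single bitmap for the whole HBA, so tag allocation and freeing from every CPU touches the same sbitmap words and cachelines, whereas the existing per-hctx scheme splits that traffic across one bitmap per hw queue.

#include <linux/gfp.h>
#include <linux/sbitmap.h>
#include <linux/smp.h>

static struct sbitmap_queue host_tags;  /* one bitmap shared by all hw queues */

static int host_tags_init(unsigned int can_queue, int numa_node)
{
        /* depth = .can_queue, default shift, no round-robin allocation */
        return sbitmap_queue_init_node(&host_tags, can_queue, -1, false,
                                       GFP_KERNEL, numa_node);
}

static int host_tag_get(void)
{
        /* every submitting CPU contends on the same shared bitmap */
        return __sbitmap_queue_get(&host_tags); /* -1 when exhausted */
}

static void host_tag_put(int tag)
{
        /* completions from any CPU dirty the same cachelines again */
        sbitmap_queue_clear(&host_tags, tag, raw_smp_processor_id());
}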
> -----Original Message----- > From: Ming Lei [mailto:ming.lei@redhat.com] > Sent: Tuesday, February 13, 2018 6:11 AM > To: Kashyap Desai > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter > Rivera; Paolo Bonzini; Laurence Oberman > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > force_blk_mq > > Hi Kashyap, > > On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote: > > > -----Original Message----- > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > Sent: Sunday, February 11, 2018 11:01 AM > > > To: Kashyap Desai > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun > > > Easi; Omar > > Sandoval; > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > Peter > > > Rivera; Paolo Bonzini; Laurence Oberman > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > introduce force_blk_mq > > > > > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote: > > > > Hi Kashyap, > > > > > > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote: > > > > > > -----Original Message----- > > > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > > Sent: Friday, February 9, 2018 11:01 AM > > > > > > To: Kashyap Desai > > > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; > > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; > > > > > > Arun Easi; Omar > > > > > Sandoval; > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don > > > > > > Brace; > > > > > Peter > > > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > > introduce force_blk_mq > > > > > > > > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > > > > > > > -----Original Message----- > > > > > > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > > > > Sent: Thursday, February 8, 2018 10:23 PM > > > > > > > > To: Hannes Reinecke > > > > > > > > Cc: Kashyap Desai; Jens Axboe; > > > > > > > > linux-block@vger.kernel.org; Christoph Hellwig; Mike > > > > > > > > Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > > > > > > > Sandoval; > > > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; > > > > > > > > Don Brace; > > > > > > > Peter > > > > > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global > > > > > > > > tags & introduce force_blk_mq > > > > > > > > > > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke > > wrote: > > > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > > > > > > > >> -----Original Message----- > > > > > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com] > > > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > > > > > > > >> To: Hannes Reinecke > > > > > > > > > >> Cc: Kashyap Desai; Jens Axboe; > > > > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike > > > > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar > > > > > > > > > > Sandoval; > > > > > > > > > >> Martin K . 
Petersen; James Bottomley; Christoph > > > > > > > > > >> Hellwig; Don Brace; > > > > > > > > > > Peter > > > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support > > > > > > > > > >> global tags & introduce force_blk_mq > > > > > > > > > >> > > > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes > > > > > > > > > >> Reinecke > > > > > wrote: > > > > > > > > > >>> Hi all, > > > > > > > > > >>> > > > > > > > > > >>> [ .. ] > > > > > > > > > >>>>> > > > > > > > > > >>>>> Could you share us your patch for enabling > > > > > > > > > >>>>> global_tags/MQ on > > > > > > > > > >>>> megaraid_sas > > > > > > > > > >>>>> so that I can reproduce your test? > > > > > > > > > >>>>> > > > > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 > > > > > > > > > >>>>>> times more > > > > > > > CPU. > > > > > > > > > >>>>> > > > > > > > > > >>>>> Could you share us what the IOPS/CPU utilization > > > > > > > > > >>>>> effect is after > > > > > > > > > >>>> applying the > > > > > > > > > >>>>> patch V2? And your test script? > > > > > > > > > >>>> Regarding CPU utilization, I need to test one more > > time. > > > > > > > > > >>>> Currently system is in used. > > > > > > > > > >>>> > > > > > > > > > >>>> I run below fio test on total 24 SSDs expander > > attached. > > > > > > > > > >>>> > > > > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread > > > > > > > > > >>>> --iodepth=64 --bs=4k --ioengine=libaio > > > > > > > > > >>>> --rw=randread > > > > > > > > > >>>> > > > > > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs. > > > > > > > > > >>>> > > > > > > > > > >>> This is basically what we've seen with earlier > > iterations. > > > > > > > > > >> > > > > > > > > > >> Hi Hannes, > > > > > > > > > >> > > > > > > > > > >> As I mentioned in another mail[1], Kashyap's patch > > > > > > > > > >> has a big issue, > > > > > > > > > > which > > > > > > > > > >> causes only reply queue 0 used. > > > > > > > > > >> > > > > > > > > > >> [1] > > > > > > > > > >> https://marc.info/?l=linux-scsi&m=151793204014631&w=2 > > > > > > > > > >> > > > > > > > > > >> So could you guys run your performance test again > > > > > > > > > >> after fixing the > > > > > > > > > > patch? > > > > > > > > > > > > > > > > > > > > Ming - > > > > > > > > > > > > > > > > > > > > I tried after change you requested. Performance drop > > > > > > > > > > is still > > > > > > > unresolved. > > > > > > > > > > From 1.6 M IOPS to 770K IOPS. > > > > > > > > > > > > > > > > > > > > See below data. All 24 reply queue is in used correctly. 
> > > > > > > > > > > > > > > > > > > > IRQs / 1 second(s) > > > > > > > > > > IRQ# TOTAL NODE0 NODE1 NAME > > > > > > > > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge > > megasas > > > > > > > > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge > > megasas > > > > > > > > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge > > megasas > > > > > > > > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge > > megasas > > > > > > > > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge > > megasas > > > > > > > > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge > > megasas > > > > > > > > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge > > megasas > > > > > > > > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge > > megasas > > > > > > > > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge > > megasas > > > > > > > > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge > > megasas > > > > > > > > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge > > megasas > > > > > > > > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge > > megasas > > > > > > > > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge > > megasas > > > > > > > > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge > > megasas > > > > > > > > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge > > megasas > > > > > > > > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge > > megasas > > > > > > > > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge > > megasas > > > > > > > > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge > > megasas > > > > > > > > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge > > megasas > > > > > > > > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge > > megasas > > > > > > > > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge > > megasas > > > > > > > > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge > > megasas > > > > > > > > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge > > megasas > > > > > > > > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge > > megasas > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Average: CPU %usr %nice %sys > > %iowait > > > > > > > %steal > > > > > > > > > > %irq %soft %guest %gnice %idle > > > > > > > > > > Average: 18 3.80 0.00 14.78 > > 10.08 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.01 0.00 0.00 67.33 > > > > > > > > > > Average: 19 3.26 0.00 15.35 > > 10.62 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.03 0.00 0.00 66.74 > > > > > > > > > > Average: 20 3.42 0.00 14.57 > > 10.67 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.84 0.00 0.00 67.50 > > > > > > > > > > Average: 21 3.19 0.00 15.60 > > 10.75 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.16 0.00 0.00 66.30 > > > > > > > > > > Average: 22 3.58 0.00 15.15 > > 10.66 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.51 0.00 0.00 67.11 > > > > > > > > > > Average: 23 3.34 0.00 15.36 > > 10.63 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.17 0.00 0.00 66.50 > > > > > > > > > > Average: 24 3.50 0.00 14.58 > > 10.93 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.85 0.00 0.00 67.13 > > > > > > > > > > Average: 25 3.20 0.00 14.68 > > 10.86 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.31 0.00 0.00 66.95 > > > > > > > > > > Average: 26 3.27 0.00 14.80 > > 10.70 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.68 0.00 0.00 67.55 > > > > > > > > > > Average: 27 3.58 0.00 15.36 > > 10.80 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.79 0.00 0.00 66.48 > > > > > > > > > > Average: 28 3.46 0.00 15.17 > > 10.46 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.32 0.00 0.00 67.59 > > > > > > > > > > Average: 29 3.34 0.00 14.42 > > 10.72 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.34 0.00 0.00 
68.18 > > > > > > > > > > Average: 30 3.34 0.00 15.08 > > 10.70 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.89 0.00 0.00 66.99 > > > > > > > > > > Average: 31 3.26 0.00 15.33 > > 10.47 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.33 0.00 0.00 67.61 > > > > > > > > > > Average: 32 3.21 0.00 14.80 > > 10.61 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.70 0.00 0.00 67.67 > > > > > > > > > > Average: 33 3.40 0.00 13.88 > > 10.55 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.02 0.00 0.00 68.15 > > > > > > > > > > Average: 34 3.74 0.00 17.41 > > 10.61 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.51 0.00 0.00 63.73 > > > > > > > > > > Average: 35 3.35 0.00 14.37 > > 10.74 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.84 0.00 0.00 67.71 > > > > > > > > > > Average: 36 0.54 0.00 1.77 > > 0.00 > > > > > > > 0.00 > > > > > > > > > > 0.00 0.00 0.00 0.00 97.69 > > > > > > > > > > .. > > > > > > > > > > Average: 54 3.60 0.00 15.17 > > 10.39 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.22 0.00 0.00 66.62 > > > > > > > > > > Average: 55 3.33 0.00 14.85 > > 10.55 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.96 0.00 0.00 67.31 > > > > > > > > > > Average: 56 3.40 0.00 15.19 > > 10.54 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.74 0.00 0.00 67.13 > > > > > > > > > > Average: 57 3.41 0.00 13.98 > > 10.78 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.10 0.00 0.00 67.73 > > > > > > > > > > Average: 58 3.32 0.00 15.16 > > 10.52 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.01 0.00 0.00 66.99 > > > > > > > > > > Average: 59 3.17 0.00 15.80 > > 10.35 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.86 0.00 0.00 66.80 > > > > > > > > > > Average: 60 3.00 0.00 14.63 > > 10.59 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.97 0.00 0.00 67.80 > > > > > > > > > > Average: 61 3.34 0.00 14.70 > > 10.66 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.32 0.00 0.00 66.97 > > > > > > > > > > Average: 62 3.34 0.00 15.29 > > 10.56 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.89 0.00 0.00 66.92 > > > > > > > > > > Average: 63 3.29 0.00 14.51 > > 10.72 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.85 0.00 0.00 67.62 > > > > > > > > > > Average: 64 3.48 0.00 15.31 > > 10.65 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.97 0.00 0.00 66.60 > > > > > > > > > > Average: 65 3.34 0.00 14.36 > > 10.80 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.11 0.00 0.00 67.39 > > > > > > > > > > Average: 66 3.13 0.00 14.94 > > 10.70 > > > > > > > 0.00 > > > > > > > > > > 0.00 4.10 0.00 0.00 67.13 > > > > > > > > > > Average: 67 3.06 0.00 15.56 > > 10.69 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.82 0.00 0.00 66.88 > > > > > > > > > > Average: 68 3.33 0.00 14.98 > > 10.61 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.81 0.00 0.00 67.27 > > > > > > > > > > Average: 69 3.20 0.00 15.43 > > 10.70 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.82 0.00 0.00 66.85 > > > > > > > > > > Average: 70 3.34 0.00 17.14 > > 10.59 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.00 0.00 0.00 65.92 > > > > > > > > > > Average: 71 3.41 0.00 14.94 > > 10.56 > > > > > > > 0.00 > > > > > > > > > > 0.00 3.41 0.00 0.00 67.69 > > > > > > > > > > > > > > > > > > > > Perf top - > > > > > > > > > > > > > > > > > > > > 64.33% [kernel] [k] bt_iter > > > > > > > > > > 4.86% [kernel] [k] > > blk_mq_queue_tag_busy_iter > > > > > > > > > > 4.23% [kernel] [k] _find_next_bit > > > > > > > > > > 2.40% [kernel] [k] > > > > > native_queued_spin_lock_slowpath > > > > > > > > > > 1.09% [kernel] [k] sbitmap_any_bit_set > > > > > > > > > > 0.71% [kernel] [k] sbitmap_queue_clear > > > > > 
> > > > > 0.63% [kernel] [k] find_next_bit > > > > > > > > > > 0.54% [kernel] [k] _raw_spin_lock_irqsave > > > > > > > > > > > > > > > > > > > Ah. So we're spending quite some time in trying to find > > > > > > > > > a free > > > > > tag. > > > > > > > > > I guess this is due to every queue starting at the same > > > > > > > > > position trying to find a free tag, which inevitably > > > > > > > > > leads > > to a > > > contention. > > > > > > > > > > > > > > > > IMO, the above trace means that blk_mq_in_flight() may be > > > > > > > > the > > > > > > > bottleneck, > > > > > > > > and looks not related with tag allocation. > > > > > > > > > > > > > > > > Kashyap, could you run your performance test again after > > > > > > > > disabling > > > > > > > iostat by > > > > > > > > the following command on all test devices and killing all > > > > > > > > utilities > > > > > > > which may > > > > > > > > read iostat(/proc/diskstats, ...)? > > > > > > > > > > > > > > > > echo 0 > /sys/block/sdN/queue/iostat > > > > > > > > > > > > > > Ming - After changing iostat = 0 , I see performance issue > > > > > > > is > > > > > resolved. > > > > > > > > > > > > > > Below is perf top output after iostats = 0 > > > > > > > > > > > > > > > > > > > > > 23.45% [kernel] [k] bt_iter > > > > > > > 2.27% [kernel] [k] blk_mq_queue_tag_busy_iter > > > > > > > 2.18% [kernel] [k] _find_next_bit > > > > > > > 2.06% [megaraid_sas] [k] complete_cmd_fusion > > > > > > > 1.87% [kernel] [k] clflush_cache_range > > > > > > > 1.70% [kernel] [k] dma_pte_clear_level > > > > > > > 1.56% [kernel] [k] __domain_mapping > > > > > > > 1.55% [kernel] [k] sbitmap_queue_clear > > > > > > > 1.30% [kernel] [k] gup_pgd_range > > > > > > > > > > > > Hi Kashyap, > > > > > > > > > > > > Thanks for your test and update. > > > > > > > > > > > > Looks blk_mq_queue_tag_busy_iter() is still sampled by perf > > > > > > even though iostats is disabled, and I guess there may be > > > > > > utilities which are > > > > > reading iostats > > > > > > a bit frequently. > > > > > > > > > > I will be doing some more testing and post you my findings. > > > > > > > > I will find sometime this weekend to see if I can cook a patch to > > > > address this issue of io accounting. > > > > > > Hi Kashyap, > > > > > > Please test the top 5 patches in the following tree to see if > > megaraid_sas's > > > performance is OK: > > > > > > https://github.com/ming1/linux/commits/v4.15-for-next-global-tags- > > > v2 > > > > > > This tree is made by adding these 5 patches against patchset V2. > > > > > > > Ming - > > I applied 5 patches on top of V2 and behavior is still unchanged. > > Below is perf top data. (1000K IOPS) > > > > 34.58% [kernel] [k] bt_iter > > 2.96% [kernel] [k] sbitmap_any_bit_set > > 2.77% [kernel] [k] bt_iter_global_tags > > 1.75% [megaraid_sas] [k] complete_cmd_fusion > > 1.62% [kernel] [k] sbitmap_queue_clear > > 1.62% [kernel] [k] _raw_spin_lock > > 1.51% [kernel] [k] blk_mq_run_hw_queue > > 1.45% [kernel] [k] gup_pgd_range > > 1.31% [kernel] [k] irq_entries_start > > 1.29% fio [.] 
__fio_gettime > > 1.13% [kernel] [k] _raw_spin_lock_irqsave > > 0.95% [kernel] [k] native_queued_spin_lock_slowpath > > 0.92% [kernel] [k] scsi_queue_rq > > 0.91% [kernel] [k] blk_mq_run_hw_queues > > 0.85% [kernel] [k] blk_mq_get_request > > 0.81% [kernel] [k] switch_mm_irqs_off > > 0.78% [megaraid_sas] [k] megasas_build_io_fusion > > 0.77% [kernel] [k] __schedule > > 0.73% [kernel] [k] update_load_avg > > 0.69% [kernel] [k] fput > > 0.65% [kernel] [k] scsi_dispatch_cmd > > 0.64% fio [.] fio_libaio_event > > 0.53% [kernel] [k] do_io_submit > > 0.52% [kernel] [k] read_tsc > > 0.51% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion > > 0.51% [kernel] [k] scsi_softirq_done > > 0.50% [kernel] [k] kobject_put > > 0.50% [kernel] [k] cpuidle_enter_state > > 0.49% [kernel] [k] native_write_msr > > 0.48% fio [.] io_completed > > > > Below is perf top data with iostat=0 (1400K IOPS) > > > > 4.87% [kernel] [k] sbitmap_any_bit_set > > 2.93% [kernel] [k] _raw_spin_lock > > 2.84% [megaraid_sas] [k] complete_cmd_fusion > > 2.38% [kernel] [k] irq_entries_start > > 2.36% [kernel] [k] gup_pgd_range > > 2.35% [kernel] [k] blk_mq_run_hw_queue > > 2.30% [kernel] [k] sbitmap_queue_clear > > 2.01% fio [.] __fio_gettime > > 1.78% [kernel] [k] _raw_spin_lock_irqsave > > 1.51% [kernel] [k] scsi_queue_rq > > 1.43% [kernel] [k] blk_mq_run_hw_queues > > 1.36% [kernel] [k] fput > > 1.32% [kernel] [k] __schedule > > 1.31% [kernel] [k] switch_mm_irqs_off > > 1.29% [kernel] [k] update_load_avg > > 1.25% [megaraid_sas] [k] megasas_build_io_fusion > > 1.22% [kernel] [k] > > native_queued_spin_lock_slowpath > > 1.03% [kernel] [k] scsi_dispatch_cmd > > 1.03% [kernel] [k] blk_mq_get_request > > 0.91% fio [.] fio_libaio_event > > 0.89% [kernel] [k] scsi_softirq_done > > 0.87% [kernel] [k] kobject_put > > 0.86% [kernel] [k] cpuidle_enter_state > > 0.84% fio [.] io_completed > > 0.83% [kernel] [k] do_io_submit > > 0.83% [megaraid_sas] [k] > > megasas_build_and_issue_cmd_fusion > > 0.83% [kernel] [k] __switch_to > > 0.82% [kernel] [k] read_tsc > > 0.80% [kernel] [k] native_write_msr > > 0.76% [kernel] [k] aio_comp > > > > > > Perf data without V2 patch applied. (1600K IOPS) > > > > 5.97% [megaraid_sas] [k] complete_cmd_fusion > > 5.24% [kernel] [k] bt_iter > > 3.28% [kernel] [k] _raw_spin_lock > > 2.98% [kernel] [k] irq_entries_start > > 2.29% fio [.] __fio_gettime > > 2.04% [kernel] [k] scsi_queue_rq > > 1.92% [megaraid_sas] [k] megasas_build_io_fusion > > 1.61% [kernel] [k] switch_mm_irqs_off > > 1.59% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion > > 1.41% [kernel] [k] scsi_dispatch_cmd > > 1.33% [kernel] [k] scsi_softirq_done > > 1.18% [kernel] [k] gup_pgd_range > > 1.18% [kernel] [k] blk_mq_complete_request > > 1.13% [kernel] [k] blk_mq_free_request > > 1.05% [kernel] [k] do_io_submit > > 1.04% [kernel] [k] _find_next_bit > > 1.02% [kernel] [k] blk_mq_get_request > > 0.95% [megaraid_sas] [k] megasas_build_ldio_fusion > > 0.95% [kernel] [k] scsi_dec_host_busy > > 0.89% fio [.] get_io_u > > 0.88% [kernel] [k] entry_SYSCALL_64 > > 0.84% [megaraid_sas] [k] megasas_queue_command > > 0.79% [kernel] [k] native_write_msr > > 0.77% [kernel] [k] read_tsc > > 0.73% [kernel] [k] _raw_spin_lock_irqsave > > 0.73% fio [.] fio_libaio_commit > > 0.72% [kernel] [k] kmem_cache_alloc > > 0.72% [kernel] [k] blkdev_direct_IO > > 0.69% [megaraid_sas] [k] MR_GetPhyParams > > 0.68% [kernel] [k] blk_mq_dequeue_f > > The above data is very helpful to understand the issue, great thanks! 
> > With this patchset V2 and the 5 patches, if iostats is set as 0, IOPS is 1400K, but > 1600K IOPS can be reached without all these patches with iostats as 1. > > BTW, could you share us what the machine is? ARM64? I saw ARM64's cache > coherence performance is bad before. In the dual socket system(each socket > has 8 X86 CPU cores) I tested, only ~0.5% IOPS drop can be observed after the > 5 patches are applied on V2 in null_blk test, which is described in commit log. I am using Intel Skylake/Lewisburg/Purley. > > Looks it means single sbitmap can't perform well under MQ's case in which > there will be much more concurrent submissions and completions. In case of > single hw queue(current linus tree), one hctx->run_work only allows one > __blk_mq_run_hw_queue() running at 'async' mode, and reply queues are > used in round-robin way, which may cause contention on the single sbitmap > too, especially io accounting may consume a bit much more CPU, I guess that > may contribute some on the CPU lockup. > > Could you run your test without V2 patches by setting 'iostats' as 0? Tested without V2 patch set. Iostat=1. IOPS = 1600K 5.93% [megaraid_sas] [k] complete_cmd_fusion 5.34% [kernel] [k] bt_iter 3.23% [kernel] [k] _raw_spin_lock 2.92% [kernel] [k] irq_entries_start 2.57% fio [.] __fio_gettime 2.10% [kernel] [k] scsi_queue_rq 1.98% [megaraid_sas] [k] megasas_build_io_fusion 1.93% [kernel] [k] switch_mm_irqs_off 1.79% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion 1.45% [kernel] [k] scsi_softirq_done 1.42% [kernel] [k] scsi_dispatch_cmd 1.23% [kernel] [k] blk_mq_complete_request 1.11% [megaraid_sas] [k] megasas_build_ldio_fusion 1.11% [kernel] [k] gup_pgd_range 1.08% [kernel] [k] blk_mq_free_request 1.03% [kernel] [k] do_io_submit 1.02% [kernel] [k] _find_next_bit 1.00% [kernel] [k] scsi_dec_host_busy 0.94% [kernel] [k] blk_mq_get_request 0.93% [megaraid_sas] [k] megasas_queue_command 0.92% [kernel] [k] native_write_msr 0.85% fio [.] get_io_u 0.83% [kernel] [k] entry_SYSCALL_64 0.83% [kernel] [k] _raw_spin_lock_irqsave 0.82% [kernel] [k] read_tsc 0.81% [sd_mod] [k] sd_init_command 0.67% [kernel] [k] kmem_cache_alloc 0.63% [kernel] [k] memset_erms 0.63% [kernel] [k] aio_read_events 0.62% [kernel] [k] blkdev_dir Tested without V2 patch set. Iostat=0. IOPS = 1600K 5.79% [megaraid_sas] [k] complete_cmd_fusion 3.28% [kernel] [k] _raw_spin_lock 3.28% [kernel] [k] irq_entries_start 2.10% [kernel] [k] scsi_queue_rq 1.96% fio [.] __fio_gettime 1.85% [megaraid_sas] [k] megasas_build_io_fusion 1.68% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion 1.36% [kernel] [k] gup_pgd_range 1.36% [kernel] [k] scsi_dispatch_cmd 1.28% [kernel] [k] do_io_submit 1.25% [kernel] [k] switch_mm_irqs_off 1.20% [kernel] [k] blk_mq_free_request 1.18% [megaraid_sas] [k] megasas_build_ldio_fusion 1.11% [kernel] [k] dput 1.07% [kernel] [k] scsi_softirq_done 1.07% fio [.] get_io_u 1.07% [kernel] [k] scsi_dec_host_busy 1.02% [kernel] [k] blk_mq_get_request 0.96% [sd_mod] [k] sd_init_command 0.92% [kernel] [k] entry_SYSCALL_64 0.89% [kernel] [k] blk_mq_make_request 0.87% [kernel] [k] blkdev_direct_IO 0.84% [kernel] [k] blk_mq_complete_request 0.78% [kernel] [k] _raw_spin_lock_irqsave 0.77% [kernel] [k] lookup_ioctx 0.76% [megaraid_sas] [k] MR_GetPhyParams 0.75% [kernel] [k] blk_mq_dequeue_from_ctx 0.75% [kernel] [k] memset_erms 0.74% [kernel] [k] kmem_cache_alloc 0.72% [megaraid_sas] [k] megasas_queue_comman > and could you share us what the .can_queue is in this HBA? can_queue = 8072. 
In my test I used --iodepth=128 for 12 SCSI device (R0 Volume.) FIO will only push 1536 outstanding commands. > > > > > > > > If possible, please provide us the performance data without these > > patches and > > > with these patches, together with perf trace. > > > > > > The top 5 patches are for addressing the io accounting issue, and > > > which should be the main reason for your performance drop, even > > > lockup in megaraid_sas's ISR, IMO. > > > > I think performance drop is different issue. May be a side effect of > > the patch set. Even though we fix this perf issue, cpu lock up is > > completely different issue. > > The performance drop is caused by the global data structure of sbitmap which > is accessed from all CPUs concurrently. > > > Regarding cpu lock up, there was similar discussion and folks are > > finding irq poll is good method to resolve lockup. Not sure why NVME > > driver did not opted irq_poll, but there was extensive discussion and > > I am also > > NVMe's hw queues won't use host wide tags, so no such issue. > > > seeing cpu lock up mainly due to multiple completion queue/reply queue > > is tied to single CPU. We have weighing method in irq poll to quit ISR > > and that is the way we can avoid lock-up. > > http://lists.infradead.org/pipermail/linux-nvme/2017-January/007724.ht > > ml > > This patch can make sure that one request is always completed in the > submission CPU, but contention on the global sbitmap is too big and causes > performance drop. > > Now looks this is really an interesting topic for discussion. > > > Thanks, > Ming
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 0f1d88f..75ea86b 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -50,6 +50,7 @@
 #include <linux/mutex.h>
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
+#include <linux/blk-mq-pci.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -220,6 +221,15 @@ static int megasas_get_ld_vf_affiliation(struct megasas_instance *instance,
 static inline void
 megasas_init_ctrl_params(struct megasas_instance *instance);
 
+
+static int megaraid_sas_map_queues(struct Scsi_Host *shost)
+{
+        struct megasas_instance *instance;
+        instance = (struct megasas_instance *)shost->hostdata;
+
+        return blk_mq_pci_map_queues(&shost->tag_set, instance->pdev);
+}
+
 /**
  * megasas_set_dma_settings -  Populate DMA address, length and flags for DCMDs
  * @instance:           Adapter soft state
@@ -3177,6 +3187,8 @@ struct device_attribute *megaraid_host_attrs[] = {
         .use_clustering = ENABLE_CLUSTERING,
         .change_queue_depth = scsi_change_queue_depth,
         .no_write_same = 1,
+        .map_queues = megaraid_sas_map_queues,
+        .host_tagset = 1,
 };
 
 /**
@@ -5965,6 +5977,9 @@ static int megasas_io_attach(struct megasas_instance *instance)
         host->max_lun = MEGASAS_MAX_LUN;
         host->max_cmd_len = 16;
 
+        /* map reply queue to blk_mq hw queue */
+        host->nr_hw_queues = instance->msix_vectors;
+
         /*
          * Notify the mid-layer about the new controller
          */
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 073ced0..034d976 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2655,11 +2655,15 @@ static void megasas_stream_detect(struct megasas_instance *instance,
                 fp_possible = (io_info.fpOkForIo > 0) ? true : false;
         }
 
+#if 0
         /* Use raw_smp_processor_id() for now until cmd->request->cpu is CPU
            id by default, not CPU group id, otherwise all MSI-X queues won't
            be utilized */
         cmd->request_desc->SCSIIO.MSIxIndex = instance->msix_vectors ?
                 raw_smp_processor_id() % instance->msix_vectors : 0;
+#endif
+
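A note on the #if 0 hunk above: once map_queues()/host_tagset expose each reply queue as a blk-mq hw queue, the natural follow-up (an assumption about the direction this is heading, not part of the posted diff) is to derive the MSI-x index from the hw queue that owns the request rather than from raw_smp_processor_id(). A hedged sketch of that, using the generic blk-mq tag helpers:

#include <linux/blk-mq.h>
#include <scsi/scsi_cmnd.h>

/*
 * Hypothetical helper, not in the diff: pick the reply queue from the
 * blk-mq hw queue that owns this request, so the completion vector
 * follows the map_queues() layout set up above.
 */
static u16 megasas_sketch_msix_index(struct scsi_cmnd *scmd)
{
        u32 unique_tag = blk_mq_unique_tag(scmd->request);

        return blk_mq_unique_tag_to_hwq(unique_tag);
}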