Message ID | 20160205151334.GA82754@redhat.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, Feb 05 2016 at 10:13am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> Following is RFC because it really speaks to dm-mq _needing_ a variant
> of blk_mq_complete_request() that supports partial completions. Not
> supporting partial completions really isn't an option for DM multipath.
>
> From: Mike Snitzer <snitzer@redhat.com>
> Date: Fri, 5 Feb 2016 08:49:01 -0500
> Subject: [RFC PATCH] dm: fix excessive dm-mq context switching
>
> Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
> than if an underlying null_blk device were used directly. The biggest
> reason for this drop in performance is that blk_insert_clone_request()
> was calling blk_mq_insert_request() with @async=true. This forced the
> use of kblockd_schedule_delayed_work_on() to run the queues which
> ushered in ping-ponging between process context (fio in this case) and
> kblockd's kworker to submit the cloned request. The ftrace
> function_graph tracer showed:
>
> kworker-2013 => fio-12190
> fio-12190 => kworker-2013
> ...
> kworker-2013 => fio-12190
> fio-12190 => kworker-2013
> ...
>
> Fixing blk_mq_insert_request() to _not_ use kblockd to submit the
> cloned requests isn't enough to eliminate the observed context
> switches.
>
> In addition to this dm-mq specific blk-core fix, there were 2 DM core
> fixes to dm-mq that (when paired with the blk-core fix) completely
> eliminate the observed context switching:
>
> 1) don't blk_mq_run_hw_queues in blk-mq request completion
>
> Motivated by desire to reduce overhead of dm-mq, punting to kblockd
> just increases context switches.
>
> In my testing against a really fast null_blk device there was no
> benefit to running blk_mq_run_hw_queues() on completion (and no other
> blk-mq driver does this). So hopefully this change doesn't induce the
> need for yet another revert like commit 621739b00e16ca2d!
>
> 2) use blk_mq_complete_request() in dm_complete_request()
>
> blk_complete_request() doesn't offer the traditional q->mq_ops vs
> .request_fn branching pattern that other historic block interfaces
> do (e.g. blk_get_request). Using blk_mq_complete_request() for
> blk-mq requests is important for performance but it doesn't handle
> partial completions -- which is a pretty big problem given the
> potential for partial completions with DM multipath due to path
> failure(s). As such this makes this entire patch only RFC-worthy.
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index c683f6d..a618477 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1344,7 +1340,10 @@ static void dm_complete_request(struct request *rq, int error)
>  	struct dm_rq_target_io *tio = tio_from_request(rq);
>
>  	tio->error = error;
> -	blk_complete_request(rq);
> +	if (!rq->q->mq_ops)
> +		blk_complete_request(rq);
> +	else
> +		blk_mq_complete_request(rq, rq->errors);
>  }
>
>  /*

Looking closer, DM is very likely OK just using blk_mq_complete_request.

blk_complete_request() also doesn't provide native partial completion
support (it relies on the driver to do it, which DM core does):

/**
 * blk_complete_request - end I/O on a request
 * @req:      the request being processed
 *
 * Description:
 *     Ends all I/O on a request. It does not handle partial completions,
 *     unless the driver actually implements this in its completion callback
 *     through requeueing. The actual completion happens out-of-order,
 *     through a softirq handler. The user must have registered a completion
 *     callback through blk_queue_softirq_done().
 **/

blk_mq_complete_request() is effectively implemented in a comparable
fashion to blk_complete_request(). And DM core is providing partial
completion support via dm.c:end_clone_bio() triggering requeueing of
the request through dm-mpath.c:multipath_end_io()'s return of
DM_ENDIO_REQUEUE.

So I'm thinking I can drop the "RFC" for this patch and run with it..
once I get Jens' feedback (hopefully) confirming my understanding.

Jens, please advise. If you're comfortable providing your Acked-by I
can get this fix in for 4.5-rc4 or so...

Thanks!
Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
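For readers following along, the branch the patch adds to dm_complete_request() mirrors the classic q->mq_ops test. Below is a minimal userspace sketch of that dispatch decision; the struct layouts and the enum are mocks for illustration, not the kernel's real types.

```c
/*
 * Userspace sketch (NOT kernel code): mock types modeling the
 * q->mq_ops branch that the patch adds to dm_complete_request().
 */
#include <stddef.h>

struct request_queue {
	const void *mq_ops;	/* non-NULL iff the queue is blk-mq */
};

struct request {
	struct request_queue *q;
	int errors;
};

enum complete_via {
	COMPLETE_VIA_SOFTIRQ,	/* blk_complete_request(rq)                 */
	COMPLETE_VIA_BLK_MQ	/* blk_mq_complete_request(rq, rq->errors)  */
};

/* Same shape as the patched dm_complete_request() branch */
enum complete_via complete_path(const struct request *rq)
{
	if (!rq->q->mq_ops)
		return COMPLETE_VIA_SOFTIRQ;	/* old .request_fn path */
	return COMPLETE_VIA_BLK_MQ;		/* blk-mq path */
}
```

The point of the pattern is that a single completion entry point serves both the legacy .request_fn and blk-mq request paths, the same way blk_get_request() and friends branch internally.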
On Fri, Feb 05 2016 at 1:05pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Fri, Feb 05 2016 at 10:13am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
> > Following is RFC because it really speaks to dm-mq _needing_ a variant
> > of blk_mq_complete_request() that supports partial completions. Not
> > supporting partial completions really isn't an option for DM multipath.
> >
> > From: Mike Snitzer <snitzer@redhat.com>
> > Date: Fri, 5 Feb 2016 08:49:01 -0500
> > Subject: [RFC PATCH] dm: fix excessive dm-mq context switching
> >
> > Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
> > than if an underlying null_blk device were used directly. The biggest
> > reason for this drop in performance is that blk_insert_clone_request()
> > was calling blk_mq_insert_request() with @async=true. This forced the
> > use of kblockd_schedule_delayed_work_on() to run the queues which
> > ushered in ping-ponging between process context (fio in this case) and
> > kblockd's kworker to submit the cloned request. The ftrace
> > function_graph tracer showed:
> >
> > kworker-2013 => fio-12190
> > fio-12190 => kworker-2013
> > ...
> > kworker-2013 => fio-12190
> > fio-12190 => kworker-2013
> > ...
> >
> > Fixing blk_mq_insert_request() to _not_ use kblockd to submit the
> > cloned requests isn't enough to eliminate the observed context
> > switches.
> >
> > In addition to this dm-mq specific blk-core fix, there were 2 DM core
> > fixes to dm-mq that (when paired with the blk-core fix) completely
> > eliminate the observed context switching:
> >
> > 1) don't blk_mq_run_hw_queues in blk-mq request completion
> >
> > Motivated by desire to reduce overhead of dm-mq, punting to kblockd
> > just increases context switches.
> >
> > In my testing against a really fast null_blk device there was no
> > benefit to running blk_mq_run_hw_queues() on completion (and no other
> > blk-mq driver does this). So hopefully this change doesn't induce the
> > need for yet another revert like commit 621739b00e16ca2d!
> >
> > 2) use blk_mq_complete_request() in dm_complete_request()
> >
> > blk_complete_request() doesn't offer the traditional q->mq_ops vs
> > .request_fn branching pattern that other historic block interfaces
> > do (e.g. blk_get_request). Using blk_mq_complete_request() for
> > blk-mq requests is important for performance but it doesn't handle
> > partial completions -- which is a pretty big problem given the
> > potential for partial completions with DM multipath due to path
> > failure(s). As such this makes this entire patch only RFC-worthy.
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index c683f6d..a618477 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1344,7 +1340,10 @@ static void dm_complete_request(struct request *rq, int error)
> >  	struct dm_rq_target_io *tio = tio_from_request(rq);
> >
> >  	tio->error = error;
> > -	blk_complete_request(rq);
> > +	if (!rq->q->mq_ops)
> > +		blk_complete_request(rq);
> > +	else
> > +		blk_mq_complete_request(rq, rq->errors);
> >  }
> >
> >  /*
>
> Looking closer, DM is very likely OK just using blk_mq_complete_request.
>
> blk_complete_request() also doesn't provide native partial completion
> support (it relies on the driver to do it, which DM core does):
>
> /**
>  * blk_complete_request - end I/O on a request
>  * @req:      the request being processed
>  *
>  * Description:
>  *     Ends all I/O on a request. It does not handle partial completions,
>  *     unless the driver actually implements this in its completion callback
>  *     through requeueing. The actual completion happens out-of-order,
>  *     through a softirq handler. The user must have registered a completion
>  *     callback through blk_queue_softirq_done().
>  **/
>
> blk_mq_complete_request() is effectively implemented in a comparable
> fashion to blk_complete_request(). And DM core is providing partial
> completion support via dm.c:end_clone_bio() triggering requeueing of
> the request through dm-mpath.c:multipath_end_io()'s return of
> DM_ENDIO_REQUEUE.
>
> So I'm thinking I can drop the "RFC" for this patch and run with
> it.. once I get Jens' feedback (hopefully) confirming my understanding.
>
> Jens, please advise. If you're comfortable providing your Acked-by I
> can get this fix in for 4.5-rc4 or so...

FYI, here is the latest revised patch:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2

(revised patch header and fixed a thinko in the dm.c:rq_completed()
change from the RFC patch I posted earlier)
> FYI, here is the latest revised patch:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
>
> (revised patch header and fixed a thinko in the dm.c:rq_completed()
> change from the RFC patch I posted earlier)

Hi Mike,

So I gave your patches a go (dm-4.6) but I still don't see the
improvement you reported (while I do see a minor improvement).

null_blk queue_mode=2 submit_queues=24
dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y

I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.

Is there something I'm missing?
On Sun, Feb 07 2016 at 10:41am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> > FYI, here is the latest revised patch:
> > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
> >
> > (revised patch header and fixed a thinko in the dm.c:rq_completed()
> > change from the RFC patch I posted earlier)
>
> Hi Mike,
>
> So I gave your patches a go (dm-4.6) but I still don't see the
> improvement you reported (while I do see a minor improvement).
>
> null_blk queue_mode=2 submit_queues=24
> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
>
> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.

blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches,
the first being the most important, it shouldn't hurt either provided
you have 24 cpus).

Could be you have multiple NUMA nodes and are seeing problems from that?

I have 12 cpus (in the same physical cpu) and only a single NUMA node.
I get the same results with blk_mq_nr_hw_queues=12 as with
blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues).
I've seen my IOPs go from ~950K to ~1400K. The peak null_blk can get on
my setup is ~1950K. So I'm still seeing a ~25% drop with dm-mq (but
that is much better than the over 50% drop I was seeing).

> Is there something I'm missing?

Not sure, I just emailed out all my patches (and cc'd you). Please
verify you're using the latest here (same as 'dm-4.6' branch):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

I rebased a couple times... so please diff what you have tested against
this latest 'dm-4.6' branch.
On 02/07/16 07:41, Sagi Grimberg wrote:
>
>> FYI, here is the latest revised patch:
>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.6&id=a5b835282422ec41991c1dbdb88daa4af7d166d2
>>
>> (revised patch header and fixed a thinko in the dm.c:rq_completed()
>> change from the RFC patch I posted earlier)
>
> Hi Mike,
>
> So I gave your patches a go (dm-4.6) but I still don't see the
> improvement you reported (while I do see a minor improvement).
>
> null_blk queue_mode=2 submit_queues=24
> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
>
> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.
>
> Is there something I'm missing?

Hello Sagi,

Did you run your test on a NUMA system? If so, can you check with e.g.
perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
workload triggers perhaps lock contention? What you need to look for in
the perf output is whether any functions occupy more than 10% CPU time.

Bart.
>> Hi Mike,
>>
>> So I gave your patches a go (dm-4.6) but I still don't see the
>> improvement you reported (while I do see a minor improvement).
>>
>> null_blk queue_mode=2 submit_queues=24
>> dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y
>>
>> I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.
>
> blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches,
> the first being the most important, it shouldn't hurt either provided
> you have 24 cpus).

I tried with less but as you said, it didn't have an impact...

> Could be you have multiple NUMA nodes and are seeing problems from that?

I am running on a dual socket server, this can most likely be the
culprit...

> I have 12 cpus (in the same physical cpu) and only a single NUMA node.
> I get the same results with blk_mq_nr_hw_queues=12 as with
> blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues).
> I've seen my IOPs go from ~950K to ~1400K. The peak null_blk can get on
> my setup is ~1950K. So I'm still seeing a ~25% drop with dm-mq (but
> that is much better than the over 50% drop I was seeing).

That's what I was planning on :(

>> Is there something I'm missing?
>
> Not sure, I just emailed out all my patches (and cc'd you). Please
> verify you're using the latest here (same as 'dm-4.6' branch):
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next
>
> I rebased a couple times... so please diff what you have tested against
> this latest 'dm-4.6' branch.

I am. I'll try to instrument what's going on...
> Hello Sagi,

Hey Bart,

> Did you run your test on a NUMA system ?

I did.

> If so, can you check with e.g.
> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> workload triggers perhaps lock contention ? What you need to look for in
> the perf output is whether any functions occupy more than 10% CPU time.

I will, thanks for the tip!
On Sun, Feb 07 2016 at 11:43am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> > Hello Sagi,
>
> Hey Bart,
>
> > Did you run your test on a NUMA system ?
>
> I did.
>
> > If so, can you check with e.g.
> > perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> > workload triggers perhaps lock contention ? What you need to look for in
> > the perf output is whether any functions occupy more than 10% CPU time.
>
> I will, thanks for the tip!

Also, I found ftrace's function_graph tracer very helpful (it is how I
found the various issues fixed by the first context switch patch).

Here is my latest script:

#!/bin/sh
set -xv

NULL_BLK_HW_QUEUES=4
NULL_BLK_QUEUE_DEPTH=4096

DM_MQ_HW_QUEUES=4
DM_MQ_QUEUE_DEPTH=2048

FIO=/root/snitm/git/fio/fio
FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

PERF=perf
#PERF=/root/snitm/git/linux/tools/perf/perf

run_fio() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})
    PERF_RECORD=$2
    RUN_CMD="${FIO} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
             --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
             --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"
    if [ ! -z "${PERF_RECORD}" ]; then
        ${PERF_RECORD} ${RUN_CMD}
        mv perf.data perf.data.${TASK_NAME}
    else
        ${RUN_CMD}
    fi
}

run_fio_with_ftrace() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})
    echo > /sys/kernel/debug/tracing/trace
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    echo function_graph > /sys/kernel/debug/tracing/current_tracer
    echo 1 > /sys/kernel/debug/tracing/tracing_on
    run_fio $DEVICE
    echo 0 > /sys/kernel/debug/tracing/tracing_on
    cat /sys/kernel/debug/tracing/trace > trace.${TASK_NAME}
    echo nop > /sys/kernel/debug/tracing/current_tracer
}

dmsetup remove dm_mq
modprobe -r null_blk
modprobe null_blk gb=4 bs=512 hw_queue_depth=${NULL_BLK_QUEUE_DEPTH} nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}

#run_fio /dev/nullb0 "${PERF} record -ag -e cs"
#run_fio /dev/nullb0 "${PERF} stat"

echo Y > /sys/module/dm_mod/parameters/use_blk_mq
echo ${DM_MQ_QUEUE_DEPTH} > /sys/module/dm_mod/parameters/blk_mq_queue_depth
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/blk_mq_nr_hw_queues
echo "0 8388608 multipath 0 0 1 1 service-time 0 1 2 /dev/nullb0 1000 1" | dmsetup create dm_mq
#echo "0 8388608 linear /dev/nullb0 0" | dmsetup create dm_mq

run_fio_with_ftrace /dev/mapper/dm_mq
#run_fio /dev/mapper/dm_mq
#run_fio /dev/mapper/dm_mq "${PERF} record -ag -e cs"
#run_fio /dev/mapper/dm_mq "${PERF} record -ag"
#run_fio /dev/mapper/dm_mq "${PERF} stat"
#run_fio /dev/mapper/dm_mq "trace-cmd record -e all"

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
>> If so, can you check with e.g.
>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>> workload triggers perhaps lock contention ? What you need to look for in
>> the perf output is whether any functions occupy more than 10% CPU time.
>
> I will, thanks for the tip!

The perf report is very similar to the one that started this effort..

I'm afraid we'll need to resolve the per-target m->lock in order
to scale with NUMA...

-  17.33%  fio  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - queued_spin_lock_slowpath
      - 52.09% _raw_spin_lock_irq
           __multipath_map
           multipath_clone_and_map
           map_request
           dm_mq_queue_rq
           __blk_mq_run_hw_queue
           blk_mq_run_hw_queue
           blk_mq_insert_requests
           blk_mq_flush_plug_list
           blk_flush_plug_list
           blk_finish_plug
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
      - 46.87% _raw_spin_lock_irqsave
         - 99.97% multipath_busy
              dm_mq_queue_rq
              __blk_mq_run_hw_queue
              blk_mq_run_hw_queue
              blk_mq_insert_requests
              blk_mq_flush_plug_list
              blk_flush_plug_list
              blk_finish_plug
              do_io_submit
              SyS_io_submit
              entry_SYSCALL_64_fastpath
            + io_submit
+   4.99%  fio           [kernel.kallsyms]  [k] blk_account_io_start
+   3.93%  fio           [dm_multipath]     [k] __multipath_map
+   2.64%  fio           [dm_multipath]     [k] multipath_busy
+   2.38%  fio           [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+   2.31%  fio           [dm_mod]           [k] dm_mq_queue_rq
+   2.25%  fio           [kernel.kallsyms]  [k] blk_mq_hctx_mark_pending
+   1.81%  fio           [kernel.kallsyms]  [k] blk_queue_enter
+   1.61%  perf          [kernel.kallsyms]  [k] copy_user_generic_string
+   1.40%  fio           [kernel.kallsyms]  [k] __blk_mq_run_hw_queue
+   1.26%  fio           [kernel.kallsyms]  [k] part_round_stats
+   1.14%  fio           [kernel.kallsyms]  [k] _raw_spin_lock_irq
+   0.96%  fio           [kernel.kallsyms]  [k] __bt_get
+   0.73%  fio           [kernel.kallsyms]  [k] enqueue_task_fair
+   0.71%  fio           [kernel.kallsyms]  [k] enqueue_entity
+   0.69%  fio           [dm_mod]           [k] dm_start_request
+   0.60%  ksoftirqd/6   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.59%  ksoftirqd/10  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.59%  fio           [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+   0.58%  ksoftirqd/19  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.58%  ksoftirqd/18  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.58%  ksoftirqd/23  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
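To make the cost this profile shows concrete: every submitted request funnels through the one per-target lock in the map and busy paths, no matter how many blk-mq hardware queues or CPUs there are. The following is a tiny userspace model of that serialization point only (hypothetical names; a pthread mutex standing in for the kernel spinlock, a counter standing in for the multipath state every mapper must touch), not actual dm-mpath code.

```c
/* Userspace model of the contended per-target lock (NOT kernel code). */
#include <pthread.h>

struct multipath_mock {
	pthread_mutex_t lock;	/* plays the role of m->lock */
	unsigned long nr_mapped;	/* shared state every mapper touches */
};

struct multipath_mock mp = { PTHREAD_MUTEX_INITIALIZER, 0 };

/* Each thread models a CPU driving its own blk-mq hw queue: all of
 * them still serialize on the single per-target lock on every I/O. */
void *mapper(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		pthread_mutex_lock(&mp.lock);
		mp.nr_mapped++;	/* stands in for path selection + accounting */
		pthread_mutex_unlock(&mp.lock);
	}
	return NULL;
}

/* Run nthreads (<= 16) mappers to completion; returns total I/Os mapped. */
unsigned long run_mappers(int nthreads)
{
	pthread_t t[16];

	mp.nr_mapped = 0;
	for (int i = 0; i < nthreads; i++)
		pthread_create(&t[i], NULL, mapper, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);
	return mp.nr_mapped;
}
```

The result is always correct, but adding threads (paths, CPUs, hw queues) adds no throughput at this point: the lock hold time is on every request's path, which is exactly what queued_spin_lock_slowpath dominating the profile indicates.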
On Sun, Feb 07 2016 at 11:54am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> >> If so, can you check with e.g.
> >> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >> workload triggers perhaps lock contention ? What you need to look for in
> >> the perf output is whether any functions occupy more than 10% CPU time.
> >
> > I will, thanks for the tip!
>
> The perf report is very similar to the one that started this effort..
>
> I'm afraid we'll need to resolve the per-target m->lock in order
> to scale with NUMA...

Could be. Just for testing, you can try the 2 topmost commits I've put
here (once applied both __multipath_map and multipath_busy won't have
_any_ locking.. again, very much test-only):

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
>> The perf report is very similar to the one that started this effort..
>>
>> I'm afraid we'll need to resolve the per-target m->lock in order
>> to scale with NUMA...
>
> Could be. Just for testing, you can try the 2 topmost commits I've put
> here (once applied both __multipath_map and multipath_busy won't have
> _any_ locking.. again, very much test-only):
>
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2

Hi Mike,

So I still don't see the IOPs scale like I expected. With these two
patches applied I see ~670K IOPs while the perf output is different
and does not indicate a clear lock contention.

--
-   4.67%  fio  [kernel.kallsyms]  [k] blk_account_io_start
   - blk_account_io_start
      - 56.05% blk_insert_cloned_request
           map_request
           dm_mq_queue_rq
           __blk_mq_run_hw_queue
           blk_mq_run_hw_queue
           blk_mq_insert_requests
           blk_mq_flush_plug_list
           blk_flush_plug_list
           blk_finish_plug
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
      - 43.94% blk_mq_bio_to_request
           blk_mq_make_request
           generic_make_request
           submit_bio
           do_blockdev_direct_IO
           __blockdev_direct_IO
           blkdev_direct_IO
           generic_file_read_iter
           blkdev_read_iter
           aio_run_iocb
           io_submit_one
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
-   2.52%  fio  [dm_mod]  [k] dm_mq_queue_rq
   - dm_mq_queue_rq
      - 99.16% __blk_mq_run_hw_queue
           blk_mq_run_hw_queue
           blk_mq_insert_requests
           blk_mq_flush_plug_list
           blk_flush_plug_list
           blk_finish_plug
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
      + 0.84% blk_mq_run_hw_queue
-   2.46%  fio  [kernel.kallsyms]  [k] blk_mq_hctx_mark_pending
   - blk_mq_hctx_mark_pending
      - 99.79% blk_mq_insert_requests
           blk_mq_flush_plug_list
           blk_flush_plug_list
           blk_finish_plug
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
-   2.07%  ksoftirqd/6  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
   - blk_mq_run_hw_queues
      - 99.70% rq_completed
           dm_done
           dm_softirq_done
           blk_done_softirq
         + __do_softirq
+   2.06%  ksoftirqd/0   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.02%  ksoftirqd/9   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.00%  ksoftirqd/20  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.00%  ksoftirqd/12  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.99%  ksoftirqd/11  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.97%  ksoftirqd/18  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.96%  ksoftirqd/1   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.95%  ksoftirqd/14  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.95%  ksoftirqd/13  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.94%  ksoftirqd/5   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.94%  ksoftirqd/8   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.93%  ksoftirqd/2   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/21  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/17  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/7   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.91%  ksoftirqd/23  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.84%  ksoftirqd/4   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.81%  ksoftirqd/19  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.76%  ksoftirqd/3   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.76%  ksoftirqd/16  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.75%  ksoftirqd/15  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.74%  ksoftirqd/22  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.72%  ksoftirqd/10  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.38%  perf          [kernel.kallsyms]  [k] copy_user_generic_string
+   1.20%  fio           [kernel.kallsyms]  [k] enqueue_task_fair
+   1.18%  fio           [kernel.kallsyms]  [k] part_round_stats
+   1.08%  fio           [kernel.kallsyms]  [k] enqueue_entity
+   1.07%  fio           [kernel.kallsyms]  [k] _raw_spin_lock
+   1.02%  fio           [kernel.kallsyms]  [k] __blk_mq_run_hw_queue
+   0.79%  fio           [dm_multipath]     [k] multipath_busy
+   0.57%  fio           [kernel.kallsyms]  [k] insert_work
+   0.54%  fio           [kernel.kallsyms]  [k] blk_flush_plug_list
--
On Mon, Feb 08 2016 at 7:21am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> >> The perf report is very similar to the one that started this effort..
> >>
> >> I'm afraid we'll need to resolve the per-target m->lock in order
> >> to scale with NUMA...
> >
> > Could be. Just for testing, you can try the 2 topmost commits I've put
> > here (once applied both __multipath_map and multipath_busy won't have
> > _any_ locking.. again, very much test-only):
> >
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
>
> Hi Mike,
>
> So I still don't see the IOPs scale like I expected. With these two
> patches applied I see ~670K IOPs while the perf output is different
> and does not indicate a clear lock contention.

Right, perf (with default events) isn't the right tool to track this
down. But I'm seeing something that speaks to you not running with the
first context switching fix (which seems odd):

> -   2.07%  ksoftirqd/6  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
>    - blk_mq_run_hw_queues
>       - 99.70% rq_completed
>            dm_done
>            dm_softirq_done
>            blk_done_softirq
>          + __do_softirq

As you can see here:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=a5b835282422ec41991c1dbdb88daa4af7d166d2

rq_completed() shouldn't be calling blk_mq_run_hw_queues() with the
latest code. Please triple check you have the latest code, e.g.:

git diff snitzer/devel2
On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> On Sun, Feb 07 2016 at 11:54am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>>>> If so, can you check with e.g.
>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>>>> workload triggers perhaps lock contention ? What you need to look for in
>>>> the perf output is whether any functions occupy more than 10% CPU time.
>>>
>>> I will, thanks for the tip!
>>
>> The perf report is very similar to the one that started this effort..
>>
>> I'm afraid we'll need to resolve the per-target m->lock in order
>> to scale with NUMA...
>
> Could be. Just for testing, you can try the 2 topmost commits I've put
> here (once applied both __multipath_map and multipath_busy won't have
> _any_ locking.. again, very much test-only):
>
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
>
So, I gave those patches a spin.
Sad to say, they do _not_ resolve the issue fully.

My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
those patches.
Using a single path (without those patches, but still running multipath
on top of that path) the same testbed yields 550k IOPs.
Which very much smells like a lock contention ...
We do get a slight improvement, though; without those patches I could
only get about 350k IOPs. But still, I would somehow expect 2 paths to
be faster than just one ..

Cheers,

Hannes
On Tue, Feb 09 2016 at 2:50am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
> > On Sun, Feb 07 2016 at 11:54am -0500,
> > Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> >
> >>>> If so, can you check with e.g.
> >>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
> >>>> workload triggers perhaps lock contention ? What you need to look for in
> >>>> the perf output is whether any functions occupy more than 10% CPU time.
> >>>
> >>> I will, thanks for the tip!
> >>
> >> The perf report is very similar to the one that started this effort..
> >>
> >> I'm afraid we'll need to resolve the per-target m->lock in order
> >> to scale with NUMA...
> >
> > Could be. Just for testing, you can try the 2 topmost commits I've put
> > here (once applied both __multipath_map and multipath_busy won't have
> > _any_ locking.. again, very much test-only):
> >
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
> >
> So, I gave those patches a spin.
> Sad to say, they do _not_ resolve the issue fully.
>
> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
> those patches.

That isn't a surprise. We knew the m->lock spinlock contention to be a
problem. And NUMA makes it even worse.

> Using a single path (without those patches, but still running
> multipath on top of that path) the same testbed yields 550k IOPs.
> Which very much smells like a lock contention ...
> We do get a slight improvement, though; without those patches I
> could only get about 350k IOPs. But still, I would somehow expect 2
> paths to be faster than just one ..

https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html

hint hint...
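The direction the thread keeps circling back to, getting the per-target lock out of the map fast path, usually means turning the read-mostly state (the current path) into something readers can dereference without locking, RCU-style. As a rough illustration only: the sketch below models that idea in userspace with C11 release/acquire atomics standing in for kernel RCU; the type and function names are hypothetical, not dm-mpath's actual code.

```c
/*
 * Userspace sketch (NOT kernel code) of lockless read-mostly path
 * selection: the writer publishes the current path pointer, readers
 * load it without taking any per-target lock.
 */
#include <stdatomic.h>
#include <stddef.h>

struct pgpath_mock {
	int id;
};

/* The published "current path"; in the kernel this would be an
 * RCU-protected pointer updated under m->lock on the slow path only. */
_Atomic(struct pgpath_mock *) current_pgpath;

/* Slow path (path switch / failover): publish the new path. */
void switch_pgpath(struct pgpath_mock *pg)
{
	atomic_store_explicit(&current_pgpath, pg, memory_order_release);
}

/* Fast path (per-request mapping): pick a path with no spinlock. */
struct pgpath_mock *map_fast_path(void)
{
	return atomic_load_explicit(&current_pgpath, memory_order_acquire);
}
```

The trade-off is that failover becomes the slow, synchronized path while every mapped request avoids the shared cacheline ping-pong that kills scaling across NUMA nodes.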
On 02/09/2016 03:55 PM, Mike Snitzer wrote:
> On Tue, Feb 09 2016 at 2:50am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 02/07/2016 06:20 PM, Mike Snitzer wrote:
>>> On Sun, Feb 07 2016 at 11:54am -0500,
>>> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>>>
>>>>>> If so, can you check with e.g.
>>>>>> perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
>>>>>> workload triggers perhaps lock contention ? What you need to look for in
>>>>>> the perf output is whether any functions occupy more than 10% CPU time.
>>>>>
>>>>> I will, thanks for the tip!
>>>>
>>>> The perf report is very similar to the one that started this effort..
>>>>
>>>> I'm afraid we'll need to resolve the per-target m->lock in order
>>>> to scale with NUMA...
>>>
>>> Could be. Just for testing, you can try the 2 topmost commits I've put
>>> here (once applied both __multipath_map and multipath_busy won't have
>>> _any_ locking.. again, very much test-only):
>>>
>>> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2
>>>
>> So, I gave those patches a spin.
>> Sad to say, they do _not_ resolve the issue fully.
>>
>> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
>> those patches.
>
> That isn't a surprise. We knew the m->lock spinlock contention to be a
> problem. And NUMA makes it even worse.
>
>> Using a single path (without those patches, but still running
>> multipath on top of that path) the same testbed yields 550k IOPs.
>> Which very much smells like a lock contention ...
>> We do get a slight improvement, though; without those patches I
>> could only get about 350k IOPs. But still, I would somehow expect 2
>> paths to be faster than just one ..
>
> https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html
>
> hint hint...
>
I hoped they wouldn't be needed with your patches.
Plus perf revealed that I first need to address a spinlock contention
in the lpfc driver before that even would make sense.

So more debugging to follow.

Cheers,

Hannes
On Tue, Feb 09 2016 at 10:32am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/09/2016 03:55 PM, Mike Snitzer wrote:
> > On Tue, Feb 09 2016 at 2:50am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> >
[..]
> >> My testbed (2 paths per LUN, 40 CPUs, 4 cores) yields 505k IOPs with
> >> those patches.
> >
> > That isn't a surprise.  We knew the m->lock spinlock contention to be a
> > problem.  And NUMA makes it even worse.
> >
> >> Using a single path (without those patches, but still running
> >> multipath on top of that path) the same testbed yields 550k IOPs.
> >> Which very much smells like a lock contention ...
> >> We do get a slight improvement, though; without those patches I
> >> could only get about 350k IOPs. But still, I would somehow expect 2
> >> paths to be faster than just one ..
> >
> > https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html
> >
> > hint hint...
>
> I hoped they wouldn't be needed with your patches.
> Plus perf revealed that I first need to address a spinlock
> contention in the lpfc driver before that even would make sense.
>
> So more debugging to follow.

OK, I took a crack at embracing RCU.  Only slightly better performance
on my single NUMA node testbed.  (But I'll have to track down a system
with multiple NUMA nodes to do any justice to the next wave of this
optimization effort)

This RCU work is very heavy-handed and way too fiddly (there could
easily be bugs).  Anyway, please see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745

But this might give you something to build on to arrive at something
more scalable?

Mike
On Tue, Feb 09 2016 at 7:45pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

>
> OK, I took a crack at embracing RCU.  Only slightly better performance
> on my single NUMA node testbed.  (But I'll have to track down a system
> with multiple NUMA nodes to do any justice to the next wave of this
> optimization effort)
>
> This RCU work is very heavy-handed and way too fiddly (there could
> easily be bugs).  Anyway, please see:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=d80a7e4f8b5be9c81e4d452137623b003fa64745
>
> But this might give you something to build on to arrive at something
> more scalable?

I've a bit more polished version of this work (broken up into multiple
commits, with some fixes, etc) here:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3

Hannes and/or Sagi, if you get a chance to try this on your NUMA system
please let me know how it goes.

Mike
On Wed, Feb 10 2016 at 8:50pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Feb 09 2016 at 7:45pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
[..]
> I've a bit more polished version of this work (broken up into multiple
> commits, with some fixes, etc) here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>
> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> please let me know how it goes.

FYI, with these changes my single NUMA node testbed's read IOPs went
from:

~1310K to ~1410K w/ nr_hw_queues dm-mq=4  and null_blk=4
~1330K to ~1415K w/ nr_hw_queues dm-mq=4  and null_blk=12
~1365K to ~1425K w/ nr_hw_queues dm-mq=12 and null_blk=12
On Wed, Feb 10 2016 at 8:50pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

[..]
> I've a bit more polished version of this work (broken up into multiple
> commits, with some fixes, etc) here:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>
> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
> please let me know how it goes.

Initial review has uncovered some locking problems with the current code
(nothing that caused crashes or hangs in my testing but...) so please
hold off on testing until you hear from me (hopefully tomorrow).

Thanks,
Mike
On 02/11/2016 04:34 PM, Mike Snitzer wrote:
> On Wed, Feb 10 2016 at 8:50pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
[..]
>> Hannes and/or Sagi, if you get a chance to try this on your NUMA system
>> please let me know how it goes.
>
> Initial review has uncovered some locking problems with the current code
> (nothing that caused crashes or hangs in my testing but...) so please
> hold off on testing until you hear from me (hopefully tomorrow).
>
Good news is that I've managed to hit the roof for my array with the
devel2 version of those patches. (And a _heavily_ patched-up lpfc
driver :-)
So from that perspective everything's fine now; we've reached the
hardware limit for my setup.
Which in itself is quite impressive; beating Intel P3700 with 16FC
is not bad methinks :-)

So thanks for all your work here.

Cheers,

Hannes
On Fri, Feb 12 2016 at 10:18am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/11/2016 04:34 PM, Mike Snitzer wrote:
[..]
> Good news is that I've managed to hit the roof for my array with the
> devel2 version of those patches. (And a _heavily_ patched-up lpfc
> driver :-)
> So from that perspective everything's fine now; we've reached the
> hardware limit for my setup.
> Which in itself is quite impressive; beating Intel P3700 with 16FC
> is not bad methinks :-)
>
> So thanks for all your work here.

Ah, that's really good news!  But devel2 is definitely _not_ destined
for upstream.  'devel3' is much closer to "ready".  But your testing and
review would really push it forward.

Please see/test:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3

Also, please read this header:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a

Even with devel2 I hacked it such that repeat_count > 1 is effectively
broken.  I'm now _seriously_ considering deprecating repeat_count
completely (adding a DMWARN that will inform the user, e.g.:
"repeat_count > 1 is no longer supported").  I see no point going to
great lengths to maintain a dm-mpath feature that was only a hack for
when dm-mpath was bio-based.  What do you think?

Thanks,
Mike
On 02/12/2016 04:26 PM, Mike Snitzer wrote:
> On Fri, Feb 12 2016 at 10:18am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
[..]
>> So thanks for all your work here.
>
> Ah, that's really good news!  But devel2 is definitely _not_ destined
> for upstream.  'devel3' is much closer to "ready".  But your testing and
> review would really push it forward.
>
> Please see/test:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel3
>
> Also, please read this header:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel3&id=65a01b76502dd68e8ca298ee6614c0151b677f4a
>
> Even with devel2 I hacked it such that repeat_count > 1 is effectively
> broken.  I'm now _seriously_ considering deprecating repeat_count
> completely (adding a DMWARN that will inform the user, e.g.:
> "repeat_count > 1 is no longer supported").  I see no point going to
> great lengths to maintain a dm-mpath feature that was only a hack for
> when dm-mpath was bio-based.  What do you think?
>
Drop it, and make setting it a no-op.
Never liked it anyway, and these decisions should really be
delegated to the path selector.

Cheers,

Hannes
On Fri, Feb 12 2016 at 11:04am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/12/2016 04:26 PM, Mike Snitzer wrote:
[..]
> >Even with devel2 I hacked it such that repeat_count > 1 is effectively
> >broken.  I'm now _seriously_ considering deprecating repeat_count
> >completely (adding a DMWARN that will inform the user, e.g.:
> >"repeat_count > 1 is no longer supported").  I see no point going to
> >great lengths to maintain a dm-mpath feature that was only a hack for
> >when dm-mpath was bio-based.  What do you think?
>
> Drop it, and make setting it a no-op.
> Never liked it anyway, and these decisions should really be
> delegated to the path selector.

Sure, but my point is DM-mpath will no longer be able to provide the
ability to properly handle repeat_count > 1 (because updating the
->current_pgpath crushes the write-side of the RCU).

As such I've rebased 'devel3' to impose repeat_count = 1 (both in
dm-mpath.c and the defaults in each path selector).
On 02/12/2016 07:00 PM, Mike Snitzer wrote:
> On Fri, Feb 12 2016 at 11:04am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
[..]
>> Drop it, and make setting it a no-op.
>> Never liked it anyway, and these decisions should really be
>> delegated to the path selector.
>
> Sure, but my point is DM-mpath will no longer be able to provide the
> ability to properly handle repeat_count > 1 (because updating the
> ->current_pgpath crushes the write-side of the RCU).
>
Fully understood. But as I said, the _need_ for repeat_count has
essentially vanished with the move to request-based multipath.

> As such I've rebased 'devel3' to impose repeat_count = 1 (both in
> dm-mpath.c and the defaults in each path selector).

Kewl. I'll give it a go.

Cheers,

Hannes
diff --git a/block/blk-core.c b/block/blk-core.c
index ab51685..c60e233 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2198,7 +2198,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_insert_request(rq, false, true, true);
+		blk_mq_insert_request(rq, false, true, false);
 		return 0;
 	}
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c683f6d..a618477 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1119,12 +1119,8 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
 	 * back into ->request_fn() could deadlock attempting to grab the
 	 * queue lock again.
 	 */
-	if (run_queue) {
-		if (md->queue->mq_ops)
-			blk_mq_run_hw_queues(md->queue, true);
-		else
-			blk_run_queue_async(md->queue);
-	}
+	if (!md->queue->mq_ops && run_queue)
+		blk_run_queue_async(md->queue);
 
 	/*
 	 * dm_put() must be at the end of this function.  See the comment above
@@ -1344,7 +1340,10 @@ static void dm_complete_request(struct request *rq, int error)
 	struct dm_rq_target_io *tio = tio_from_request(rq);
 
 	tio->error = error;
-	blk_complete_request(rq);
+	if (!rq->q->mq_ops)
+		blk_complete_request(rq);
+	else
+		blk_mq_complete_request(rq, rq->errors);
 }
 
 /*