Message ID | 20220309064209.4169303-1-song@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block: check more requests for multiple_queues in blk_attempt_plug_merge | expand |
On Tue, Mar 08, 2022 at 10:42:09PM -0800, Song Liu wrote: > RAID arrays check/repair operations benefit a lot from merging requests. > If we only check the previous entry for merge attempt, many merge will be > missed. As a result, significant regression is observed for RAID check > and repair. > > Fix this by checking more than just the previous entry when > plug->multiple_queues == true. But this also means really significant CPU overhead for all other workloads. > > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > 103 MB/s. What driver uses multiple queues for HDDs? Can you explain the workload submitted by a md a bit better? I wonder if we can easily do the right thing straight in the md driver.
On Wed, Mar 9, 2022 at 10:48 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Tue, Mar 08, 2022 at 10:42:09PM -0800, Song Liu wrote: > > RAID arrays check/repair operations benefit a lot from merging requests. > > If we only check the previous entry for merge attempt, many merge will be > > missed. As a result, significant regression is observed for RAID check > > and repair. > > > > Fix this by checking more than just the previous entry when > > plug->multiple_queues == true. > > But this also means really significant CPU overhead for all other > workloads. Would the following check help with these workloads? if (!plug->multiple_queues) break; > > > > > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > > 103 MB/s. > > What driver uses multiple queues for HDDs? > > Can you explain the workload submitted by a md a bit better? I wonder > if we can easily do the right thing straight in the md driver. It is the md sync_thread doing check and repair. Basically, the md thread reads all the disks and computes parity from data. Maybe we should add a new flag to struct blk_plug for this special case? Song
On Wed, Mar 9, 2022 at 11:23 PM Song Liu <song@kernel.org> wrote: > > On Wed, Mar 9, 2022 at 10:48 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Tue, Mar 08, 2022 at 10:42:09PM -0800, Song Liu wrote: > > > RAID arrays check/repair operations benefit a lot from merging requests. > > > If we only check the previous entry for merge attempt, many merge will be > > > missed. As a result, significant regression is observed for RAID check > > > and repair. > > > > > > Fix this by checking more than just the previous entry when > > > plug->multiple_queues == true. > > > > But this also means really significant CPU overhead for all other > > workloads. > > Would the following check help with these workloads? > > if (!plug->multiple_queues) > break; > > > > > > > > > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > > > 103 MB/s. > > > > What driver uses multiple queues for HDDs? > > > > Can you explain the workload submitted by a md a bit better? I wonder > > if we can easily do the right thing straight in the md driver. > > It is the md sync_thread doing check and repair. Basically, the md > thread reads all > the disks and computes parity from data. > > Maybe we should add a new flag to struct blk_plug for this special case? I meant something like: diff --git c/block/blk-core.c w/block/blk-core.c index 1039515c99d6..4fb09243e908 100644 --- c/block/blk-core.c +++ w/block/blk-core.c @@ -1303,6 +1303,12 @@ void blk_finish_plug(struct blk_plug *plug) } EXPORT_SYMBOL(blk_finish_plug); +void blk_plug_merge_aggressively(struct blk_plug *plug) +{ + plug->aggresive_merge = true; +} +EXPORT_SYMBOL(blk_plug_merge_aggressively); + void blk_io_schedule(void) { /* Prevent hang_check timer from firing at us during very long I/O */ diff --git c/block/blk-merge.c w/block/blk-merge.c index 4de34a332c9f..8b673288bc5f 100644 --- c/block/blk-merge.c +++ w/block/blk-merge.c @@ -1089,12 +1089,14 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, if (!plug || rq_list_empty(plug->mq_list)) return false; - /* check the previously added entry for a quick merge attempt */ - rq = rq_list_peek(&plug->mq_list); - if (rq->q == q) { - if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == - BIO_MERGE_OK) - return true; + rq_list_for_each(&plug->mq_list, rq) { + if (rq->q == q) { + if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == + BIO_MERGE_OK) + return true; + } + if (!plug->aggresive_merge) + break; } return false; } diff --git c/drivers/md/md.c w/drivers/md/md.c index 4d38bd7dadd6..6be56632a412 100644 --- c/drivers/md/md.c +++ w/drivers/md/md.c @@ -8901,6 +8901,7 @@ void md_do_sync(struct md_thread *thread) update_time = jiffies; blk_start_plug(&plug); + blk_plug_merge_aggressively(&plug); while (j < max_sectors) { sector_t sectors; diff --git c/drivers/md/raid1.c w/drivers/md/raid1.c index e2d8acb1e988..501d15532170 100644 --- c/drivers/md/raid1.c +++ w/drivers/md/raid1.c @@ -838,6 +838,7 @@ static void flush_pending_writes(struct r1conf *conf) */ __set_current_state(TASK_RUNNING); blk_start_plug(&plug); + blk_plug_merge_aggressively(&plug); flush_bio_list(conf, bio); blk_finish_plug(&plug); } else @@ -2591,6 +2592,7 @@ static void raid1d(struct md_thread *thread) } blk_start_plug(&plug); + blk_plug_merge_aggressively(&plug); for (;;) { flush_pending_writes(conf); diff --git c/drivers/md/raid10.c w/drivers/md/raid10.c index 2b969f70a31f..0a594613a075 100644 --- c/drivers/md/raid10.c +++ w/drivers/md/raid10.c @@ -876,6 +876,7 @@ static void flush_pending_writes(struct r10conf *conf) __set_current_state(TASK_RUNNING); blk_start_plug(&plug); + blk_plug_merge_aggressively(&plug); /* flush any pending bitmap writes to disk * before proceeding w/ I/O */ md_bitmap_unplug(conf->mddev->bitmap); @@ -3088,6 +3089,7 @@ static void raid10d(struct md_thread *thread) } blk_start_plug(&plug); + blk_plug_merge_aggressively(&plug); for (;;) { flush_pending_writes(conf); diff --git c/drivers/md/raid5.c w/drivers/md/raid5.c index ffe720c73b0a..a96884ca5f08 100644 --- c/drivers/md/raid5.c +++ w/drivers/md/raid5.c @@ -6447,6 +6447,7 @@ static void raid5_do_work(struct work_struct *work) pr_debug("+++ raid5worker active\n"); blk_start_plug(&plug); + blk_plug_merge_aggressively(&plug); handled = 0; spin_lock_irq(&conf->device_lock); while (1) { @@ -6497,6 +6498,7 @@ static void raid5d(struct md_thread *thread) md_check_recovery(mddev); blk_start_plug(&plug); + blk_plug_merge_aggressively(&plug); handled = 0; spin_lock_irq(&conf->device_lock); while (1) { diff --git c/include/linux/blkdev.h w/include/linux/blkdev.h index 16b47035e4b0..45b0da416302 100644 --- c/include/linux/blkdev.h +++ w/include/linux/blkdev.h @@ -775,6 +775,7 @@ struct blk_plug { bool multiple_queues; bool has_elevator; bool nowait; + bool aggresive_merge; struct list_head cb_list; /* md requires an unplug callback */ }; @@ -791,6 +792,7 @@ extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, extern void blk_start_plug(struct blk_plug *); extern void blk_start_plug_nr_ios(struct blk_plug *, unsigned short); extern void blk_finish_plug(struct blk_plug *); +void blk_plug_merge_aggressively(struct blk_plug *plug); void blk_flush_plug(struct blk_plug *plug, bool from_schedule);
On Thu, Mar 10, 2022 at 2:10 PM Song Liu <song@kernel.org> wrote: > > On Wed, Mar 9, 2022 at 11:23 PM Song Liu <song@kernel.org> wrote: > > > > On Wed, Mar 9, 2022 at 10:48 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > On Tue, Mar 08, 2022 at 10:42:09PM -0800, Song Liu wrote: > > > > RAID arrays check/repair operations benefit a lot from merging requests. > > > > If we only check the previous entry for merge attempt, many merge will be > > > > missed. As a result, significant regression is observed for RAID check > > > > and repair. > > > > > > > > Fix this by checking more than just the previous entry when > > > > plug->multiple_queues == true. > > > > > > But this also means really significant CPU overhead for all other > > > workloads. > > > > Would the following check help with these workloads? > > > > if (!plug->multiple_queues) > > break; > > > > > > > > > > > > > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > > > > 103 MB/s. > > > > > > What driver uses multiple queues for HDDs? > > > > > > Can you explain the workload submitted by a md a bit better? I wonder > > > if we can easily do the right thing straight in the md driver. > > > > It is the md sync_thread doing check and repair. Basically, the md > > thread reads all > > the disks and computes parity from data. > > > > Maybe we should add a new flag to struct blk_plug for this special case? > > I meant something like: > > diff --git c/block/blk-core.c w/block/blk-core.c > index 1039515c99d6..4fb09243e908 100644 > --- c/block/blk-core.c > +++ w/block/blk-core.c > @@ -1303,6 +1303,12 @@ void blk_finish_plug(struct blk_plug *plug) > } > EXPORT_SYMBOL(blk_finish_plug); > > +void blk_plug_merge_aggressively(struct blk_plug *plug) > +{ > + plug->aggresive_merge = true; > +} > +EXPORT_SYMBOL(blk_plug_merge_aggressively); > + > void blk_io_schedule(void) > { > /* Prevent hang_check timer from firing at us during very long I/O */ Missed one change: --- c/block/blk-core.c +++ w/block/blk-core.c @@ -1188,6 +1188,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios) plug->multiple_queues = false; plug->has_elevator = false; plug->nowait = false; + plug->aggresive_merge = false; INIT_LIST_HEAD(&plug->cb_list); /*
On 3/8/22 11:42 PM, Song Liu wrote: > RAID arrays check/repair operations benefit a lot from merging requests. > If we only check the previous entry for merge attempt, many merge will be > missed. As a result, significant regression is observed for RAID check > and repair. > > Fix this by checking more than just the previous entry when > plug->multiple_queues == true. > > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > 103 MB/s. Do the underlying disks not have an IO scheduler attached? Curious why the merges aren't being done there, would be trivial when the list is flushed out. Because if the perf difference is that big, then other workloads would be suffering they are that sensitive to being within a plug worth of IO. Between your two approaches, I do greatly prefer the first one though.
On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: > > On 3/8/22 11:42 PM, Song Liu wrote: > > RAID arrays check/repair operations benefit a lot from merging requests. > > If we only check the previous entry for merge attempt, many merge will be > > missed. As a result, significant regression is observed for RAID check > > and repair. > > > > Fix this by checking more than just the previous entry when > > plug->multiple_queues == true. > > > > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > > 103 MB/s. > > Do the underlying disks not have an IO scheduler attached? Curious why > the merges aren't being done there, would be trivial when the list is > flushed out. Because if the perf difference is that big, then other > workloads would be suffering they are that sensitive to being within a > plug worth of IO. The disks have mq-deadline by default. I also tried kyber, the result is the same. Raid repair work sends IOs to all the HDDs in a round-robin manner. If we only check the previous request, there isn't much opportunity for merge. I guess other workloads may have different behavior? > Between your two approaches, I do greatly prefer the first one though. I also like the first one better. But I am not sure whether it will slow down other workloads. We can probably also make the second one cleaner with a new variation of blk_start_plug. Thanks, Song
On 3/10/22 3:37 PM, Song Liu wrote: > On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> On 3/8/22 11:42 PM, Song Liu wrote: >>> RAID arrays check/repair operations benefit a lot from merging requests. >>> If we only check the previous entry for merge attempt, many merge will be >>> missed. As a result, significant regression is observed for RAID check >>> and repair. >>> >>> Fix this by checking more than just the previous entry when >>> plug->multiple_queues == true. >>> >>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to >>> 103 MB/s. >> >> Do the underlying disks not have an IO scheduler attached? Curious why >> the merges aren't being done there, would be trivial when the list is >> flushed out. Because if the perf difference is that big, then other >> workloads would be suffering they are that sensitive to being within a >> plug worth of IO. > > The disks have mq-deadline by default. I also tried kyber, the result > is the same. Raid repair work sends IOs to all the HDDs in a > round-robin manner. If we only check the previous request, there isn't > much opportunity for merge. I guess other workloads may have different > behavior? Round robin one at the time? I feel like there's something odd or suboptimal with the raid rebuild, if it's that sensitive to plug merging. Plug merging is mainly meant to reduce the overhead of merging, complement what the scheduler would do. If there's a big drop in performance just by not getting as efficient merging on the plug side, that points to an issue with something else.
On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: > > On 3/10/22 3:37 PM, Song Liu wrote: > > On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: > >> > >> On 3/8/22 11:42 PM, Song Liu wrote: > >>> RAID arrays check/repair operations benefit a lot from merging requests. > >>> If we only check the previous entry for merge attempt, many merge will be > >>> missed. As a result, significant regression is observed for RAID check > >>> and repair. > >>> > >>> Fix this by checking more than just the previous entry when > >>> plug->multiple_queues == true. > >>> > >>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > >>> 103 MB/s. > >> > >> Do the underlying disks not have an IO scheduler attached? Curious why > >> the merges aren't being done there, would be trivial when the list is > >> flushed out. Because if the perf difference is that big, then other > >> workloads would be suffering they are that sensitive to being within a > >> plug worth of IO. > > > > The disks have mq-deadline by default. I also tried kyber, the result > > is the same. Raid repair work sends IOs to all the HDDs in a > > round-robin manner. If we only check the previous request, there isn't > > much opportunity for merge. I guess other workloads may have different > > behavior? > > Round robin one at the time? I feel like there's something odd or > suboptimal with the raid rebuild, if it's that sensitive to plug > merging. It is not one request at a time, but more like (for raid456): read 4kB from HDD1, HDD2, HDD3..., then read another 4kB from HDD1, HDD2, HDD3, ... > Plug merging is mainly meant to reduce the overhead of merging, > complement what the scheduler would do. If there's a big drop in > performance just by not getting as efficient merging on the plug side, > that points to an issue with something else. We introduced blk_plug_max_rq_count() to give md more opportunities to merge at plug side, so I guess the behavior has been like this for a long time. I will take a look at the scheduler side and see whether we can just merge later, but I am not very optimistic about it. Thanks, Song
On 3/10/22 4:33 PM, Song Liu wrote: > On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> On 3/10/22 3:37 PM, Song Liu wrote: >>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: >>>> >>>> On 3/8/22 11:42 PM, Song Liu wrote: >>>>> RAID arrays check/repair operations benefit a lot from merging requests. >>>>> If we only check the previous entry for merge attempt, many merge will be >>>>> missed. As a result, significant regression is observed for RAID check >>>>> and repair. >>>>> >>>>> Fix this by checking more than just the previous entry when >>>>> plug->multiple_queues == true. >>>>> >>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to >>>>> 103 MB/s. >>>> >>>> Do the underlying disks not have an IO scheduler attached? Curious why >>>> the merges aren't being done there, would be trivial when the list is >>>> flushed out. Because if the perf difference is that big, then other >>>> workloads would be suffering they are that sensitive to being within a >>>> plug worth of IO. >>> >>> The disks have mq-deadline by default. I also tried kyber, the result >>> is the same. Raid repair work sends IOs to all the HDDs in a >>> round-robin manner. If we only check the previous request, there isn't >>> much opportunity for merge. I guess other workloads may have different >>> behavior? >> >> Round robin one at the time? I feel like there's something odd or >> suboptimal with the raid rebuild, if it's that sensitive to plug >> merging. > > It is not one request at a time, but more like (for raid456): > read 4kB from HDD1, HDD2, HDD3..., > then read another 4kB from HDD1, HDD2, HDD3, ... Ehm, that very much looks like one-at-the-time from each drive, which is pretty much the worst way to do it :-) Is there a reason for that? Why isn't it using 64k chunks or something like that? You could still do that as a kind of read-ahead, even if you're still processing in chunks of 4k. >> Plug merging is mainly meant to reduce the overhead of merging, >> complement what the scheduler would do. If there's a big drop in >> performance just by not getting as efficient merging on the plug side, >> that points to an issue with something else. > > We introduced blk_plug_max_rq_count() to give md more opportunities to > merge at plug side, so I guess the behavior has been like this for a > long time. I will take a look at the scheduler side and see whether we > can just merge later, but I am not very optimistic about it. Yeah I remember, and that also kind of felt like a work-around for some underlying issue. Maybe there's something about how the IO is issued that makes it go straight to disk and we never get any merging? Is it because they are sync reads? In any case, just doing larger reads would likely help quite a bit, but would still be nice to get to the bottom of why we're not seeing the level of merging we expect.
On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote: > > On 3/10/22 4:33 PM, Song Liu wrote: > > On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: > >> > >> On 3/10/22 3:37 PM, Song Liu wrote: > >>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>> > >>>> On 3/8/22 11:42 PM, Song Liu wrote: > >>>>> RAID arrays check/repair operations benefit a lot from merging requests. > >>>>> If we only check the previous entry for merge attempt, many merge will be > >>>>> missed. As a result, significant regression is observed for RAID check > >>>>> and repair. > >>>>> > >>>>> Fix this by checking more than just the previous entry when > >>>>> plug->multiple_queues == true. > >>>>> > >>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > >>>>> 103 MB/s. > >>>> > >>>> Do the underlying disks not have an IO scheduler attached? Curious why > >>>> the merges aren't being done there, would be trivial when the list is > >>>> flushed out. Because if the perf difference is that big, then other > >>>> workloads would be suffering they are that sensitive to being within a > >>>> plug worth of IO. > >>> > >>> The disks have mq-deadline by default. I also tried kyber, the result > >>> is the same. Raid repair work sends IOs to all the HDDs in a > >>> round-robin manner. If we only check the previous request, there isn't > >>> much opportunity for merge. I guess other workloads may have different > >>> behavior? > >> > >> Round robin one at the time? I feel like there's something odd or > >> suboptimal with the raid rebuild, if it's that sensitive to plug > >> merging. > > > > It is not one request at a time, but more like (for raid456): > > read 4kB from HDD1, HDD2, HDD3..., > > then read another 4kB from HDD1, HDD2, HDD3, ... > > Ehm, that very much looks like one-at-the-time from each drive, which is > pretty much the worst way to do it :-) > > Is there a reason for that? Why isn't it using 64k chunks or something > like that? You could still do that as a kind of read-ahead, even if > you're still processing in chunks of 4k. raid456 handles logic in the granularity of stripe. Each stripe is 4kB from every HDD in the array. AFAICT, we need some non-trivial change to enable the read ahead. > > >> Plug merging is mainly meant to reduce the overhead of merging, > >> complement what the scheduler would do. If there's a big drop in > >> performance just by not getting as efficient merging on the plug side, > >> that points to an issue with something else. > > > > We introduced blk_plug_max_rq_count() to give md more opportunities to > > merge at plug side, so I guess the behavior has been like this for a > > long time. I will take a look at the scheduler side and see whether we > > can just merge later, but I am not very optimistic about it. > > Yeah I remember, and that also kind of felt like a work-around for some > underlying issue. Maybe there's something about how the IO is issued > that makes it go straight to disk and we never get any merging? Is it > because they are sync reads? > > In any case, just doing larger reads would likely help quite a bit, but > would still be nice to get to the bottom of why we're not seeing the > level of merging we expect. Let me look more into this. Maybe we messed something up in the scheduler. Thanks, Song
On 3/10/22 5:31 PM, Song Liu wrote: > On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> On 3/10/22 4:33 PM, Song Liu wrote: >>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: >>>> >>>> On 3/10/22 3:37 PM, Song Liu wrote: >>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>> >>>>>> On 3/8/22 11:42 PM, Song Liu wrote: >>>>>>> RAID arrays check/repair operations benefit a lot from merging requests. >>>>>>> If we only check the previous entry for merge attempt, many merge will be >>>>>>> missed. As a result, significant regression is observed for RAID check >>>>>>> and repair. >>>>>>> >>>>>>> Fix this by checking more than just the previous entry when >>>>>>> plug->multiple_queues == true. >>>>>>> >>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to >>>>>>> 103 MB/s. >>>>>> >>>>>> Do the underlying disks not have an IO scheduler attached? Curious why >>>>>> the merges aren't being done there, would be trivial when the list is >>>>>> flushed out. Because if the perf difference is that big, then other >>>>>> workloads would be suffering they are that sensitive to being within a >>>>>> plug worth of IO. >>>>> >>>>> The disks have mq-deadline by default. I also tried kyber, the result >>>>> is the same. Raid repair work sends IOs to all the HDDs in a >>>>> round-robin manner. If we only check the previous request, there isn't >>>>> much opportunity for merge. I guess other workloads may have different >>>>> behavior? >>>> >>>> Round robin one at the time? I feel like there's something odd or >>>> suboptimal with the raid rebuild, if it's that sensitive to plug >>>> merging. >>> >>> It is not one request at a time, but more like (for raid456): >>> read 4kB from HDD1, HDD2, HDD3..., >>> then read another 4kB from HDD1, HDD2, HDD3, ... >> >> Ehm, that very much looks like one-at-the-time from each drive, which is >> pretty much the worst way to do it :-) >> >> Is there a reason for that? Why isn't it using 64k chunks or something >> like that? You could still do that as a kind of read-ahead, even if >> you're still processing in chunks of 4k. > > raid456 handles logic in the granularity of stripe. Each stripe is 4kB from > every HDD in the array. AFAICT, we need some non-trivial change to > enable the read ahead. Right, you'd need to stick some sort of caching in between so instead of reading 4k directly, you ask the cache for 4k and that can manage read-ahead. >>>> Plug merging is mainly meant to reduce the overhead of merging, >>>> complement what the scheduler would do. If there's a big drop in >>>> performance just by not getting as efficient merging on the plug side, >>>> that points to an issue with something else. >>> >>> We introduced blk_plug_max_rq_count() to give md more opportunities to >>> merge at plug side, so I guess the behavior has been like this for a >>> long time. I will take a look at the scheduler side and see whether we >>> can just merge later, but I am not very optimistic about it. >> >> Yeah I remember, and that also kind of felt like a work-around for some >> underlying issue. Maybe there's something about how the IO is issued >> that makes it go straight to disk and we never get any merging? Is it >> because they are sync reads? >> >> In any case, just doing larger reads would likely help quite a bit, but >> would still be nice to get to the bottom of why we're not seeing the >> level of merging we expect. > > Let me look more into this. Maybe we messed something up in the > scheduler. I'm assuming you have a plug setup for doing the reads, which is why you see the big difference (or there would be none). But blk_mq_flush_plug_list() should really take care of this when the plug is flushed, requests should be merged at that point. And from your description, doesn't sound like they are at all.
On 3/10/22 5:36 PM, Jens Axboe wrote: > I'm assuming you have a plug setup for doing the reads, which is why you > see the big difference (or there would be none). But > blk_mq_flush_plug_list() should really take care of this when the plug > is flushed, requests should be merged at that point. And from your > description, doesn't sound like they are at all. Maybe you need a list sort in blk_mq_flush_plug_list(). If you round-robin all the drives, then we'll hit that "run queue" path for each of them when we flush.
On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote: > On 3/10/22 5:31 PM, Song Liu wrote: > > On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote: > >> > >> On 3/10/22 4:33 PM, Song Liu wrote: > >>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>> > >>>> On 3/10/22 3:37 PM, Song Liu wrote: > >>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>>>> > >>>>>> On 3/8/22 11:42 PM, Song Liu wrote: > >>>>>>> RAID arrays check/repair operations benefit a lot from merging requests. > >>>>>>> If we only check the previous entry for merge attempt, many merge will be > >>>>>>> missed. As a result, significant regression is observed for RAID check > >>>>>>> and repair. > >>>>>>> > >>>>>>> Fix this by checking more than just the previous entry when > >>>>>>> plug->multiple_queues == true. > >>>>>>> > >>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > >>>>>>> 103 MB/s. > >>>>>> > >>>>>> Do the underlying disks not have an IO scheduler attached? Curious why > >>>>>> the merges aren't being done there, would be trivial when the list is > >>>>>> flushed out. Because if the perf difference is that big, then other > >>>>>> workloads would be suffering they are that sensitive to being within a > >>>>>> plug worth of IO. > >>>>> > >>>>> The disks have mq-deadline by default. I also tried kyber, the result > >>>>> is the same. Raid repair work sends IOs to all the HDDs in a > >>>>> round-robin manner. If we only check the previous request, there isn't > >>>>> much opportunity for merge. I guess other workloads may have different > >>>>> behavior? > >>>> > >>>> Round robin one at the time? I feel like there's something odd or > >>>> suboptimal with the raid rebuild, if it's that sensitive to plug > >>>> merging. > >>> > >>> It is not one request at a time, but more like (for raid456): > >>> read 4kB from HDD1, HDD2, HDD3..., > >>> then read another 4kB from HDD1, HDD2, HDD3, ... > >> > >> Ehm, that very much looks like one-at-the-time from each drive, which is > >> pretty much the worst way to do it :-) > >> > >> Is there a reason for that? Why isn't it using 64k chunks or something > >> like that? You could still do that as a kind of read-ahead, even if > >> you're still processing in chunks of 4k. > > > > raid456 handles logic in the granularity of stripe. Each stripe is 4kB from > > every HDD in the array. AFAICT, we need some non-trivial change to > > enable the read ahead. > > Right, you'd need to stick some sort of caching in between so instead of > reading 4k directly, you ask the cache for 4k and that can manage > read-ahead. > > >>>> Plug merging is mainly meant to reduce the overhead of merging, > >>>> complement what the scheduler would do. If there's a big drop in > >>>> performance just by not getting as efficient merging on the plug side, > >>>> that points to an issue with something else. > >>> > >>> We introduced blk_plug_max_rq_count() to give md more opportunities to > >>> merge at plug side, so I guess the behavior has been like this for a > >>> long time. I will take a look at the scheduler side and see whether we > >>> can just merge later, but I am not very optimistic about it. > >> > >> Yeah I remember, and that also kind of felt like a work-around for some > >> underlying issue. Maybe there's something about how the IO is issued > >> that makes it go straight to disk and we never get any merging? Is it > >> because they are sync reads? > >> > >> In any case, just doing larger reads would likely help quite a bit, but > >> would still be nice to get to the bottom of why we're not seeing the > >> level of merging we expect. > > > > Let me look more into this. Maybe we messed something up in the > > scheduler. > > I'm assuming you have a plug setup for doing the reads, which is why you > see the big difference (or there would be none). But > blk_mq_flush_plug_list() should really take care of this when the plug > is flushed, requests should be merged at that point. And from your > description, doesn't sound like they are at all. requests are shared, when running out of request, plug list will be flushed early. Maybe we can just put bios into plug list directly, then handle them all in blk_mq_flush_plug_list. thanks, Ming
On 3/10/22 6:14 PM, Ming Lei wrote: > On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote: >> On 3/10/22 5:31 PM, Song Liu wrote: >>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote: >>>> >>>> On 3/10/22 4:33 PM, Song Liu wrote: >>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>> >>>>>> On 3/10/22 3:37 PM, Song Liu wrote: >>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>>>> >>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote: >>>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests. >>>>>>>>> If we only check the previous entry for merge attempt, many merge will be >>>>>>>>> missed. As a result, significant regression is observed for RAID check >>>>>>>>> and repair. >>>>>>>>> >>>>>>>>> Fix this by checking more than just the previous entry when >>>>>>>>> plug->multiple_queues == true. >>>>>>>>> >>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to >>>>>>>>> 103 MB/s. >>>>>>>> >>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why >>>>>>>> the merges aren't being done there, would be trivial when the list is >>>>>>>> flushed out. Because if the perf difference is that big, then other >>>>>>>> workloads would be suffering they are that sensitive to being within a >>>>>>>> plug worth of IO. >>>>>>> >>>>>>> The disks have mq-deadline by default. I also tried kyber, the result >>>>>>> is the same. Raid repair work sends IOs to all the HDDs in a >>>>>>> round-robin manner. If we only check the previous request, there isn't >>>>>>> much opportunity for merge. I guess other workloads may have different >>>>>>> behavior? >>>>>> >>>>>> Round robin one at the time? I feel like there's something odd or >>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug >>>>>> merging. >>>>> >>>>> It is not one request at a time, but more like (for raid456): >>>>> read 4kB from HDD1, HDD2, HDD3..., >>>>> then read another 4kB from HDD1, HDD2, HDD3, ... >>>> >>>> Ehm, that very much looks like one-at-the-time from each drive, which is >>>> pretty much the worst way to do it :-) >>>> >>>> Is there a reason for that? Why isn't it using 64k chunks or something >>>> like that? You could still do that as a kind of read-ahead, even if >>>> you're still processing in chunks of 4k. >>> >>> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from >>> every HDD in the array. AFAICT, we need some non-trivial change to >>> enable the read ahead. >> >> Right, you'd need to stick some sort of caching in between so instead of >> reading 4k directly, you ask the cache for 4k and that can manage >> read-ahead. >> >>>>>> Plug merging is mainly meant to reduce the overhead of merging, >>>>>> complement what the scheduler would do. If there's a big drop in >>>>>> performance just by not getting as efficient merging on the plug side, >>>>>> that points to an issue with something else. >>>>> >>>>> We introduced blk_plug_max_rq_count() to give md more opportunities to >>>>> merge at plug side, so I guess the behavior has been like this for a >>>>> long time. I will take a look at the scheduler side and see whether we >>>>> can just merge later, but I am not very optimistic about it. >>>> >>>> Yeah I remember, and that also kind of felt like a work-around for some >>>> underlying issue. Maybe there's something about how the IO is issued >>>> that makes it go straight to disk and we never get any merging? Is it >>>> because they are sync reads? >>>> >>>> In any case, just doing larger reads would likely help quite a bit, but >>>> would still be nice to get to the bottom of why we're not seeing the >>>> level of merging we expect. >>> >>> Let me look more into this. Maybe we messed something up in the >>> scheduler. >> >> I'm assuming you have a plug setup for doing the reads, which is why you >> see the big difference (or there would be none). But >> blk_mq_flush_plug_list() should really take care of this when the plug >> is flushed, requests should be merged at that point. And from your >> description, doesn't sound like they are at all. > > requests are shared, when running out of request, plug list will be > flushed early. That is true, but I don't think that's the problem here with the round robin approach. Seems like it'd drive a pretty low queue depth, even considering SATA. > Maybe we can just put bios into plug list directly, then handle them > all in blk_mq_flush_plug_list. That might not be a bad idea regardless. And should be trivial now, with the plug list being singly linked anyway.
On Thu, Mar 10, 2022 at 06:21:33PM -0700, Jens Axboe wrote: > On 3/10/22 6:14 PM, Ming Lei wrote: > > On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote: > >> On 3/10/22 5:31 PM, Song Liu wrote: > >>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>> > >>>> On 3/10/22 4:33 PM, Song Liu wrote: > >>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>>>> > >>>>>> On 3/10/22 3:37 PM, Song Liu wrote: > >>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>>>>>> > >>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote: > >>>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests. > >>>>>>>>> If we only check the previous entry for merge attempt, many merge will be > >>>>>>>>> missed. As a result, significant regression is observed for RAID check > >>>>>>>>> and repair. > >>>>>>>>> > >>>>>>>>> Fix this by checking more than just the previous entry when > >>>>>>>>> plug->multiple_queues == true. > >>>>>>>>> > >>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to > >>>>>>>>> 103 MB/s. > >>>>>>>> > >>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why > >>>>>>>> the merges aren't being done there, would be trivial when the list is > >>>>>>>> flushed out. Because if the perf difference is that big, then other > >>>>>>>> workloads would be suffering they are that sensitive to being within a > >>>>>>>> plug worth of IO. > >>>>>>> > >>>>>>> The disks have mq-deadline by default. I also tried kyber, the result > >>>>>>> is the same. Raid repair work sends IOs to all the HDDs in a > >>>>>>> round-robin manner. If we only check the previous request, there isn't > >>>>>>> much opportunity for merge. I guess other workloads may have different > >>>>>>> behavior? > >>>>>> > >>>>>> Round robin one at the time? I feel like there's something odd or > >>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug > >>>>>> merging. > >>>>> > >>>>> It is not one request at a time, but more like (for raid456): > >>>>> read 4kB from HDD1, HDD2, HDD3..., > >>>>> then read another 4kB from HDD1, HDD2, HDD3, ... > >>>> > >>>> Ehm, that very much looks like one-at-the-time from each drive, which is > >>>> pretty much the worst way to do it :-) > >>>> > >>>> Is there a reason for that? Why isn't it using 64k chunks or something > >>>> like that? You could still do that as a kind of read-ahead, even if > >>>> you're still processing in chunks of 4k. > >>> > >>> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from > >>> every HDD in the array. AFAICT, we need some non-trivial change to > >>> enable the read ahead. > >> > >> Right, you'd need to stick some sort of caching in between so instead of > >> reading 4k directly, you ask the cache for 4k and that can manage > >> read-ahead. > >> > >>>>>> Plug merging is mainly meant to reduce the overhead of merging, > >>>>>> complement what the scheduler would do. If there's a big drop in > >>>>>> performance just by not getting as efficient merging on the plug side, > >>>>>> that points to an issue with something else. > >>>>> > >>>>> We introduced blk_plug_max_rq_count() to give md more opportunities to > >>>>> merge at plug side, so I guess the behavior has been like this for a > >>>>> long time. I will take a look at the scheduler side and see whether we > >>>>> can just merge later, but I am not very optimistic about it. > >>>> > >>>> Yeah I remember, and that also kind of felt like a work-around for some > >>>> underlying issue. Maybe there's something about how the IO is issued > >>>> that makes it go straight to disk and we never get any merging? Is it > >>>> because they are sync reads? > >>>> > >>>> In any case, just doing larger reads would likely help quite a bit, but > >>>> would still be nice to get to the bottom of why we're not seeing the > >>>> level of merging we expect. > >>> > >>> Let me look more into this. Maybe we messed something up in the > >>> scheduler. > >> > >> I'm assuming you have a plug setup for doing the reads, which is why you > >> see the big difference (or there would be none). But > >> blk_mq_flush_plug_list() should really take care of this when the plug > >> is flushed, requests should be merged at that point. And from your > >> description, doesn't sound like they are at all. > > > > requests are shared, when running out of request, plug list will be > > flushed early. > > That is true, but I don't think that's the problem here with the round > robin approach. Seems like it'd drive a pretty low queue depth, even > considering SATA. Another one may be plug list not sorted before inserting requests to scheduler in blk_mq_flush_plug_list(), looks you have mentioned. Thanks, Ming
On 3/10/22 6:32 PM, Ming Lei wrote: > On Thu, Mar 10, 2022 at 06:21:33PM -0700, Jens Axboe wrote: >> On 3/10/22 6:14 PM, Ming Lei wrote: >>> On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote: >>>> On 3/10/22 5:31 PM, Song Liu wrote: >>>>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>> >>>>>> On 3/10/22 4:33 PM, Song Liu wrote: >>>>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>>>> >>>>>>>> On 3/10/22 3:37 PM, Song Liu wrote: >>>>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>>>>>> >>>>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote: >>>>>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests. >>>>>>>>>>> If we only check the previous entry for merge attempt, many merge will be >>>>>>>>>>> missed. As a result, significant regression is observed for RAID check >>>>>>>>>>> and repair. >>>>>>>>>>> >>>>>>>>>>> Fix this by checking more than just the previous entry when >>>>>>>>>>> plug->multiple_queues == true. >>>>>>>>>>> >>>>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to >>>>>>>>>>> 103 MB/s. >>>>>>>>>> >>>>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why >>>>>>>>>> the merges aren't being done there, would be trivial when the list is >>>>>>>>>> flushed out. Because if the perf difference is that big, then other >>>>>>>>>> workloads would be suffering they are that sensitive to being within a >>>>>>>>>> plug worth of IO. >>>>>>>>> >>>>>>>>> The disks have mq-deadline by default. I also tried kyber, the result >>>>>>>>> is the same. Raid repair work sends IOs to all the HDDs in a >>>>>>>>> round-robin manner. If we only check the previous request, there isn't >>>>>>>>> much opportunity for merge. I guess other workloads may have different >>>>>>>>> behavior? >>>>>>>> >>>>>>>> Round robin one at the time? I feel like there's something odd or >>>>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug >>>>>>>> merging. >>>>>>> >>>>>>> It is not one request at a time, but more like (for raid456): >>>>>>> read 4kB from HDD1, HDD2, HDD3..., >>>>>>> then read another 4kB from HDD1, HDD2, HDD3, ... >>>>>> >>>>>> Ehm, that very much looks like one-at-the-time from each drive, which is >>>>>> pretty much the worst way to do it :-) >>>>>> >>>>>> Is there a reason for that? Why isn't it using 64k chunks or something >>>>>> like that? You could still do that as a kind of read-ahead, even if >>>>>> you're still processing in chunks of 4k. >>>>> >>>>> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from >>>>> every HDD in the array. AFAICT, we need some non-trivial change to >>>>> enable the read ahead. >>>> >>>> Right, you'd need to stick some sort of caching in between so instead of >>>> reading 4k directly, you ask the cache for 4k and that can manage >>>> read-ahead. >>>> >>>>>>>> Plug merging is mainly meant to reduce the overhead of merging, >>>>>>>> complement what the scheduler would do. If there's a big drop in >>>>>>>> performance just by not getting as efficient merging on the plug side, >>>>>>>> that points to an issue with something else. >>>>>>> >>>>>>> We introduced blk_plug_max_rq_count() to give md more opportunities to >>>>>>> merge at plug side, so I guess the behavior has been like this for a >>>>>>> long time. I will take a look at the scheduler side and see whether we >>>>>>> can just merge later, but I am not very optimistic about it. >>>>>> >>>>>> Yeah I remember, and that also kind of felt like a work-around for some >>>>>> underlying issue. Maybe there's something about how the IO is issued >>>>>> that makes it go straight to disk and we never get any merging? Is it >>>>>> because they are sync reads? >>>>>> >>>>>> In any case, just doing larger reads would likely help quite a bit, but >>>>>> would still be nice to get to the bottom of why we're not seeing the >>>>>> level of merging we expect. >>>>> >>>>> Let me look more into this. Maybe we messed something up in the >>>>> scheduler. >>>> >>>> I'm assuming you have a plug setup for doing the reads, which is why you >>>> see the big difference (or there would be none). But >>>> blk_mq_flush_plug_list() should really take care of this when the plug >>>> is flushed, requests should be merged at that point. And from your >>>> description, doesn't sound like they are at all. >>> >>> requests are shared, when running out of request, plug list will be >>> flushed early. >> >> That is true, but I don't think that's the problem here with the round >> robin approach. Seems like it'd drive a pretty low queue depth, even >> considering SATA. > > Another one may be plug list not sorted before inserting requests to > scheduler in blk_mq_flush_plug_list(), looks you have mentioned. Yep, it'd probably be the first thing I'd try... The way the IO is issued, it's pretty much guaranteed that the plug list will be fully interleaved with different queues and we're then issuine one-by-one and running the queue each time.
On 11/03/2022 01:35, Jens Axboe wrote: > On 3/10/22 6:32 PM, Ming Lei wrote: >> On Thu, Mar 10, 2022 at 06:21:33PM -0700, Jens Axboe wrote: >>> On 3/10/22 6:14 PM, Ming Lei wrote: >>>> On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote: >>>>> On 3/10/22 5:31 PM, Song Liu wrote: >>>>>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>>> >>>>>>> On 3/10/22 4:33 PM, Song Liu wrote: >>>>>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>>>>> >>>>>>>>> On 3/10/22 3:37 PM, Song Liu wrote: >>>>>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>>>>>>> >>>>>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote: >>>>>>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests. >>>>>>>>>>>> If we only check the previous entry for merge attempt, many merge will be >>>>>>>>>>>> missed. As a result, significant regression is observed for RAID check >>>>>>>>>>>> and repair. >>>>>>>>>>>> >>>>>>>>>>>> Fix this by checking more than just the previous entry when >>>>>>>>>>>> plug->multiple_queues == true. >>>>>>>>>>>> >>>>>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to >>>>>>>>>>>> 103 MB/s. >>>>>>>>>>> >>>>>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why >>>>>>>>>>> the merges aren't being done there, would be trivial when the list is >>>>>>>>>>> flushed out. Because if the perf difference is that big, then other >>>>>>>>>>> workloads would be suffering they are that sensitive to being within a >>>>>>>>>>> plug worth of IO. >>>>>>>>>> >>>>>>>>>> The disks have mq-deadline by default. I also tried kyber, the result >>>>>>>>>> is the same. Raid repair work sends IOs to all the HDDs in a >>>>>>>>>> round-robin manner. If we only check the previous request, there isn't >>>>>>>>>> much opportunity for merge. I guess other workloads may have different >>>>>>>>>> behavior? >>>>>>>>> >>>>>>>>> Round robin one at the time? I feel like there's something odd or >>>>>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug >>>>>>>>> merging. >>>>>>>> >>>>>>>> It is not one request at a time, but more like (for raid456): >>>>>>>> read 4kB from HDD1, HDD2, HDD3..., >>>>>>>> then read another 4kB from HDD1, HDD2, HDD3, ... >>>>>>> >>>>>>> Ehm, that very much looks like one-at-the-time from each drive, which is >>>>>>> pretty much the worst way to do it :-) >>>>>>> >>>>>>> Is there a reason for that? Why isn't it using 64k chunks or something >>>>>>> like that? You could still do that as a kind of read-ahead, even if >>>>>>> you're still processing in chunks of 4k. >>>>>> >>>>>> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from >>>>>> every HDD in the array. AFAICT, we need some non-trivial change to >>>>>> enable the read ahead. >>>>> >>>>> Right, you'd need to stick some sort of caching in between so instead of >>>>> reading 4k directly, you ask the cache for 4k and that can manage >>>>> read-ahead. >>>>> >>>>>>>>> Plug merging is mainly meant to reduce the overhead of merging, >>>>>>>>> complement what the scheduler would do. If there's a big drop in >>>>>>>>> performance just by not getting as efficient merging on the plug side, >>>>>>>>> that points to an issue with something else. >>>>>>>> >>>>>>>> We introduced blk_plug_max_rq_count() to give md more opportunities to >>>>>>>> merge at plug side, so I guess the behavior has been like this for a >>>>>>>> long time. I will take a look at the scheduler side and see whether we >>>>>>>> can just merge later, but I am not very optimistic about it. >>>>>>> >>>>>>> Yeah I remember, and that also kind of felt like a work-around for some >>>>>>> underlying issue. Maybe there's something about how the IO is issued >>>>>>> that makes it go straight to disk and we never get any merging? Is it >>>>>>> because they are sync reads? >>>>>>> >>>>>>> In any case, just doing larger reads would likely help quite a bit, but >>>>>>> would still be nice to get to the bottom of why we're not seeing the >>>>>>> level of merging we expect. >>>>>> >>>>>> Let me look more into this. Maybe we messed something up in the >>>>>> scheduler. >>>>> >>>>> I'm assuming you have a plug setup for doing the reads, which is why you >>>>> see the big difference (or there would be none). But >>>>> blk_mq_flush_plug_list() should really take care of this when the plug >>>>> is flushed, requests should be merged at that point. And from your >>>>> description, doesn't sound like they are at all. >>>> >>>> requests are shared, when running out of request, plug list will be >>>> flushed early. >>> >>> That is true, but I don't think that's the problem here with the round >>> robin approach. Seems like it'd drive a pretty low queue depth, even >>> considering SATA. >> >> Another one may be plug list not sorted before inserting requests to >> scheduler in blk_mq_flush_plug_list(), looks you have mentioned. > > Yep, it'd probably be the first thing I'd try... The way the IO is > issued, it's pretty much guaranteed that the plug list will be fully > interleaved with different queues and we're then issuine one-by-one and > running the queue each time. > Naive question, but can you make the flush flush the first one, then scan the queue for all bios for the same device, then go back and start again? Simple approach if it'll work, at the expense of scanning the queue once per device. Cheers, Wol
On Thu, 2022-03-10 at 14:37 -0800, Song Liu wrote: > On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote: > > > > On 3/8/22 11:42 PM, Song Liu wrote: > > > RAID arrays check/repair operations benefit a lot from merging > > > requests. > > > If we only check the previous entry for merge attempt, many merge > > > will be > > > missed. As a result, significant regression is observed for RAID > > > check > > > and repair. > > > > > > Fix this by checking more than just the previous entry when > > > plug->multiple_queues == true. > > > > > > This improves the check/repair speed of a 20-HDD raid6 from 19 > > > MB/s to > > > 103 MB/s. > > > > Do the underlying disks not have an IO scheduler attached? Curious > > why > > the merges aren't being done there, would be trivial when the list > > is > > flushed out. Because if the perf difference is that big, then other > > workloads would be suffering they are that sensitive to being > > within a > > plug worth of IO. > > The disks have mq-deadline by default. I also tried kyber, the result > is the > same. Raid repair work sends IOs to all the HDDs in a round-robin > manner. > If we only check the previous request, there isn't much opportunity > for > merge. I guess other workloads may have different behavior? > > > Between your two approaches, I do greatly prefer the first one > > though. > > I also like the first one better. But I am not sure whether it will > slow down > other workloads. We can probably also make the second one cleaner > with a new variation of blk_start_plug. As a matter of note and purely anecdotal: Before the raid "check" slow down/regression my system would be responsive but delayed (opening a program or opening the xface application menu or switching a file in VLC would take longer than normal, fractions of seconds to a second but slugish and notacable) and with the regression that slow down went from annoying to unbearable. The slowdowns (in programs and menus and file changes) also *seems* to get worse (in both pre & post regression) the longer the check has been running and the slower a run naturally gets (I assume as the check moves from the outer portion of the disk to the inner portion?) and the lower the KB's reported in cat /proc/mdstat/. In the post regression situation it wasn't just that the check was taking much longer and was much slower it was also that it slowed down everything else to the point that it was painful to try and use the computer as it was so much less responsive (multiple seconds for anything to load/run/swtch; even web pages). A laggy annoyance had become an actual hindrance. I have no idea why the speed of the "check" would seemingly affect the apparent responsiveness of the computer and why it would appear that the slower the check the slower the responsiveness. > > Thanks, > Song
On 3/10/22 5:07 PM, Jens Axboe wrote: > In any case, just doing larger reads would likely help quite a bit, but > would still be nice to get to the bottom of why we're not seeing the > level of merging we expect. Song, can you try this one? It'll do the dispatch in a somewhat saner fashion, bundling identical queues. And we'll keep iterating the plug list for a merge if we have multiple disks, until we've seen a queue match and checked. diff --git a/block/blk-merge.c b/block/blk-merge.c index 0e871d4e7cb8..68b623d00db5 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -1073,12 +1073,20 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, if (!plug || rq_list_empty(plug->mq_list)) return false; - /* check the previously added entry for a quick merge attempt */ - rq = rq_list_peek(&plug->mq_list); - if (rq->q == q) { - if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == - BIO_MERGE_OK) - return true; + rq_list_for_each(&plug->mq_list, rq) { + if (rq->q == q) { + if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == + BIO_MERGE_OK) + return true; + break; + } + + /* + * Only keep iterating plug list for merges if we have multiple + * queues + */ + if (!plug->multiple_queues) + break; } return false; } diff --git a/block/blk-mq.c b/block/blk-mq.c index bb263abbb40f..9c784262fd6b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2576,13 +2576,36 @@ static void __blk_mq_flush_plug_list(struct request_queue *q, q->mq_ops->queue_rqs(&plug->mq_list); } +static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched) +{ + struct blk_mq_hw_ctx *this_hctx = NULL; + struct blk_mq_ctx *this_ctx = NULL; + struct request *requeue_list = NULL; + unsigned int depth = 0; + LIST_HEAD(list); + + do { + struct request *rq = rq_list_pop(&plug->mq_list); + + if (!this_hctx) { + this_hctx = rq->mq_hctx; + this_ctx = rq->mq_ctx; + } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) { + rq_list_add(&requeue_list, rq); + continue; + } + list_add_tail(&rq->queuelist, &list); + depth++; + } while (!rq_list_empty(plug->mq_list)); + + plug->mq_list = requeue_list; + trace_block_unplug(this_hctx->queue, depth, !from_sched); + blk_mq_sched_insert_requests(this_hctx, this_ctx, &list, from_sched); +} + void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) { - struct blk_mq_hw_ctx *this_hctx; - struct blk_mq_ctx *this_ctx; struct request *rq; - unsigned int depth; - LIST_HEAD(list); if (rq_list_empty(plug->mq_list)) return; @@ -2618,35 +2641,9 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) return; } - this_hctx = NULL; - this_ctx = NULL; - depth = 0; do { - rq = rq_list_pop(&plug->mq_list); - - if (!this_hctx) { - this_hctx = rq->mq_hctx; - this_ctx = rq->mq_ctx; - } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) { - trace_block_unplug(this_hctx->queue, depth, - !from_schedule); - blk_mq_sched_insert_requests(this_hctx, this_ctx, - &list, from_schedule); - depth = 0; - this_hctx = rq->mq_hctx; - this_ctx = rq->mq_ctx; - - } - - list_add(&rq->queuelist, &list); - depth++; + blk_mq_dispatch_plug_list(plug, from_schedule); } while (!rq_list_empty(plug->mq_list)); - - if (!list_empty(&list)) { - trace_block_unplug(this_hctx->queue, depth, !from_schedule); - blk_mq_sched_insert_requests(this_hctx, this_ctx, &list, - from_schedule); - } } void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
On 11/03/2022 11:30, Wilson Jonathan wrote: > I have no idea why the speed of the "check" would seemingly affect the > apparent responsiveness of the computer and why it would appear that > the slower the check the slower the responsiveness. The buffers were filling up, nicking your free ram, and you were having to wait longer and longer before they could be flushed to swap your applications back in? Cheers, Wol
Hi Jens, On Fri, Mar 11, 2022 at 6:16 AM Jens Axboe <axboe@kernel.dk> wrote: > > On 3/10/22 5:07 PM, Jens Axboe wrote: > > In any case, just doing larger reads would likely help quite a bit, but > > would still be nice to get to the bottom of why we're not seeing the > > level of merging we expect. > > Song, can you try this one? It'll do the dispatch in a somewhat saner > fashion, bundling identical queues. And we'll keep iterating the plug > list for a merge if we have multiple disks, until we've seen a queue > match and checked. This one works great! We are seeing 99% read request merge and 500kB+ average read size. The original patch in this thread only got 88% and 34kB for these two metrics. Thanks, Song [...]
Dear Song, Am 11.03.22 um 17:59 schrieb Song Liu: > On Fri, Mar 11, 2022 at 6:16 AM Jens Axboe <axboe@kernel.dk> wrote: >> >> On 3/10/22 5:07 PM, Jens Axboe wrote: >>> In any case, just doing larger reads would likely help quite a bit, but >>> would still be nice to get to the bottom of why we're not seeing the >>> level of merging we expect. >> >> Song, can you try this one? It'll do the dispatch in a somewhat saner >> fashion, bundling identical queues. And we'll keep iterating the plug >> list for a merge if we have multiple disks, until we've seen a queue >> match and checked. > > This one works great! We are seeing 99% read request merge and > 500kB+ average read size. The original patch in this thread only got > 88% and 34kB for these two metrics. Nice. I am curious, how these metrics can be obtained? […] Kind regards, Paul
On Fri, Mar 11, 2022 at 1:42 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > > Dear Song, > > > Am 11.03.22 um 17:59 schrieb Song Liu: > > > On Fri, Mar 11, 2022 at 6:16 AM Jens Axboe <axboe@kernel.dk> wrote: > >> > >> On 3/10/22 5:07 PM, Jens Axboe wrote: > >>> In any case, just doing larger reads would likely help quite a bit, but > >>> would still be nice to get to the bottom of why we're not seeing the > >>> level of merging we expect. > >> > >> Song, can you try this one? It'll do the dispatch in a somewhat saner > >> fashion, bundling identical queues. And we'll keep iterating the plug > >> list for a merge if we have multiple disks, until we've seen a queue > >> match and checked. > > > > This one works great! We are seeing 99% read request merge and > > 500kB+ average read size. The original patch in this thread only got > > 88% and 34kB for these two metrics. > > Nice. I am curious, how these metrics can be obtained? > We can use tools as iostat: iostat -mx 2 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sdb 3176.50 1.00 100.57 0.00 22503.00 0.00 87.63 0.00 10.22 3.50 32.46 32.42 4.00 0.24 76.60 sdi 3167.00 1.00 100.57 0.00 22512.50 0.00 87.67 0.00 11.58 4.00 36.68 32.52 4.00 0.24 77.55 The two metrics we used here are %rrqm and rareq-sz. Thanks, Song
diff --git a/block/blk-merge.c b/block/blk-merge.c index 4de34a332c9f..57e2075fb2f4 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -1089,12 +1089,14 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, if (!plug || rq_list_empty(plug->mq_list)) return false; - /* check the previously added entry for a quick merge attempt */ - rq = rq_list_peek(&plug->mq_list); - if (rq->q == q) { - if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == - BIO_MERGE_OK) - return true; + rq_list_for_each(&plug->mq_list, rq) { + if (rq->q == q) { + if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == + BIO_MERGE_OK) + return true; + } + if (!plug->multiple_queues) + break; } return false; }
RAID arrays check/repair operations benefit a lot from merging requests. If we only check the previous entry for merge attempt, many merge will be missed. As a result, significant regression is observed for RAID check and repair. Fix this by checking more than just the previous entry when plug->multiple_queues == true. This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to 103 MB/s. Fixes: d38a9c04c0d5 ("block: only check previous entry for plug merge attempt") Cc: stable@vger.kernel.org # v5.16 Reported-by: Larkin Lowrey <llowrey@nuclearwinter.com> Reported-by: Wilson Jonathan <i400sjon@gmail.com> Reported-by: Roger Heflin <rogerheflin@gmail.com> Signed-off-by: Song Liu <song@kernel.org> --- block/blk-merge.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)