diff mbox series

block: check more requests for multiple_queues in blk_attempt_plug_merge

Message ID 20220309064209.4169303-1-song@kernel.org (mailing list archive)
State New, archived
Headers show
Series block: check more requests for multiple_queues in blk_attempt_plug_merge | expand

Commit Message

Song Liu March 9, 2022, 6:42 a.m. UTC
RAID arrays check/repair operations benefit a lot from merging requests.
If we only check the previous entry for merge attempt, many merge will be
missed. As a result, significant regression is observed for RAID check
and repair.

Fix this by checking more than just the previous entry when
plug->multiple_queues == true.

This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
103 MB/s.

Fixes: d38a9c04c0d5 ("block: only check previous entry for plug merge attempt")
Cc: stable@vger.kernel.org # v5.16
Reported-by: Larkin Lowrey <llowrey@nuclearwinter.com>
Reported-by: Wilson Jonathan <i400sjon@gmail.com>
Reported-by: Roger Heflin <rogerheflin@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 block/blk-merge.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

Comments

Christoph Hellwig March 10, 2022, 6:48 a.m. UTC | #1
On Tue, Mar 08, 2022 at 10:42:09PM -0800, Song Liu wrote:
> RAID arrays check/repair operations benefit a lot from merging requests.
> If we only check the previous entry for merge attempt, many merge will be
> missed. As a result, significant regression is observed for RAID check
> and repair.
> 
> Fix this by checking more than just the previous entry when
> plug->multiple_queues == true.

But this also means really significant CPU overhead for all other
workloads.

> 
> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> 103 MB/s.

What driver uses multiple queues for HDDs?

Can you explain the workload submitted by a md a bit better?  I wonder
if we can easily do the right thing straight in the md driver.
Song Liu March 10, 2022, 7:23 a.m. UTC | #2
On Wed, Mar 9, 2022 at 10:48 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Mar 08, 2022 at 10:42:09PM -0800, Song Liu wrote:
> > RAID arrays check/repair operations benefit a lot from merging requests.
> > If we only check the previous entry for merge attempt, many merge will be
> > missed. As a result, significant regression is observed for RAID check
> > and repair.
> >
> > Fix this by checking more than just the previous entry when
> > plug->multiple_queues == true.
>
> But this also means really significant CPU overhead for all other
> workloads.

Would the following check help with these workloads?

 if (!plug->multiple_queues)
              break;

>
> >
> > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> > 103 MB/s.
>
> What driver uses multiple queues for HDDs?
>
> Can you explain the workload submitted by a md a bit better?  I wonder
> if we can easily do the right thing straight in the md driver.

It is the md sync_thread doing check and repair. Basically, the md
thread reads all
the disks and computes parity from data.

Maybe we should add a new flag to struct blk_plug for this special case?

Song
Song Liu March 10, 2022, 10:10 p.m. UTC | #3
On Wed, Mar 9, 2022 at 11:23 PM Song Liu <song@kernel.org> wrote:
>
> On Wed, Mar 9, 2022 at 10:48 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Tue, Mar 08, 2022 at 10:42:09PM -0800, Song Liu wrote:
> > > RAID arrays check/repair operations benefit a lot from merging requests.
> > > If we only check the previous entry for merge attempt, many merge will be
> > > missed. As a result, significant regression is observed for RAID check
> > > and repair.
> > >
> > > Fix this by checking more than just the previous entry when
> > > plug->multiple_queues == true.
> >
> > But this also means really significant CPU overhead for all other
> > workloads.
>
> Would the following check help with these workloads?
>
>  if (!plug->multiple_queues)
>               break;
>
> >
> > >
> > > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> > > 103 MB/s.
> >
> > What driver uses multiple queues for HDDs?
> >
> > Can you explain the workload submitted by a md a bit better?  I wonder
> > if we can easily do the right thing straight in the md driver.
>
> It is the md sync_thread doing check and repair. Basically, the md
> thread reads all
> the disks and computes parity from data.
>
> Maybe we should add a new flag to struct blk_plug for this special case?

I meant something like:

diff --git c/block/blk-core.c w/block/blk-core.c
index 1039515c99d6..4fb09243e908 100644
--- c/block/blk-core.c
+++ w/block/blk-core.c
@@ -1303,6 +1303,12 @@ void blk_finish_plug(struct blk_plug *plug)
 }
 EXPORT_SYMBOL(blk_finish_plug);

+void blk_plug_merge_aggressively(struct blk_plug *plug)
+{
+    plug->aggresive_merge = true;
+}
+EXPORT_SYMBOL(blk_plug_merge_aggressively);
+
 void blk_io_schedule(void)
 {
     /* Prevent hang_check timer from firing at us during very long I/O */
diff --git c/block/blk-merge.c w/block/blk-merge.c
index 4de34a332c9f..8b673288bc5f 100644
--- c/block/blk-merge.c
+++ w/block/blk-merge.c
@@ -1089,12 +1089,14 @@ bool blk_attempt_plug_merge(struct
request_queue *q, struct bio *bio,
     if (!plug || rq_list_empty(plug->mq_list))
         return false;

-    /* check the previously added entry for a quick merge attempt */
-    rq = rq_list_peek(&plug->mq_list);
-    if (rq->q == q) {
-        if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
-                BIO_MERGE_OK)
-            return true;
+    rq_list_for_each(&plug->mq_list, rq) {
+        if (rq->q == q) {
+            if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+                BIO_MERGE_OK)
+                return true;
+        }
+        if (!plug->aggresive_merge)
+            break;
     }
     return false;
 }
diff --git c/drivers/md/md.c w/drivers/md/md.c
index 4d38bd7dadd6..6be56632a412 100644
--- c/drivers/md/md.c
+++ w/drivers/md/md.c
@@ -8901,6 +8901,7 @@ void md_do_sync(struct md_thread *thread)
     update_time = jiffies;

     blk_start_plug(&plug);
+    blk_plug_merge_aggressively(&plug);
     while (j < max_sectors) {
         sector_t sectors;

diff --git c/drivers/md/raid1.c w/drivers/md/raid1.c
index e2d8acb1e988..501d15532170 100644
--- c/drivers/md/raid1.c
+++ w/drivers/md/raid1.c
@@ -838,6 +838,7 @@ static void flush_pending_writes(struct r1conf *conf)
          */
         __set_current_state(TASK_RUNNING);
         blk_start_plug(&plug);
+        blk_plug_merge_aggressively(&plug);
         flush_bio_list(conf, bio);
         blk_finish_plug(&plug);
     } else
@@ -2591,6 +2592,7 @@ static void raid1d(struct md_thread *thread)
     }

     blk_start_plug(&plug);
+    blk_plug_merge_aggressively(&plug);
     for (;;) {

         flush_pending_writes(conf);
diff --git c/drivers/md/raid10.c w/drivers/md/raid10.c
index 2b969f70a31f..0a594613a075 100644
--- c/drivers/md/raid10.c
+++ w/drivers/md/raid10.c
@@ -876,6 +876,7 @@ static void flush_pending_writes(struct r10conf *conf)
         __set_current_state(TASK_RUNNING);

         blk_start_plug(&plug);
+        blk_plug_merge_aggressively(&plug);
         /* flush any pending bitmap writes to disk
          * before proceeding w/ I/O */
         md_bitmap_unplug(conf->mddev->bitmap);
@@ -3088,6 +3089,7 @@ static void raid10d(struct md_thread *thread)
     }

     blk_start_plug(&plug);
+    blk_plug_merge_aggressively(&plug);
     for (;;) {

         flush_pending_writes(conf);
diff --git c/drivers/md/raid5.c w/drivers/md/raid5.c
index ffe720c73b0a..a96884ca5f08 100644
--- c/drivers/md/raid5.c
+++ w/drivers/md/raid5.c
@@ -6447,6 +6447,7 @@ static void raid5_do_work(struct work_struct *work)
     pr_debug("+++ raid5worker active\n");

     blk_start_plug(&plug);
+    blk_plug_merge_aggressively(&plug);
     handled = 0;
     spin_lock_irq(&conf->device_lock);
     while (1) {
@@ -6497,6 +6498,7 @@ static void raid5d(struct md_thread *thread)
     md_check_recovery(mddev);

     blk_start_plug(&plug);
+    blk_plug_merge_aggressively(&plug);
     handled = 0;
     spin_lock_irq(&conf->device_lock);
     while (1) {
diff --git c/include/linux/blkdev.h w/include/linux/blkdev.h
index 16b47035e4b0..45b0da416302 100644
--- c/include/linux/blkdev.h
+++ w/include/linux/blkdev.h
@@ -775,6 +775,7 @@ struct blk_plug {
     bool multiple_queues;
     bool has_elevator;
     bool nowait;
+    bool aggresive_merge;

     struct list_head cb_list; /* md requires an unplug callback */
 };
@@ -791,6 +792,7 @@ extern struct blk_plug_cb
*blk_check_plugged(blk_plug_cb_fn unplug,
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_start_plug_nr_ios(struct blk_plug *, unsigned short);
 extern void blk_finish_plug(struct blk_plug *);
+void blk_plug_merge_aggressively(struct blk_plug *plug);

 void blk_flush_plug(struct blk_plug *plug, bool from_schedule);
Song Liu March 10, 2022, 10:12 p.m. UTC | #4
On Thu, Mar 10, 2022 at 2:10 PM Song Liu <song@kernel.org> wrote:
>
> On Wed, Mar 9, 2022 at 11:23 PM Song Liu <song@kernel.org> wrote:
> >
> > On Wed, Mar 9, 2022 at 10:48 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Tue, Mar 08, 2022 at 10:42:09PM -0800, Song Liu wrote:
> > > > RAID arrays check/repair operations benefit a lot from merging requests.
> > > > If we only check the previous entry for merge attempt, many merge will be
> > > > missed. As a result, significant regression is observed for RAID check
> > > > and repair.
> > > >
> > > > Fix this by checking more than just the previous entry when
> > > > plug->multiple_queues == true.
> > >
> > > But this also means really significant CPU overhead for all other
> > > workloads.
> >
> > Would the following check help with these workloads?
> >
> >  if (!plug->multiple_queues)
> >               break;
> >
> > >
> > > >
> > > > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> > > > 103 MB/s.
> > >
> > > What driver uses multiple queues for HDDs?
> > >
> > > Can you explain the workload submitted by a md a bit better?  I wonder
> > > if we can easily do the right thing straight in the md driver.
> >
> > It is the md sync_thread doing check and repair. Basically, the md
> > thread reads all
> > the disks and computes parity from data.
> >
> > Maybe we should add a new flag to struct blk_plug for this special case?
>
> I meant something like:
>
> diff --git c/block/blk-core.c w/block/blk-core.c
> index 1039515c99d6..4fb09243e908 100644
> --- c/block/blk-core.c
> +++ w/block/blk-core.c
> @@ -1303,6 +1303,12 @@ void blk_finish_plug(struct blk_plug *plug)
>  }
>  EXPORT_SYMBOL(blk_finish_plug);
>
> +void blk_plug_merge_aggressively(struct blk_plug *plug)
> +{
> +    plug->aggresive_merge = true;
> +}
> +EXPORT_SYMBOL(blk_plug_merge_aggressively);
> +
>  void blk_io_schedule(void)
>  {
>      /* Prevent hang_check timer from firing at us during very long I/O */

Missed one change:

--- c/block/blk-core.c
+++ w/block/blk-core.c
@@ -1188,6 +1188,7 @@ void blk_start_plug_nr_ios(struct blk_plug
*plug, unsigned short nr_ios)
     plug->multiple_queues = false;
     plug->has_elevator = false;
     plug->nowait = false;
+    plug->aggresive_merge = false;
     INIT_LIST_HEAD(&plug->cb_list);

     /*
Jens Axboe March 10, 2022, 10:15 p.m. UTC | #5
On 3/8/22 11:42 PM, Song Liu wrote:
> RAID arrays check/repair operations benefit a lot from merging requests.
> If we only check the previous entry for merge attempt, many merge will be
> missed. As a result, significant regression is observed for RAID check
> and repair.
> 
> Fix this by checking more than just the previous entry when
> plug->multiple_queues == true.
> 
> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> 103 MB/s.

Do the underlying disks not have an IO scheduler attached? Curious why
the merges aren't being done there, would be trivial when the list is
flushed out. Because if the perf difference is that big, then other
workloads would be suffering they are that sensitive to being within a
plug worth of IO.

Between your two approaches, I do greatly prefer the first one though.
Song Liu March 10, 2022, 10:37 p.m. UTC | #6
On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/8/22 11:42 PM, Song Liu wrote:
> > RAID arrays check/repair operations benefit a lot from merging requests.
> > If we only check the previous entry for merge attempt, many merge will be
> > missed. As a result, significant regression is observed for RAID check
> > and repair.
> >
> > Fix this by checking more than just the previous entry when
> > plug->multiple_queues == true.
> >
> > This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> > 103 MB/s.
>
> Do the underlying disks not have an IO scheduler attached? Curious why
> the merges aren't being done there, would be trivial when the list is
> flushed out. Because if the perf difference is that big, then other
> workloads would be suffering they are that sensitive to being within a
> plug worth of IO.

The disks have mq-deadline by default. I also tried kyber, the result is the
same. Raid repair work sends IOs to all the HDDs in a round-robin manner.
If we only check the previous request, there isn't much opportunity for
merge. I guess other workloads may have different behavior?

> Between your two approaches, I do greatly prefer the first one though.

I also like the first one better. But I am not sure whether it will slow down
other workloads. We can probably also make the second one cleaner
with a new variation of blk_start_plug.

Thanks,
Song
Jens Axboe March 10, 2022, 11:02 p.m. UTC | #7
On 3/10/22 3:37 PM, Song Liu wrote:
> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/8/22 11:42 PM, Song Liu wrote:
>>> RAID arrays check/repair operations benefit a lot from merging requests.
>>> If we only check the previous entry for merge attempt, many merge will be
>>> missed. As a result, significant regression is observed for RAID check
>>> and repair.
>>>
>>> Fix this by checking more than just the previous entry when
>>> plug->multiple_queues == true.
>>>
>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
>>> 103 MB/s.
>>
>> Do the underlying disks not have an IO scheduler attached? Curious why
>> the merges aren't being done there, would be trivial when the list is
>> flushed out. Because if the perf difference is that big, then other
>> workloads would be suffering they are that sensitive to being within a
>> plug worth of IO.
> 
> The disks have mq-deadline by default. I also tried kyber, the result
> is the same. Raid repair work sends IOs to all the HDDs in a
> round-robin manner. If we only check the previous request, there isn't
> much opportunity for merge. I guess other workloads may have different
> behavior?

Round robin one at the time? I feel like there's something odd or
suboptimal with the raid rebuild, if it's that sensitive to plug
merging. Plug merging is mainly meant to reduce the overhead of merging,
complement what the scheduler would do. If there's a big drop in
performance just by not getting as efficient merging on the plug side,
that points to an issue with something else.
Song Liu March 10, 2022, 11:33 p.m. UTC | #8
On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/10/22 3:37 PM, Song Liu wrote:
> > On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 3/8/22 11:42 PM, Song Liu wrote:
> >>> RAID arrays check/repair operations benefit a lot from merging requests.
> >>> If we only check the previous entry for merge attempt, many merge will be
> >>> missed. As a result, significant regression is observed for RAID check
> >>> and repair.
> >>>
> >>> Fix this by checking more than just the previous entry when
> >>> plug->multiple_queues == true.
> >>>
> >>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> >>> 103 MB/s.
> >>
> >> Do the underlying disks not have an IO scheduler attached? Curious why
> >> the merges aren't being done there, would be trivial when the list is
> >> flushed out. Because if the perf difference is that big, then other
> >> workloads would be suffering they are that sensitive to being within a
> >> plug worth of IO.
> >
> > The disks have mq-deadline by default. I also tried kyber, the result
> > is the same. Raid repair work sends IOs to all the HDDs in a
> > round-robin manner. If we only check the previous request, there isn't
> > much opportunity for merge. I guess other workloads may have different
> > behavior?
>
> Round robin one at the time? I feel like there's something odd or
> suboptimal with the raid rebuild, if it's that sensitive to plug
> merging.

It is not one request at a time, but more like (for raid456):
   read 4kB from HDD1, HDD2, HDD3...,
   then read another 4kB from HDD1, HDD2, HDD3, ...

> Plug merging is mainly meant to reduce the overhead of merging,
> complement what the scheduler would do. If there's a big drop in
> performance just by not getting as efficient merging on the plug side,
> that points to an issue with something else.

We introduced blk_plug_max_rq_count() to give md more opportunities
to merge at plug side, so I guess the behavior has been like this for a
long time. I will take a look at the scheduler side and see whether we
can just merge later, but I am not very optimistic about it.

Thanks,
Song
Jens Axboe March 11, 2022, 12:07 a.m. UTC | #9
On 3/10/22 4:33 PM, Song Liu wrote:
> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/10/22 3:37 PM, Song Liu wrote:
>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 3/8/22 11:42 PM, Song Liu wrote:
>>>>> RAID arrays check/repair operations benefit a lot from merging requests.
>>>>> If we only check the previous entry for merge attempt, many merge will be
>>>>> missed. As a result, significant regression is observed for RAID check
>>>>> and repair.
>>>>>
>>>>> Fix this by checking more than just the previous entry when
>>>>> plug->multiple_queues == true.
>>>>>
>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
>>>>> 103 MB/s.
>>>>
>>>> Do the underlying disks not have an IO scheduler attached? Curious why
>>>> the merges aren't being done there, would be trivial when the list is
>>>> flushed out. Because if the perf difference is that big, then other
>>>> workloads would be suffering they are that sensitive to being within a
>>>> plug worth of IO.
>>>
>>> The disks have mq-deadline by default. I also tried kyber, the result
>>> is the same. Raid repair work sends IOs to all the HDDs in a
>>> round-robin manner. If we only check the previous request, there isn't
>>> much opportunity for merge. I guess other workloads may have different
>>> behavior?
>>
>> Round robin one at the time? I feel like there's something odd or
>> suboptimal with the raid rebuild, if it's that sensitive to plug
>> merging.
> 
> It is not one request at a time, but more like (for raid456):
>    read 4kB from HDD1, HDD2, HDD3...,
>    then read another 4kB from HDD1, HDD2, HDD3, ...

Ehm, that very much looks like one-at-the-time from each drive, which is
pretty much the worst way to do it :-)

Is there a reason for that? Why isn't it using 64k chunks or something
like that? You could still do that as a kind of read-ahead, even if
you're still processing in chunks of 4k.

>> Plug merging is mainly meant to reduce the overhead of merging,
>> complement what the scheduler would do. If there's a big drop in
>> performance just by not getting as efficient merging on the plug side,
>> that points to an issue with something else.
> 
> We introduced blk_plug_max_rq_count() to give md more opportunities to
> merge at plug side, so I guess the behavior has been like this for a
> long time. I will take a look at the scheduler side and see whether we
> can just merge later, but I am not very optimistic about it.

Yeah I remember, and that also kind of felt like a work-around for some
underlying issue. Maybe there's something about how the IO is issued
that makes it go straight to disk and we never get any merging? Is it
because they are sync reads?

In any case, just doing larger reads would likely help quite a bit, but
would still be nice to get to the bottom of why we're not seeing the
level of merging we expect.
Song Liu March 11, 2022, 12:31 a.m. UTC | #10
On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/10/22 4:33 PM, Song Liu wrote:
> > On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 3/10/22 3:37 PM, Song Liu wrote:
> >>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 3/8/22 11:42 PM, Song Liu wrote:
> >>>>> RAID arrays check/repair operations benefit a lot from merging requests.
> >>>>> If we only check the previous entry for merge attempt, many merge will be
> >>>>> missed. As a result, significant regression is observed for RAID check
> >>>>> and repair.
> >>>>>
> >>>>> Fix this by checking more than just the previous entry when
> >>>>> plug->multiple_queues == true.
> >>>>>
> >>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> >>>>> 103 MB/s.
> >>>>
> >>>> Do the underlying disks not have an IO scheduler attached? Curious why
> >>>> the merges aren't being done there, would be trivial when the list is
> >>>> flushed out. Because if the perf difference is that big, then other
> >>>> workloads would be suffering they are that sensitive to being within a
> >>>> plug worth of IO.
> >>>
> >>> The disks have mq-deadline by default. I also tried kyber, the result
> >>> is the same. Raid repair work sends IOs to all the HDDs in a
> >>> round-robin manner. If we only check the previous request, there isn't
> >>> much opportunity for merge. I guess other workloads may have different
> >>> behavior?
> >>
> >> Round robin one at the time? I feel like there's something odd or
> >> suboptimal with the raid rebuild, if it's that sensitive to plug
> >> merging.
> >
> > It is not one request at a time, but more like (for raid456):
> >    read 4kB from HDD1, HDD2, HDD3...,
> >    then read another 4kB from HDD1, HDD2, HDD3, ...
>
> Ehm, that very much looks like one-at-the-time from each drive, which is
> pretty much the worst way to do it :-)
>
> Is there a reason for that? Why isn't it using 64k chunks or something
> like that? You could still do that as a kind of read-ahead, even if
> you're still processing in chunks of 4k.

raid456 handles logic in the granularity of stripe. Each stripe is 4kB from
every HDD in the array. AFAICT, we need some non-trivial change to
enable the read ahead.

>
> >> Plug merging is mainly meant to reduce the overhead of merging,
> >> complement what the scheduler would do. If there's a big drop in
> >> performance just by not getting as efficient merging on the plug side,
> >> that points to an issue with something else.
> >
> > We introduced blk_plug_max_rq_count() to give md more opportunities to
> > merge at plug side, so I guess the behavior has been like this for a
> > long time. I will take a look at the scheduler side and see whether we
> > can just merge later, but I am not very optimistic about it.
>
> Yeah I remember, and that also kind of felt like a work-around for some
> underlying issue. Maybe there's something about how the IO is issued
> that makes it go straight to disk and we never get any merging? Is it
> because they are sync reads?
>
> In any case, just doing larger reads would likely help quite a bit, but
> would still be nice to get to the bottom of why we're not seeing the
> level of merging we expect.

Let me look more into this. Maybe we messed something up in the
scheduler.

Thanks,
Song
Jens Axboe March 11, 2022, 12:36 a.m. UTC | #11
On 3/10/22 5:31 PM, Song Liu wrote:
> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/10/22 4:33 PM, Song Liu wrote:
>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 3/10/22 3:37 PM, Song Liu wrote:
>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 3/8/22 11:42 PM, Song Liu wrote:
>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests.
>>>>>>> If we only check the previous entry for merge attempt, many merge will be
>>>>>>> missed. As a result, significant regression is observed for RAID check
>>>>>>> and repair.
>>>>>>>
>>>>>>> Fix this by checking more than just the previous entry when
>>>>>>> plug->multiple_queues == true.
>>>>>>>
>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
>>>>>>> 103 MB/s.
>>>>>>
>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why
>>>>>> the merges aren't being done there, would be trivial when the list is
>>>>>> flushed out. Because if the perf difference is that big, then other
>>>>>> workloads would be suffering they are that sensitive to being within a
>>>>>> plug worth of IO.
>>>>>
>>>>> The disks have mq-deadline by default. I also tried kyber, the result
>>>>> is the same. Raid repair work sends IOs to all the HDDs in a
>>>>> round-robin manner. If we only check the previous request, there isn't
>>>>> much opportunity for merge. I guess other workloads may have different
>>>>> behavior?
>>>>
>>>> Round robin one at the time? I feel like there's something odd or
>>>> suboptimal with the raid rebuild, if it's that sensitive to plug
>>>> merging.
>>>
>>> It is not one request at a time, but more like (for raid456):
>>>    read 4kB from HDD1, HDD2, HDD3...,
>>>    then read another 4kB from HDD1, HDD2, HDD3, ...
>>
>> Ehm, that very much looks like one-at-the-time from each drive, which is
>> pretty much the worst way to do it :-)
>>
>> Is there a reason for that? Why isn't it using 64k chunks or something
>> like that? You could still do that as a kind of read-ahead, even if
>> you're still processing in chunks of 4k.
> 
> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from
> every HDD in the array. AFAICT, we need some non-trivial change to
> enable the read ahead.

Right, you'd need to stick some sort of caching in between so instead of
reading 4k directly, you ask the cache for 4k and that can manage
read-ahead.

>>>> Plug merging is mainly meant to reduce the overhead of merging,
>>>> complement what the scheduler would do. If there's a big drop in
>>>> performance just by not getting as efficient merging on the plug side,
>>>> that points to an issue with something else.
>>>
>>> We introduced blk_plug_max_rq_count() to give md more opportunities to
>>> merge at plug side, so I guess the behavior has been like this for a
>>> long time. I will take a look at the scheduler side and see whether we
>>> can just merge later, but I am not very optimistic about it.
>>
>> Yeah I remember, and that also kind of felt like a work-around for some
>> underlying issue. Maybe there's something about how the IO is issued
>> that makes it go straight to disk and we never get any merging? Is it
>> because they are sync reads?
>>
>> In any case, just doing larger reads would likely help quite a bit, but
>> would still be nice to get to the bottom of why we're not seeing the
>> level of merging we expect.
> 
> Let me look more into this. Maybe we messed something up in the
> scheduler.

I'm assuming you have a plug setup for doing the reads, which is why you
see the big difference (or there would be none). But
blk_mq_flush_plug_list() should really take care of this when the plug
is flushed, requests should be merged at that point. And from your
description, doesn't sound like they are at all.
Jens Axboe March 11, 2022, 12:38 a.m. UTC | #12
On 3/10/22 5:36 PM, Jens Axboe wrote:
> I'm assuming you have a plug setup for doing the reads, which is why you
> see the big difference (or there would be none). But
> blk_mq_flush_plug_list() should really take care of this when the plug
> is flushed, requests should be merged at that point. And from your
> description, doesn't sound like they are at all.

Maybe you need a list sort in blk_mq_flush_plug_list(). If you
round-robin all the drives, then we'll hit that "run queue" path for
each of them when we flush.
Ming Lei March 11, 2022, 1:14 a.m. UTC | #13
On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote:
> On 3/10/22 5:31 PM, Song Liu wrote:
> > On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 3/10/22 4:33 PM, Song Liu wrote:
> >>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 3/10/22 3:37 PM, Song Liu wrote:
> >>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 3/8/22 11:42 PM, Song Liu wrote:
> >>>>>>> RAID arrays check/repair operations benefit a lot from merging requests.
> >>>>>>> If we only check the previous entry for merge attempt, many merge will be
> >>>>>>> missed. As a result, significant regression is observed for RAID check
> >>>>>>> and repair.
> >>>>>>>
> >>>>>>> Fix this by checking more than just the previous entry when
> >>>>>>> plug->multiple_queues == true.
> >>>>>>>
> >>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> >>>>>>> 103 MB/s.
> >>>>>>
> >>>>>> Do the underlying disks not have an IO scheduler attached? Curious why
> >>>>>> the merges aren't being done there, would be trivial when the list is
> >>>>>> flushed out. Because if the perf difference is that big, then other
> >>>>>> workloads would be suffering they are that sensitive to being within a
> >>>>>> plug worth of IO.
> >>>>>
> >>>>> The disks have mq-deadline by default. I also tried kyber, the result
> >>>>> is the same. Raid repair work sends IOs to all the HDDs in a
> >>>>> round-robin manner. If we only check the previous request, there isn't
> >>>>> much opportunity for merge. I guess other workloads may have different
> >>>>> behavior?
> >>>>
> >>>> Round robin one at the time? I feel like there's something odd or
> >>>> suboptimal with the raid rebuild, if it's that sensitive to plug
> >>>> merging.
> >>>
> >>> It is not one request at a time, but more like (for raid456):
> >>>    read 4kB from HDD1, HDD2, HDD3...,
> >>>    then read another 4kB from HDD1, HDD2, HDD3, ...
> >>
> >> Ehm, that very much looks like one-at-the-time from each drive, which is
> >> pretty much the worst way to do it :-)
> >>
> >> Is there a reason for that? Why isn't it using 64k chunks or something
> >> like that? You could still do that as a kind of read-ahead, even if
> >> you're still processing in chunks of 4k.
> > 
> > raid456 handles logic in the granularity of stripe. Each stripe is 4kB from
> > every HDD in the array. AFAICT, we need some non-trivial change to
> > enable the read ahead.
> 
> Right, you'd need to stick some sort of caching in between so instead of
> reading 4k directly, you ask the cache for 4k and that can manage
> read-ahead.
> 
> >>>> Plug merging is mainly meant to reduce the overhead of merging,
> >>>> complement what the scheduler would do. If there's a big drop in
> >>>> performance just by not getting as efficient merging on the plug side,
> >>>> that points to an issue with something else.
> >>>
> >>> We introduced blk_plug_max_rq_count() to give md more opportunities to
> >>> merge at plug side, so I guess the behavior has been like this for a
> >>> long time. I will take a look at the scheduler side and see whether we
> >>> can just merge later, but I am not very optimistic about it.
> >>
> >> Yeah I remember, and that also kind of felt like a work-around for some
> >> underlying issue. Maybe there's something about how the IO is issued
> >> that makes it go straight to disk and we never get any merging? Is it
> >> because they are sync reads?
> >>
> >> In any case, just doing larger reads would likely help quite a bit, but
> >> would still be nice to get to the bottom of why we're not seeing the
> >> level of merging we expect.
> > 
> > Let me look more into this. Maybe we messed something up in the
> > scheduler.
> 
> I'm assuming you have a plug setup for doing the reads, which is why you
> see the big difference (or there would be none). But
> blk_mq_flush_plug_list() should really take care of this when the plug
> is flushed, requests should be merged at that point. And from your
> description, doesn't sound like they are at all.

requests are shared, when running out of request, plug list will be flushed
early.

Maybe we can just put bios into plug list directly, then handle them all in
blk_mq_flush_plug_list.


thanks,
Ming
Jens Axboe March 11, 2022, 1:21 a.m. UTC | #14
On 3/10/22 6:14 PM, Ming Lei wrote:
> On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote:
>> On 3/10/22 5:31 PM, Song Liu wrote:
>>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 3/10/22 4:33 PM, Song Liu wrote:
>>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 3/10/22 3:37 PM, Song Liu wrote:
>>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>
>>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote:
>>>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests.
>>>>>>>>> If we only check the previous entry for merge attempt, many merge will be
>>>>>>>>> missed. As a result, significant regression is observed for RAID check
>>>>>>>>> and repair.
>>>>>>>>>
>>>>>>>>> Fix this by checking more than just the previous entry when
>>>>>>>>> plug->multiple_queues == true.
>>>>>>>>>
>>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
>>>>>>>>> 103 MB/s.
>>>>>>>>
>>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why
>>>>>>>> the merges aren't being done there, would be trivial when the list is
>>>>>>>> flushed out. Because if the perf difference is that big, then other
>>>>>>>> workloads would be suffering they are that sensitive to being within a
>>>>>>>> plug worth of IO.
>>>>>>>
>>>>>>> The disks have mq-deadline by default. I also tried kyber, the result
>>>>>>> is the same. Raid repair work sends IOs to all the HDDs in a
>>>>>>> round-robin manner. If we only check the previous request, there isn't
>>>>>>> much opportunity for merge. I guess other workloads may have different
>>>>>>> behavior?
>>>>>>
>>>>>> Round robin one at the time? I feel like there's something odd or
>>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug
>>>>>> merging.
>>>>>
>>>>> It is not one request at a time, but more like (for raid456):
>>>>>    read 4kB from HDD1, HDD2, HDD3...,
>>>>>    then read another 4kB from HDD1, HDD2, HDD3, ...
>>>>
>>>> Ehm, that very much looks like one-at-the-time from each drive, which is
>>>> pretty much the worst way to do it :-)
>>>>
>>>> Is there a reason for that? Why isn't it using 64k chunks or something
>>>> like that? You could still do that as a kind of read-ahead, even if
>>>> you're still processing in chunks of 4k.
>>>
>>> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from
>>> every HDD in the array. AFAICT, we need some non-trivial change to
>>> enable the read ahead.
>>
>> Right, you'd need to stick some sort of caching in between so instead of
>> reading 4k directly, you ask the cache for 4k and that can manage
>> read-ahead.
>>
>>>>>> Plug merging is mainly meant to reduce the overhead of merging,
>>>>>> complement what the scheduler would do. If there's a big drop in
>>>>>> performance just by not getting as efficient merging on the plug side,
>>>>>> that points to an issue with something else.
>>>>>
>>>>> We introduced blk_plug_max_rq_count() to give md more opportunities to
>>>>> merge at plug side, so I guess the behavior has been like this for a
>>>>> long time. I will take a look at the scheduler side and see whether we
>>>>> can just merge later, but I am not very optimistic about it.
>>>>
>>>> Yeah I remember, and that also kind of felt like a work-around for some
>>>> underlying issue. Maybe there's something about how the IO is issued
>>>> that makes it go straight to disk and we never get any merging? Is it
>>>> because they are sync reads?
>>>>
>>>> In any case, just doing larger reads would likely help quite a bit, but
>>>> would still be nice to get to the bottom of why we're not seeing the
>>>> level of merging we expect.
>>>
>>> Let me look more into this. Maybe we messed something up in the
>>> scheduler.
>>
>> I'm assuming you have a plug setup for doing the reads, which is why you
>> see the big difference (or there would be none). But
>> blk_mq_flush_plug_list() should really take care of this when the plug
>> is flushed, requests should be merged at that point. And from your
>> description, doesn't sound like they are at all.
> 
> requests are shared, when running out of request, plug list will be
> flushed early.

That is true, but I don't think that's the problem here with the round
robin approach. Seems like it'd drive a pretty low queue depth, even
considering SATA.

> Maybe we can just put bios into plug list directly, then handle them
> all in blk_mq_flush_plug_list.

That might not be a bad idea regardless. And should be trivial now, with
the plug list being singly linked anyway.
Ming Lei March 11, 2022, 1:32 a.m. UTC | #15
On Thu, Mar 10, 2022 at 06:21:33PM -0700, Jens Axboe wrote:
> On 3/10/22 6:14 PM, Ming Lei wrote:
> > On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote:
> >> On 3/10/22 5:31 PM, Song Liu wrote:
> >>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 3/10/22 4:33 PM, Song Liu wrote:
> >>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 3/10/22 3:37 PM, Song Liu wrote:
> >>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>
> >>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote:
> >>>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests.
> >>>>>>>>> If we only check the previous entry for merge attempt, many merge will be
> >>>>>>>>> missed. As a result, significant regression is observed for RAID check
> >>>>>>>>> and repair.
> >>>>>>>>>
> >>>>>>>>> Fix this by checking more than just the previous entry when
> >>>>>>>>> plug->multiple_queues == true.
> >>>>>>>>>
> >>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
> >>>>>>>>> 103 MB/s.
> >>>>>>>>
> >>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why
> >>>>>>>> the merges aren't being done there, would be trivial when the list is
> >>>>>>>> flushed out. Because if the perf difference is that big, then other
> >>>>>>>> workloads would be suffering they are that sensitive to being within a
> >>>>>>>> plug worth of IO.
> >>>>>>>
> >>>>>>> The disks have mq-deadline by default. I also tried kyber, the result
> >>>>>>> is the same. Raid repair work sends IOs to all the HDDs in a
> >>>>>>> round-robin manner. If we only check the previous request, there isn't
> >>>>>>> much opportunity for merge. I guess other workloads may have different
> >>>>>>> behavior?
> >>>>>>
> >>>>>> Round robin one at the time? I feel like there's something odd or
> >>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug
> >>>>>> merging.
> >>>>>
> >>>>> It is not one request at a time, but more like (for raid456):
> >>>>>    read 4kB from HDD1, HDD2, HDD3...,
> >>>>>    then read another 4kB from HDD1, HDD2, HDD3, ...
> >>>>
> >>>> Ehm, that very much looks like one-at-the-time from each drive, which is
> >>>> pretty much the worst way to do it :-)
> >>>>
> >>>> Is there a reason for that? Why isn't it using 64k chunks or something
> >>>> like that? You could still do that as a kind of read-ahead, even if
> >>>> you're still processing in chunks of 4k.
> >>>
> >>> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from
> >>> every HDD in the array. AFAICT, we need some non-trivial change to
> >>> enable the read ahead.
> >>
> >> Right, you'd need to stick some sort of caching in between so instead of
> >> reading 4k directly, you ask the cache for 4k and that can manage
> >> read-ahead.
> >>
> >>>>>> Plug merging is mainly meant to reduce the overhead of merging,
> >>>>>> complement what the scheduler would do. If there's a big drop in
> >>>>>> performance just by not getting as efficient merging on the plug side,
> >>>>>> that points to an issue with something else.
> >>>>>
> >>>>> We introduced blk_plug_max_rq_count() to give md more opportunities to
> >>>>> merge at plug side, so I guess the behavior has been like this for a
> >>>>> long time. I will take a look at the scheduler side and see whether we
> >>>>> can just merge later, but I am not very optimistic about it.
> >>>>
> >>>> Yeah I remember, and that also kind of felt like a work-around for some
> >>>> underlying issue. Maybe there's something about how the IO is issued
> >>>> that makes it go straight to disk and we never get any merging? Is it
> >>>> because they are sync reads?
> >>>>
> >>>> In any case, just doing larger reads would likely help quite a bit, but
> >>>> would still be nice to get to the bottom of why we're not seeing the
> >>>> level of merging we expect.
> >>>
> >>> Let me look more into this. Maybe we messed something up in the
> >>> scheduler.
> >>
> >> I'm assuming you have a plug setup for doing the reads, which is why you
> >> see the big difference (or there would be none). But
> >> blk_mq_flush_plug_list() should really take care of this when the plug
> >> is flushed, requests should be merged at that point. And from your
> >> description, doesn't sound like they are at all.
> > 
> > requests are shared, when running out of request, plug list will be
> > flushed early.
> 
> That is true, but I don't think that's the problem here with the round
> robin approach. Seems like it'd drive a pretty low queue depth, even
> considering SATA.

Another one may be plug list not sorted before inserting requests to
scheduler in blk_mq_flush_plug_list(), looks you have mentioned.

Thanks,
Ming
Jens Axboe March 11, 2022, 1:35 a.m. UTC | #16
On 3/10/22 6:32 PM, Ming Lei wrote:
> On Thu, Mar 10, 2022 at 06:21:33PM -0700, Jens Axboe wrote:
>> On 3/10/22 6:14 PM, Ming Lei wrote:
>>> On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote:
>>>> On 3/10/22 5:31 PM, Song Liu wrote:
>>>>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 3/10/22 4:33 PM, Song Liu wrote:
>>>>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>
>>>>>>>> On 3/10/22 3:37 PM, Song Liu wrote:
>>>>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>
>>>>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote:
>>>>>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests.
>>>>>>>>>>> If we only check the previous entry for merge attempt, many merge will be
>>>>>>>>>>> missed. As a result, significant regression is observed for RAID check
>>>>>>>>>>> and repair.
>>>>>>>>>>>
>>>>>>>>>>> Fix this by checking more than just the previous entry when
>>>>>>>>>>> plug->multiple_queues == true.
>>>>>>>>>>>
>>>>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
>>>>>>>>>>> 103 MB/s.
>>>>>>>>>>
>>>>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why
>>>>>>>>>> the merges aren't being done there, would be trivial when the list is
>>>>>>>>>> flushed out. Because if the perf difference is that big, then other
>>>>>>>>>> workloads would be suffering they are that sensitive to being within a
>>>>>>>>>> plug worth of IO.
>>>>>>>>>
>>>>>>>>> The disks have mq-deadline by default. I also tried kyber, the result
>>>>>>>>> is the same. Raid repair work sends IOs to all the HDDs in a
>>>>>>>>> round-robin manner. If we only check the previous request, there isn't
>>>>>>>>> much opportunity for merge. I guess other workloads may have different
>>>>>>>>> behavior?
>>>>>>>>
>>>>>>>> Round robin one at the time? I feel like there's something odd or
>>>>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug
>>>>>>>> merging.
>>>>>>>
>>>>>>> It is not one request at a time, but more like (for raid456):
>>>>>>>    read 4kB from HDD1, HDD2, HDD3...,
>>>>>>>    then read another 4kB from HDD1, HDD2, HDD3, ...
>>>>>>
>>>>>> Ehm, that very much looks like one-at-the-time from each drive, which is
>>>>>> pretty much the worst way to do it :-)
>>>>>>
>>>>>> Is there a reason for that? Why isn't it using 64k chunks or something
>>>>>> like that? You could still do that as a kind of read-ahead, even if
>>>>>> you're still processing in chunks of 4k.
>>>>>
>>>>> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from
>>>>> every HDD in the array. AFAICT, we need some non-trivial change to
>>>>> enable the read ahead.
>>>>
>>>> Right, you'd need to stick some sort of caching in between so instead of
>>>> reading 4k directly, you ask the cache for 4k and that can manage
>>>> read-ahead.
>>>>
>>>>>>>> Plug merging is mainly meant to reduce the overhead of merging,
>>>>>>>> complement what the scheduler would do. If there's a big drop in
>>>>>>>> performance just by not getting as efficient merging on the plug side,
>>>>>>>> that points to an issue with something else.
>>>>>>>
>>>>>>> We introduced blk_plug_max_rq_count() to give md more opportunities to
>>>>>>> merge at plug side, so I guess the behavior has been like this for a
>>>>>>> long time. I will take a look at the scheduler side and see whether we
>>>>>>> can just merge later, but I am not very optimistic about it.
>>>>>>
>>>>>> Yeah I remember, and that also kind of felt like a work-around for some
>>>>>> underlying issue. Maybe there's something about how the IO is issued
>>>>>> that makes it go straight to disk and we never get any merging? Is it
>>>>>> because they are sync reads?
>>>>>>
>>>>>> In any case, just doing larger reads would likely help quite a bit, but
>>>>>> would still be nice to get to the bottom of why we're not seeing the
>>>>>> level of merging we expect.
>>>>>
>>>>> Let me look more into this. Maybe we messed something up in the
>>>>> scheduler.
>>>>
>>>> I'm assuming you have a plug setup for doing the reads, which is why you
>>>> see the big difference (or there would be none). But
>>>> blk_mq_flush_plug_list() should really take care of this when the plug
>>>> is flushed, requests should be merged at that point. And from your
>>>> description, doesn't sound like they are at all.
>>>
>>> requests are shared, when running out of request, plug list will be
>>> flushed early.
>>
>> That is true, but I don't think that's the problem here with the round
>> robin approach. Seems like it'd drive a pretty low queue depth, even
>> considering SATA.
> 
> Another one may be plug list not sorted before inserting requests to
> scheduler in blk_mq_flush_plug_list(), looks you have mentioned.

Yep, it'd probably be the first thing I'd try... The way the IO is
issued, it's pretty much guaranteed that the plug list will be fully
interleaved with different queues and we're then issuine one-by-one and
running the queue each time.
Wols Lists March 11, 2022, 8:09 a.m. UTC | #17
On 11/03/2022 01:35, Jens Axboe wrote:
> On 3/10/22 6:32 PM, Ming Lei wrote:
>> On Thu, Mar 10, 2022 at 06:21:33PM -0700, Jens Axboe wrote:
>>> On 3/10/22 6:14 PM, Ming Lei wrote:
>>>> On Thu, Mar 10, 2022 at 05:36:44PM -0700, Jens Axboe wrote:
>>>>> On 3/10/22 5:31 PM, Song Liu wrote:
>>>>>> On Thu, Mar 10, 2022 at 4:07 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>> On 3/10/22 4:33 PM, Song Liu wrote:
>>>>>>>> On Thu, Mar 10, 2022 at 3:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>
>>>>>>>>> On 3/10/22 3:37 PM, Song Liu wrote:
>>>>>>>>>> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 3/8/22 11:42 PM, Song Liu wrote:
>>>>>>>>>>>> RAID arrays check/repair operations benefit a lot from merging requests.
>>>>>>>>>>>> If we only check the previous entry for merge attempt, many merge will be
>>>>>>>>>>>> missed. As a result, significant regression is observed for RAID check
>>>>>>>>>>>> and repair.
>>>>>>>>>>>>
>>>>>>>>>>>> Fix this by checking more than just the previous entry when
>>>>>>>>>>>> plug->multiple_queues == true.
>>>>>>>>>>>>
>>>>>>>>>>>> This improves the check/repair speed of a 20-HDD raid6 from 19 MB/s to
>>>>>>>>>>>> 103 MB/s.
>>>>>>>>>>>
>>>>>>>>>>> Do the underlying disks not have an IO scheduler attached? Curious why
>>>>>>>>>>> the merges aren't being done there, would be trivial when the list is
>>>>>>>>>>> flushed out. Because if the perf difference is that big, then other
>>>>>>>>>>> workloads would be suffering they are that sensitive to being within a
>>>>>>>>>>> plug worth of IO.
>>>>>>>>>>
>>>>>>>>>> The disks have mq-deadline by default. I also tried kyber, the result
>>>>>>>>>> is the same. Raid repair work sends IOs to all the HDDs in a
>>>>>>>>>> round-robin manner. If we only check the previous request, there isn't
>>>>>>>>>> much opportunity for merge. I guess other workloads may have different
>>>>>>>>>> behavior?
>>>>>>>>>
>>>>>>>>> Round robin one at the time? I feel like there's something odd or
>>>>>>>>> suboptimal with the raid rebuild, if it's that sensitive to plug
>>>>>>>>> merging.
>>>>>>>>
>>>>>>>> It is not one request at a time, but more like (for raid456):
>>>>>>>>     read 4kB from HDD1, HDD2, HDD3...,
>>>>>>>>     then read another 4kB from HDD1, HDD2, HDD3, ...
>>>>>>>
>>>>>>> Ehm, that very much looks like one-at-the-time from each drive, which is
>>>>>>> pretty much the worst way to do it :-)
>>>>>>>
>>>>>>> Is there a reason for that? Why isn't it using 64k chunks or something
>>>>>>> like that? You could still do that as a kind of read-ahead, even if
>>>>>>> you're still processing in chunks of 4k.
>>>>>>
>>>>>> raid456 handles logic in the granularity of stripe. Each stripe is 4kB from
>>>>>> every HDD in the array. AFAICT, we need some non-trivial change to
>>>>>> enable the read ahead.
>>>>>
>>>>> Right, you'd need to stick some sort of caching in between so instead of
>>>>> reading 4k directly, you ask the cache for 4k and that can manage
>>>>> read-ahead.
>>>>>
>>>>>>>>> Plug merging is mainly meant to reduce the overhead of merging,
>>>>>>>>> complement what the scheduler would do. If there's a big drop in
>>>>>>>>> performance just by not getting as efficient merging on the plug side,
>>>>>>>>> that points to an issue with something else.
>>>>>>>>
>>>>>>>> We introduced blk_plug_max_rq_count() to give md more opportunities to
>>>>>>>> merge at plug side, so I guess the behavior has been like this for a
>>>>>>>> long time. I will take a look at the scheduler side and see whether we
>>>>>>>> can just merge later, but I am not very optimistic about it.
>>>>>>>
>>>>>>> Yeah I remember, and that also kind of felt like a work-around for some
>>>>>>> underlying issue. Maybe there's something about how the IO is issued
>>>>>>> that makes it go straight to disk and we never get any merging? Is it
>>>>>>> because they are sync reads?
>>>>>>>
>>>>>>> In any case, just doing larger reads would likely help quite a bit, but
>>>>>>> would still be nice to get to the bottom of why we're not seeing the
>>>>>>> level of merging we expect.
>>>>>>
>>>>>> Let me look more into this. Maybe we messed something up in the
>>>>>> scheduler.
>>>>>
>>>>> I'm assuming you have a plug setup for doing the reads, which is why you
>>>>> see the big difference (or there would be none). But
>>>>> blk_mq_flush_plug_list() should really take care of this when the plug
>>>>> is flushed, requests should be merged at that point. And from your
>>>>> description, doesn't sound like they are at all.
>>>>
>>>> requests are shared, when running out of request, plug list will be
>>>> flushed early.
>>>
>>> That is true, but I don't think that's the problem here with the round
>>> robin approach. Seems like it'd drive a pretty low queue depth, even
>>> considering SATA.
>>
>> Another one may be plug list not sorted before inserting requests to
>> scheduler in blk_mq_flush_plug_list(), looks you have mentioned.
> 
> Yep, it'd probably be the first thing I'd try... The way the IO is
> issued, it's pretty much guaranteed that the plug list will be fully
> interleaved with different queues and we're then issuine one-by-one and
> running the queue each time.
> 
Naive question, but can you make the flush flush the first one, then 
scan the queue for all bios for the same device, then go back and start 
again? Simple approach if it'll work, at the expense of scanning the 
queue once per device.

Cheers,
Wol
Wilson Jonathan March 11, 2022, 11:30 a.m. UTC | #18
On Thu, 2022-03-10 at 14:37 -0800, Song Liu wrote:
> On Thu, Mar 10, 2022 at 2:15 PM Jens Axboe <axboe@kernel.dk> wrote:
> > 
> > On 3/8/22 11:42 PM, Song Liu wrote:
> > > RAID arrays check/repair operations benefit a lot from merging
> > > requests.
> > > If we only check the previous entry for merge attempt, many merge
> > > will be
> > > missed. As a result, significant regression is observed for RAID
> > > check
> > > and repair.
> > > 
> > > Fix this by checking more than just the previous entry when
> > > plug->multiple_queues == true.
> > > 
> > > This improves the check/repair speed of a 20-HDD raid6 from 19
> > > MB/s to
> > > 103 MB/s.
> > 
> > Do the underlying disks not have an IO scheduler attached? Curious
> > why
> > the merges aren't being done there, would be trivial when the list
> > is
> > flushed out. Because if the perf difference is that big, then other
> > workloads would be suffering they are that sensitive to being
> > within a
> > plug worth of IO.
> 
> The disks have mq-deadline by default. I also tried kyber, the result
> is the
> same. Raid repair work sends IOs to all the HDDs in a round-robin
> manner.
> If we only check the previous request, there isn't much opportunity
> for
> merge. I guess other workloads may have different behavior?
> 
> > Between your two approaches, I do greatly prefer the first one
> > though.
> 
> I also like the first one better. But I am not sure whether it will
> slow down
> other workloads. We can probably also make the second one cleaner
> with a new variation of blk_start_plug.

As a matter of note and purely anecdotal: Before the raid "check" slow
down/regression my system would be responsive but delayed (opening a
program or opening the xface application menu or switching a file in
VLC would take longer than normal, fractions of seconds to a second but
slugish and notacable) and with the regression that slow down went from
annoying to unbearable. 

The slowdowns (in programs and menus and file changes) also *seems* to
get worse (in both pre & post regression) the longer the check has been
running and the slower a run naturally gets (I assume as the check
moves from the outer portion of the disk to the inner portion?) and the
lower the KB's reported in cat /proc/mdstat/.

In the post regression situation it wasn't just that the check was
taking much longer and was much slower it was also that it slowed down
everything else to the point that it was painful to try and use the
computer as it was so much less responsive (multiple seconds for
anything to load/run/swtch; even web pages). A laggy annoyance had
become an actual hindrance. 

I have no idea why the speed of the "check" would seemingly affect the
apparent responsiveness of the computer and why it would appear that
the slower the check the slower the responsiveness. 

> 
> Thanks,
> Song
Jens Axboe March 11, 2022, 2:16 p.m. UTC | #19
On 3/10/22 5:07 PM, Jens Axboe wrote:
> In any case, just doing larger reads would likely help quite a bit, but
> would still be nice to get to the bottom of why we're not seeing the
> level of merging we expect.

Song, can you try this one? It'll do the dispatch in a somewhat saner
fashion, bundling identical queues. And we'll keep iterating the plug
list for a merge if we have multiple disks, until we've seen a queue
match and checked.


diff --git a/block/blk-merge.c b/block/blk-merge.c
index 0e871d4e7cb8..68b623d00db5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -1073,12 +1073,20 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	if (!plug || rq_list_empty(plug->mq_list))
 		return false;
 
-	/* check the previously added entry for a quick merge attempt */
-	rq = rq_list_peek(&plug->mq_list);
-	if (rq->q == q) {
-		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
-				BIO_MERGE_OK)
-			return true;
+	rq_list_for_each(&plug->mq_list, rq) {
+		if (rq->q == q) {
+			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+			    BIO_MERGE_OK)
+				return true;
+			break;
+		}
+
+		/*
+		 * Only keep iterating plug list for merges if we have multiple
+		 * queues
+		 */
+		if (!plug->multiple_queues)
+			break;
 	}
 	return false;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bb263abbb40f..9c784262fd6b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2576,13 +2576,36 @@ static void __blk_mq_flush_plug_list(struct request_queue *q,
 	q->mq_ops->queue_rqs(&plug->mq_list);
 }
 
+static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
+{
+	struct blk_mq_hw_ctx *this_hctx = NULL;
+	struct blk_mq_ctx *this_ctx = NULL;
+	struct request *requeue_list = NULL;
+	unsigned int depth = 0;
+	LIST_HEAD(list);
+
+	do {
+		struct request *rq = rq_list_pop(&plug->mq_list);
+
+		if (!this_hctx) {
+			this_hctx = rq->mq_hctx;
+			this_ctx = rq->mq_ctx;
+		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
+			rq_list_add(&requeue_list, rq);
+			continue;
+		}
+		list_add_tail(&rq->queuelist, &list);
+		depth++;
+	} while (!rq_list_empty(plug->mq_list));
+
+	plug->mq_list = requeue_list;
+	trace_block_unplug(this_hctx->queue, depth, !from_sched);
+	blk_mq_sched_insert_requests(this_hctx, this_ctx, &list, from_sched);
+}
+
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
-	struct blk_mq_hw_ctx *this_hctx;
-	struct blk_mq_ctx *this_ctx;
 	struct request *rq;
-	unsigned int depth;
-	LIST_HEAD(list);
 
 	if (rq_list_empty(plug->mq_list))
 		return;
@@ -2618,35 +2641,9 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 			return;
 	}
 
-	this_hctx = NULL;
-	this_ctx = NULL;
-	depth = 0;
 	do {
-		rq = rq_list_pop(&plug->mq_list);
-
-		if (!this_hctx) {
-			this_hctx = rq->mq_hctx;
-			this_ctx = rq->mq_ctx;
-		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
-			trace_block_unplug(this_hctx->queue, depth,
-						!from_schedule);
-			blk_mq_sched_insert_requests(this_hctx, this_ctx,
-						&list, from_schedule);
-			depth = 0;
-			this_hctx = rq->mq_hctx;
-			this_ctx = rq->mq_ctx;
-
-		}
-
-		list_add(&rq->queuelist, &list);
-		depth++;
+		blk_mq_dispatch_plug_list(plug, from_schedule);
 	} while (!rq_list_empty(plug->mq_list));
-
-	if (!list_empty(&list)) {
-		trace_block_unplug(this_hctx->queue, depth, !from_schedule);
-		blk_mq_sched_insert_requests(this_hctx, this_ctx, &list,
-						from_schedule);
-	}
 }
 
 void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
Wols Lists March 11, 2022, 3:58 p.m. UTC | #20
On 11/03/2022 11:30, Wilson Jonathan wrote:
> I have no idea why the speed of the "check" would seemingly affect the
> apparent responsiveness of the computer and why it would appear that
> the slower the check the slower the responsiveness.

The buffers were filling up, nicking your free ram, and you were having 
to wait longer and longer before they could be flushed to swap your 
applications back in?

Cheers,
Wol
Song Liu March 11, 2022, 4:59 p.m. UTC | #21
Hi Jens,

On Fri, Mar 11, 2022 at 6:16 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/10/22 5:07 PM, Jens Axboe wrote:
> > In any case, just doing larger reads would likely help quite a bit, but
> > would still be nice to get to the bottom of why we're not seeing the
> > level of merging we expect.
>
> Song, can you try this one? It'll do the dispatch in a somewhat saner
> fashion, bundling identical queues. And we'll keep iterating the plug
> list for a merge if we have multiple disks, until we've seen a queue
> match and checked.

This one works great! We are seeing 99% read request merge and
500kB+ average read size. The original patch in this thread only got
88% and 34kB for these two metrics.

Thanks,
Song

[...]
Paul Menzel March 11, 2022, 9:41 p.m. UTC | #22
Dear Song,


Am 11.03.22 um 17:59 schrieb Song Liu:

> On Fri, Mar 11, 2022 at 6:16 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/10/22 5:07 PM, Jens Axboe wrote:
>>> In any case, just doing larger reads would likely help quite a bit, but
>>> would still be nice to get to the bottom of why we're not seeing the
>>> level of merging we expect.
>>
>> Song, can you try this one? It'll do the dispatch in a somewhat saner
>> fashion, bundling identical queues. And we'll keep iterating the plug
>> list for a merge if we have multiple disks, until we've seen a queue
>> match and checked.
> 
> This one works great! We are seeing 99% read request merge and
> 500kB+ average read size. The original patch in this thread only got
> 88% and 34kB for these two metrics.

Nice. I am curious, how these metrics can be obtained?

[…]


Kind regards,

Paul
Song Liu March 11, 2022, 10:40 p.m. UTC | #23
On Fri, Mar 11, 2022 at 1:42 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Song,
>
>
> Am 11.03.22 um 17:59 schrieb Song Liu:
>
> > On Fri, Mar 11, 2022 at 6:16 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 3/10/22 5:07 PM, Jens Axboe wrote:
> >>> In any case, just doing larger reads would likely help quite a bit, but
> >>> would still be nice to get to the bottom of why we're not seeing the
> >>> level of merging we expect.
> >>
> >> Song, can you try this one? It'll do the dispatch in a somewhat saner
> >> fashion, bundling identical queues. And we'll keep iterating the plug
> >> list for a merge if we have multiple disks, until we've seen a queue
> >> match and checked.
> >
> > This one works great! We are seeing 99% read request merge and
> > 500kB+ average read size. The original patch in this thread only got
> > 88% and 34kB for these two metrics.
>
> Nice. I am curious, how these metrics can be obtained?
>

We can use tools as iostat:
iostat -mx 2
Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s
%rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb           3176.50    1.00    100.57      0.00 22503.00     0.00
87.63   0.00   10.22    3.50  32.46    32.42     4.00   0.24  76.60
sdi           3167.00    1.00    100.57      0.00 22512.50     0.00
87.67   0.00   11.58    4.00  36.68    32.52     4.00   0.24  77.55

The two metrics we used here are %rrqm and rareq-sz.

Thanks,
Song
diff mbox series

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4de34a332c9f..57e2075fb2f4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -1089,12 +1089,14 @@  bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	if (!plug || rq_list_empty(plug->mq_list))
 		return false;
 
-	/* check the previously added entry for a quick merge attempt */
-	rq = rq_list_peek(&plug->mq_list);
-	if (rq->q == q) {
-		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
-				BIO_MERGE_OK)
-			return true;
+	rq_list_for_each(&plug->mq_list, rq) {
+		if (rq->q == q) {
+			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+			    BIO_MERGE_OK)
+				return true;
+		}
+		if (!plug->multiple_queues)
+			break;
 	}
 	return false;
 }