[0/9,v4] block: Fix IO priority mess

Message ID: 20220621102201.26337-1-jack@suse.cz

Message

Jan Kara June 21, 2022, 10:24 a.m. UTC
Hello,

This is the fourth revision of my patches fixing IO priority handling in the
block layer.

Changes since v3:
* Added Reviewed-by tags from Damien
* Fixed build failure without CONFIG_BLOCK
* Separated refactoring of get_current_ioprio() into a separate patch

Changes since v2:
* added some comments to better explain things
* changed handling of ioprio_get(2)
* a few small tweaks based on Damien's feedback

Original cover letter:
Recently, I've been looking into a 10% regression reported by our performance
measurement infrastructure in the reaim benchmark, which was bisected down to
5a9d041ba2f6 ("block: move io_context creation into where it's needed"). This
didn't really make much sense and it took me a while to understand, but the
culprit is actually an even older commit, e70344c05995 ("block: fix default IO
priority handling"); 5a9d041ba2f6 just made the breakage visible. Essentially,
the problem was that after these commits some IO was queued with IO priority
class IOPRIO_CLASS_BE while other IO was queued with IOPRIO_CLASS_NONE, and as
a result the two could not be merged, resulting in the performance regression.
I think what commit e70344c05995 ("block: fix default IO priority handling")
did is actually broken not only because of this performance regression but for
other reasons as well (see the changelog of patch 3/8 for details). Besides
this problem, there are multiple other inconsistencies in the IO priority
handling throughout the block stack that we identified when discussing this
with Damien Le Moal. So this patch set aims at fixing all these issues.
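
To illustrate the merge failure: an IO priority is a single value encoding a
(class, level) pair, so a bio tagged IOPRIO_CLASS_BE carries a different
bi_ioprio value than one left at IOPRIO_CLASS_NONE even though the scheduler
treats "none" as best-effort, and an equality-based merge check then keeps
the two in separate requests. A minimal userspace sketch of the effect (the
encoding mirrors the uapi ioprio helpers; the merge check is a simplified
stand-in for the block layer's real one, and BE level 4 is just the usual
default level):

#include <stdbool.h>
#include <stdio.h>

/* Simplified copy of the <linux/ioprio.h> uapi encoding. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, level) \
	(((class) << IOPRIO_CLASS_SHIFT) | (level))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };

/* Stand-in for struct bio: only the priority field matters here. */
struct bio { unsigned short bi_ioprio; };

/* Simplified stand-in for the block layer's merge test. */
static bool can_merge(const struct bio *a, const struct bio *b)
{
	return a->bi_ioprio == b->bi_ioprio;
}

int main(void)
{
	/* One submission path tags IO as best-effort, the other leaves it untagged. */
	struct bio tagged = { .bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 4) };
	struct bio untagged = { .bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0) };

	/* Prints "mergeable: no" - same effective priority, yet never merged. */
	printf("mergeable: %s\n", can_merge(&tagged, &untagged) ? "yes" : "no");
	return 0;
}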

Note that there are a few choices I've made that I'm not 100% sold on. In
particular, the conversion of blk-ioprio away from rqos is somewhat disputable
since it now incurs a cost similar to blk-throttle in the bio submission fast
path (we need to load bio->bi_blkg->pd[ioprio_policy.plid]). If people think
the removed boilerplate code is not worth that cost, I can certainly go via
the "additional rqos hook" path instead.
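
To make that cost concrete, here is a hedged, userspace-compilable sketch of
the pointer chase involved. The structs below are mock-ups that mirror only
the shape of the blkcg policy-data layout (the field set, BLKCG_MAX_POLS
value and plid index are placeholders, not the real kernel definitions):

#include <stdio.h>

#define BLKCG_MAX_POLS	6			/* placeholder value */

struct blkg_policy_data {
	int prio_class;				/* per-cgroup IO priority setting */
};

struct blkcg_gq {				/* per-(cgroup, request_queue) pairing */
	struct blkg_policy_data *pd[BLKCG_MAX_POLS];
};

struct bio {
	struct blkcg_gq *bi_blkg;
};

struct blkcg_policy {
	int plid;				/* index into blkcg_gq->pd[] */
};

static struct blkcg_policy ioprio_policy = { .plid = 2 };

/* The extra per-bio work on the submission fast path: two dependent loads. */
static int lookup_cgroup_prio_class(const struct bio *bio)
{
	struct blkg_policy_data *pd;

	if (!bio->bi_blkg)
		return 0;
	pd = bio->bi_blkg->pd[ioprio_policy.plid];	/* the load in question */
	return pd ? pd->prio_class : 0;
}

int main(void)
{
	struct blkg_policy_data pd = { .prio_class = 1 };
	struct blkcg_gq blkg = { .pd = { [2] = &pd } };
	struct bio bio = { .bi_blkg = &blkg };

	printf("cgroup prio class: %d\n", lookup_cgroup_prio_class(&bio));
	return 0;
}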

								Honza
Previous versions:
Link: http://lore.kernel.org/r/20220601132347.13543-1-jack@suse.cz # v1
Link: http://lore.kernel.org/r/20220615160437.5478-1-jack@suse.cz # v2
Link: http://lore.kernel.org/r/20220620160726.19798-1-jack@suse.cz # v3

Comments

Damien Le Moal June 21, 2022, 1:03 p.m. UTC | #1
On 6/21/22 19:24, Jan Kara wrote:
> Hello,
> 
> This is the fourth revision of my patches fixing IO priority handling in the
> block layer.

Thanks for this. I reviewed and this all looks good to me.
Nevertheless, I will give this a spin tomorrow (with ATA drives that have
NCQ priority).

Jan Kara June 22, 2022, 9:21 a.m. UTC | #2
On Tue 21-06-22 22:03:54, Damien Le Moal wrote:
> On 6/21/22 19:24, Jan Kara wrote:
> > Hello,
> > 
> > This is the fourth revision of my patches fixing IO priority handling in the
> > block layer.
> 
> Thanks for this. I reviewed and this all looks good to me.
> Nevertheless, I will give this a spin tomorrow (with ATA drives that have
> NCQ priority).

Great, thanks for the review and the testing! I will wait for the results
and then post the final version of the patchset.

								Honza
Damien Le Moal June 23, 2022, 12:24 a.m. UTC | #3
On 6/21/22 19:24, Jan Kara wrote:
> Hello,
> 
> This is the fourth revision of my patches fixing IO priority handling in the
> block layer.

Hey,

I ran some tests with RT IO class on an ATA HDD with NCQ priority.
Results look good:

randread: (groupid=0, jobs=1): err= 0: pid=1862: Thu Jun 23 09:22:42 2022
  read: IOPS=152, BW=19.1MiB/s (20.0MB/s)(11.1GiB/592929msec)
    slat (usec): min=37, max=3280, avg=174.96, stdev=120.90
    clat (msec): min=7, max=918, avg=156.95, stdev=149.43
     lat (msec): min=7, max=918, avg=157.13, stdev=149.43
    clat prio 0/0 (msec): min=7, max=918, avg=171.68, stdev=150.48
    clat prio 1/0 (msec): min=8, max=166, avg=24.64, stdev= 5.93
    clat percentiles (msec):
     |  1.00th=[   15],  5.00th=[   20], 10.00th=[   23], 20.00th=[   32],
     | 30.00th=[   51], 40.00th=[   77], 50.00th=[  108], 60.00th=[  146],
     | 70.00th=[  194], 80.00th=[  262], 90.00th=[  372], 95.00th=[  477],
     | 99.00th=[  659], 99.50th=[  701], 99.90th=[  793], 99.95th=[  818],
     | 99.99th=[  885]
    clat prio 0/0 (89.99% of IOs) percentiles (msec):
     |  1.00th=[   15],  5.00th=[   21], 10.00th=[   27], 20.00th=[   46],
     | 30.00th=[   68], 40.00th=[   94], 50.00th=[  126], 60.00th=[  163],
     | 70.00th=[  213], 80.00th=[  279], 90.00th=[  388], 95.00th=[  489],
     | 99.00th=[  659], 99.50th=[  709], 99.90th=[  793], 99.95th=[  827],
     | 99.99th=[  885]
    clat prio 1/0 (10.01% of IOs) percentiles (msec):
     |  1.00th=[   14],  5.00th=[   17], 10.00th=[   18], 20.00th=[   20],
     | 30.00th=[   22], 40.00th=[   23], 50.00th=[   25], 60.00th=[   26],
     | 70.00th=[   28], 80.00th=[   30], 90.00th=[   33], 95.00th=[   35],
     | 99.00th=[   40], 99.50th=[   42], 99.90th=[   47], 99.95th=[   50],
     | 99.99th=[  167]

Clearly significantly lower tail latencies for the 10% of IOs with class
RT (1/0), as expected. This is with "none" scheduler at QD=24 (128K random
read).
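
For reference, a fio job along the following lines produces this kind of
per-priority "clat prio" split; it is an illustrative sketch rather than the
exact job used above, the device path is a placeholder, and NCQ priority use
on the drive is typically enabled separately via the device's ncq_prio_enable
sysfs attribute:

[randread]
; ATA drive with NCQ priority support (placeholder path)
filename=/dev/sdX
ioengine=libaio
direct=1
ioscheduler=none
rw=randread
bs=128k
iodepth=24
; issue ~10% of the IOs with priority class 1 (IOPRIO_CLASS_RT)
cmdprio_percentage=10
cmdprio_class=1
time_based=1
runtime=600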

Feel free to add:

Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Jan Kara June 23, 2022, 7:48 a.m. UTC | #4
On Thu 23-06-22 09:24:56, Damien Le Moal wrote:
> On 6/21/22 19:24, Jan Kara wrote:
> > Hello,
> > 
> > This is the fourth revision of my patches fixing IO priority handling in the
> > block layer.
> 
> Hey,
> 
> I ran some tests with RT IO class on an ATA HDD with NCQ priority.
> Results look good:
> 
> randread: (groupid=0, jobs=1): err= 0: pid=1862: Thu Jun 23 09:22:42 2022
>   read: IOPS=152, BW=19.1MiB/s (20.0MB/s)(11.1GiB/592929msec)
>     slat (usec): min=37, max=3280, avg=174.96, stdev=120.90
>     clat (msec): min=7, max=918, avg=156.95, stdev=149.43
>      lat (msec): min=7, max=918, avg=157.13, stdev=149.43
>     clat prio 0/0 (msec): min=7, max=918, avg=171.68, stdev=150.48
>     clat prio 1/0 (msec): min=8, max=166, avg=24.64, stdev= 5.93
>     clat percentiles (msec):
>      |  1.00th=[   15],  5.00th=[   20], 10.00th=[   23], 20.00th=[   32],
>      | 30.00th=[   51], 40.00th=[   77], 50.00th=[  108], 60.00th=[  146],
>      | 70.00th=[  194], 80.00th=[  262], 90.00th=[  372], 95.00th=[  477],
>      | 99.00th=[  659], 99.50th=[  701], 99.90th=[  793], 99.95th=[  818],
>      | 99.99th=[  885]
>     clat prio 0/0 (89.99% of IOs) percentiles (msec):
>      |  1.00th=[   15],  5.00th=[   21], 10.00th=[   27], 20.00th=[   46],
>      | 30.00th=[   68], 40.00th=[   94], 50.00th=[  126], 60.00th=[  163],
>      | 70.00th=[  213], 80.00th=[  279], 90.00th=[  388], 95.00th=[  489],
>      | 99.00th=[  659], 99.50th=[  709], 99.90th=[  793], 99.95th=[  827],
>      | 99.99th=[  885]
>     clat prio 1/0 (10.01% of IOs) percentiles (msec):
>      |  1.00th=[   14],  5.00th=[   17], 10.00th=[   18], 20.00th=[   20],
>      | 30.00th=[   22], 40.00th=[   23], 50.00th=[   25], 60.00th=[   26],
>      | 70.00th=[   28], 80.00th=[   30], 90.00th=[   33], 95.00th=[   35],
>      | 99.00th=[   40], 99.50th=[   42], 99.90th=[   47], 99.95th=[   50],
>      | 99.99th=[  167]
> 
> Clearly significantly lower tail latencies for the 10% of IOs with class
> RT (1/0), as expected. This is with "none" scheduler at QD=24 (128K random
> read).

Nice! :)

> Feel free to add:
> 
> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Thanks for testing! I'll post the final patch version shortly.

								Honza