[0/8,v3] bfq: Limit number of allocated scheduler tags per cgroup

Message ID 20211006164110.10817-1-jack@suse.cz (mailing list archive)

Message

Jan Kara Oct. 6, 2021, 5:31 p.m. UTC
Hello!

Here is the third revision of my patches to fix how bfq weights apply to cgroup
throughput and to the throughput of processes with different IO priorities. Since
v2 I've added some more patches so that now IO priorities also result in
service differentiation (previously, similarly to cgroup weights, they had no
effect on service differentiation for some workloads). The last patch
in the series still needs some work as in the current state it causes a
notable regression (~20-30%) with dbench benchmark for large numbers of
clients. I've verified that the last patch is indeed necessary for the service
differentiation with the workload described in its changelog. As we discussed
with Paolo, I have also found out that if I remove the "waker has enough
budget" condition from bfq_select_queue(), dbench performance is restored
and the service differentiation is still good. But we probably need some
justification or a cleaner solution rather than just removing the condition, so
that is still open for discussion. But the first seven patches already noticeably
improve the situation for lots of workloads, so IMO they stand on their own and
can be merged regardless of how we go about the last patch.

Changes since v2:
* Rebased on top of current Linus' tree
* Updated computation of scheduler tag proportions to work correctly even
  for processes within the same cgroup but with different IO priorities
* Added comment roughly explaining why we limit tag depth
* Added patch limiting waker / wakee detection in time to avoid at least the
  most obvious false positives
* Added patch to log waker / wakee detections in blktrace for better debugging
* Added patch to properly account injected IO

Changes since v1:
* Fixed computation of appropriate proportion of scheduler tags for a cgroup
  to work with deeper cgroup hierarchies.

Original cover letter:

I was looking into why cgroup weights do not have any measurable impact on
writeback throughput from different cgroups. This is actually a regression from
CFQ, where things work more or less OK and weights have roughly the impact they
should. The problem can be reproduced e.g. by running the following simple fio
job in two cgroups with different weights:

[writer]
directory=/mnt/repro/
numjobs=1
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

I can observe there's no significant difference in the amount of data written
from the different cgroups even though their weights are in, say, a 1:3 ratio.
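
(For reference, a minimal sketch of how such a run can be driven is below. It
assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the io controller
enabled and bfq selected on the target device, so that io.bfq.weight is
available; the cgroup names, the 100/300 weights and the writer.fio job file
name are illustrative only, and in practice each fio instance should get its
own target directory so the two jobs do not collide.)

/* repro.c: run the above fio job in two cgroups with weights in a 1:3 ratio. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a string into a cgroupfs file, bail out on failure. */
static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

/* Create the cgroup, set its BFQ weight and start the fio job inside it. */
static void run_in_cgroup(const char *cg, const char *weight)
{
	char path[256], pid[32];

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s", cg);
	mkdir(path, 0755);
	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/io.bfq.weight", cg);
	write_str(path, weight);

	if (fork() == 0) {
		/* Move this child into the cgroup, then run the fio job. */
		snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cgroup.procs", cg);
		snprintf(pid, sizeof(pid), "%d", getpid());
		write_str(path, pid);
		execlp("fio", "fio", "writer.fio", (char *)NULL);
		perror("fio");
		exit(1);
	}
}

int main(void)
{
	run_in_cgroup("repro-light", "100");
	run_in_cgroup("repro-heavy", "300");
	while (wait(NULL) > 0)
		;	/* wait for both fio instances to finish */
	return 0;
}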

After some debugging I've understood the dynamics of the system. There are two
issues:

1) The amount of scheduler tags needs to be significantly larger than the
amount of device tags. Otherwise there are not enough requests waiting in BFQ
to be dispatched to the device and thus there's nothing to schedule on.

2) Even with enough scheduler tags, writers from the two cgroups eventually
start contending on scheduler tag allocation. Tags are handed out on a
first-come, first-served basis, so writers from both cgroups feed requests into
bfq at approximately the same speed. Since bfq prefers IO from the heavier
cgroup, that IO is submitted and completed faster, and eventually we end up in a
situation where there's no IO from the heavier cgroup in bfq and all scheduler
tags are consumed by requests from the lighter cgroup. At that point bfq just
dispatches lots of IO from the lighter cgroup since there's no contender for
disk throughput. As a result, the observed throughput for both cgroups is the
same.

This series fixes the problem by accounting how many scheduler tags are
allocated for each cgroup, and if a cgroup has more tags allocated than its
fair share (based on weights) in its service tree, we heavily limit the
scheduler tag bitmap depth for it so that it is not able to starve other
cgroups of scheduler tags.
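
(Schematically, the idea is something like the sketch below. This is an
illustration of the intended behaviour only, with made-up names and numbers,
not the code of the actual patches:)

/*
 * Illustration only: a cgroup that already holds more scheduler tags than
 * its weight-based share of the total tag depth gets its allowed allocation
 * depth clamped, so it cannot starve the other cgroups of tags.
 */
#include <stdio.h>

struct example_cgroup {
	unsigned int weight;		/* bfq weight of the cgroup */
	unsigned int tags_allocated;	/* scheduler tags it currently holds */
};

static unsigned int example_tag_depth(const struct example_cgroup *cg,
				      unsigned int total_weight,
				      unsigned int total_depth)
{
	/* Weight-based fair share of the scheduler tag space (at least 1). */
	unsigned int fair_share = total_depth * cg->weight / total_weight;

	if (fair_share == 0)
		fair_share = 1;

	/* Over the fair share: clamp the allowed depth; otherwise no limit. */
	return cg->tags_allocated > fair_share ? fair_share : total_depth;
}

int main(void)
{
	struct example_cgroup light = { .weight = 100, .tags_allocated = 200 };
	struct example_cgroup heavy = { .weight = 300, .tags_allocated = 20 };

	/* 256 scheduler tags shared between cgroups with weights 100 and 300. */
	printf("light cgroup depth: %u\n",
	       example_tag_depth(&light, 400, 256));	/* clamped to 64 */
	printf("heavy cgroup depth: %u\n",
	       example_tag_depth(&heavy, 400, 256));	/* full 256 */
	return 0;
}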

What do people think about this?

								Honza

Previous versions:
Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1
Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@suse.cz # v2

Comments

Paolo Valente Oct. 7, 2021, 4:33 p.m. UTC | #1
> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@suse.cz> ha scritto:
> 
> Hello!
> 
> Here is the third revision of my patches to fix how bfq weights apply to cgroup
> throughput and to the throughput of processes with different IO priorities. Since
> v2 I've added some more patches so that now IO priorities also result in
> service differentiation (previously, similarly to cgroup weights, they had no
> effect on service differentiation for some workloads). The last patch
> in the series still needs some work as in the current state it causes a
> notable regression (~20-30%) with dbench benchmark for large numbers of
> clients. I've verified that the last patch is indeed necessary for the service
> differentiation with the workload described in its changelog. As we discussed
> with Paolo, I have also found out that if I remove the "waker has enough
> budget" condition from bfq_select_queue(), dbench performance is restored
> and the service differentiation is still good. But we probably need some
> justification or a cleaner solution rather than just removing the condition, so
> that is still open for discussion. But the first seven patches already noticeably
> improve the situation for lots of workloads, so IMO they stand on their own and
> can be merged regardless of how we go about the last patch.
> 

Hi Jan,
I have just one more (easy-to-resolve) doubt: you seem to have tested
these patches mostly on the throughput side.  Did you run a
startup-latency test as well?  I can run some for you, if you prefer.
Just give me a few days.

When that is cleared, your first seven patches are ok for me.
Actually I think they are a very good and relevant contribution.
Patch number eight probably deserves some more analysis, as you pointed
out.  As I already told you in our call, we can look at that budget
condition together, and figure out the best tests to use to check
that we do not lose too much I/O control.

Thanks,
Paolo

> Changes since v2:
> * Rebased on top of current Linus' tree
> * Updated computation of scheduler tag proportions to work correctly even
>  for processes within the same cgroup but with different IO priorities
> * Added comment roughly explaining why we limit tag depth
> * Added patch limiting waker / wakee detection in time to avoid at least the
>  most obvious false positives
> * Added patch to log waker / wakee detections in blktrace for better debugging
> * Added patch to properly account injected IO
> 
> Changes since v1:
> * Fixed computation of appropriate proportion of scheduler tags for a cgroup
>  to work with deeper cgroup hierarchies.
> 
> Original cover letter:
> 
> I was looking into why cgroup weights do not have any measurable impact on
> writeback throughput from different cgroups. This is actually a regression from
> CFQ, where things work more or less OK and weights have roughly the impact they
> should. The problem can be reproduced e.g. by running the following simple fio
> job in two cgroups with different weights:
> 
> [writer]
> directory=/mnt/repro/
> numjobs=1
> rw=write
> size=8g
> time_based
> runtime=30
> ramp_time=10
> blocksize=1m
> direct=0
> ioengine=sync
> 
> I can observe there's no significant difference in the amount of data written
> from the different cgroups even though their weights are in, say, a 1:3 ratio.
> 
> After some debugging I've understood the dynamics of the system. There are two
> issues:
> 
> 1) The amount of scheduler tags needs to be significantly larger than the
> amount of device tags. Otherwise there are not enough requests waiting in BFQ
> to be dispatched to the device and thus there's nothing to schedule on.
> 
> 2) Even with enough scheduler tags, writers from the two cgroups eventually
> start contending on scheduler tag allocation. Tags are handed out on a
> first-come, first-served basis, so writers from both cgroups feed requests into
> bfq at approximately the same speed. Since bfq prefers IO from the heavier
> cgroup, that IO is submitted and completed faster, and eventually we end up in a
> situation where there's no IO from the heavier cgroup in bfq and all scheduler
> tags are consumed by requests from the lighter cgroup. At that point bfq just
> dispatches lots of IO from the lighter cgroup since there's no contender for
> disk throughput. As a result, the observed throughput for both cgroups is the
> same.
> 
> This series fixes the problem by accounting how many scheduler tags are
> allocated for each cgroup, and if a cgroup has more tags allocated than its
> fair share (based on weights) in its service tree, we heavily limit the
> scheduler tag bitmap depth for it so that it is not able to starve other
> cgroups of scheduler tags.
> 
> What do people think about this?
> 
> 								Honza
> 
> Previous versions:
> Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1
> Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@suse.cz # v2
Paolo Valente Oct. 25, 2021, 7:58 a.m. UTC | #2
> Il giorno 7 ott 2021, alle ore 18:33, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@suse.cz> ha scritto:
>> 
>> Hello!
>> 
>> Here is the third revision of my patches to fix how bfq weights apply to cgroup
>> throughput and to the throughput of processes with different IO priorities. Since
>> v2 I've added some more patches so that now IO priorities also result in
>> service differentiation (previously, similarly to cgroup weights, they had no
>> effect on service differentiation for some workloads). The last patch
>> in the series still needs some work as in the current state it causes a
>> notable regression (~20-30%) with dbench benchmark for large numbers of
>> clients. I've verified that the last patch is indeed necessary for the service
>> differentiation with the workload described in its changelog. As we discussed
>> with Paolo, I have also found out that if I remove the "waker has enough
>> budget" condition from bfq_select_queue(), dbench performance is restored
>> and the service differentiation is still good. But we probably need some
>> justification or a cleaner solution rather than just removing the condition, so
>> that is still open for discussion. But the first seven patches already noticeably
>> improve the situation for lots of workloads, so IMO they stand on their own and
>> can be merged regardless of how we go about the last patch.
>> 
> 
> Hi Jan,
> I have just one more (easy-to-resolve) doubt: you seem to have tested
> these patches mostly on the throughput side.  Did you run a
> startup-latency test as well?  I can run some for you, if you prefer.
> Just give me a few days.
> 

We are finally testing your patches a little bit right now, for
regressions with our typical benchmarks ...

Paolo

> When that is cleared, your first seven patches are ok for me.
> Actually I think they are a very good and relevant contribution.
> Patch number eight probably deserves some more analysis, as you pointed
> out.  As I already told you in our call, we can look at that budget
> condition together, and figure out the best tests to use to check
> that we do not lose too much I/O control.
> 
> Thanks,
> Paolo
> 
>> Changes since v2:
>> * Rebased on top of current Linus' tree
>> * Updated computation of scheduler tag proportions to work correctly even
>> for processes within the same cgroup but with different IO priorities
>> * Added comment roughly explaining why we limit tag depth
>> * Added patch limiting waker / wakee detection in time to avoid at least the
>> most obvious false positives
>> * Added patch to log waker / wakee detections in blktrace for better debugging
>> * Added patch to properly account injected IO
>> 
>> Changes since v1:
>> * Fixed computation of appropriate proportion of scheduler tags for a cgroup
>> to work with deeper cgroup hierarchies.
>> 
>> Original cover letter:
>> 
>> I was looking into why cgroup weights do not have any measurable impact on
>> writeback throughput from different cgroups. This is actually a regression from
>> CFQ, where things work more or less OK and weights have roughly the impact they
>> should. The problem can be reproduced e.g. by running the following simple fio
>> job in two cgroups with different weights:
>> 
>> [writer]
>> directory=/mnt/repro/
>> numjobs=1
>> rw=write
>> size=8g
>> time_based
>> runtime=30
>> ramp_time=10
>> blocksize=1m
>> direct=0
>> ioengine=sync
>> 
>> I can observe there's no significant difference in the amount of data written
>> from the different cgroups even though their weights are in, say, a 1:3 ratio.
>> 
>> After some debugging I've understood the dynamics of the system. There are two
>> issues:
>> 
>> 1) The amount of scheduler tags needs to be significantly larger than the
>> amount of device tags. Otherwise there are not enough requests waiting in BFQ
>> to be dispatched to the device and thus there's nothing to schedule on.
>> 
>> 2) Even with enough scheduler tags, writers from the two cgroups eventually
>> start contending on scheduler tag allocation. Tags are handed out on a
>> first-come, first-served basis, so writers from both cgroups feed requests into
>> bfq at approximately the same speed. Since bfq prefers IO from the heavier
>> cgroup, that IO is submitted and completed faster, and eventually we end up in a
>> situation where there's no IO from the heavier cgroup in bfq and all scheduler
>> tags are consumed by requests from the lighter cgroup. At that point bfq just
>> dispatches lots of IO from the lighter cgroup since there's no contender for
>> disk throughput. As a result, the observed throughput for both cgroups is the
>> same.
>> 
>> This series fixes the problem by accounting how many scheduler tags are
>> allocated for each cgroup, and if a cgroup has more tags allocated than its
>> fair share (based on weights) in its service tree, we heavily limit the
>> scheduler tag bitmap depth for it so that it is not able to starve other
>> cgroups of scheduler tags.
>> 
>> What do people think about this?
>> 
>> 								Honza
>> 
>> Previous versions:
>> Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1
>> Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@suse.cz # v2
Jan Kara Oct. 25, 2021, 11:14 a.m. UTC | #3
On Mon 25-10-21 09:58:11, Paolo Valente wrote:
> > Il giorno 7 ott 2021, alle ore 18:33, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> >> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@suse.cz> ha scritto:
> >> 
> >> Hello!
> >> 
> >> Here is the third revision of my patches to fix how bfq weights apply to cgroup
> >> throughput and to the throughput of processes with different IO priorities. Since
> >> v2 I've added some more patches so that now IO priorities also result in
> >> service differentiation (previously, similarly to cgroup weights, they had no
> >> effect on service differentiation for some workloads). The last patch
> >> in the series still needs some work as in the current state it causes a
> >> notable regression (~20-30%) with dbench benchmark for large numbers of
> >> clients. I've verified that the last patch is indeed necessary for the service
> >> differentiation with the workload described in its changelog. As we discussed
> >> with Paolo, I have also found out that if I remove the "waker has enough
> >> budget" condition from bfq_select_queue(), dbench performance is restored
> >> and the service differentiation is still good. But we probably need some
> >> justification or a cleaner solution rather than just removing the condition, so
> >> that is still open for discussion. But the first seven patches already noticeably
> >> improve the situation for lots of workloads, so IMO they stand on their own and
> >> can be merged regardless of how we go about the last patch.
> >> 
> > 
> > Hi Jan,
> > I have just one more (easy-to-resolve) doubt: you seem to have tested
> > these patches mostly on the throughput side.  Did you run a
> > startup-latency test as well?  I can run some for you, if you prefer.
> > Just give me a few days.
> > 
> 
We are finally testing your patches a little bit right now, checking for
regressions with our typical benchmarks ...

Hmm, strange, I didn't get your previous email about the benchmarks. You're
right, I didn't run startup-latency tests AFAIR. Now that you've started them,
there's probably not much point in me queuing them as well. So thanks for
the benchmarking :)

								Honza
Paolo Valente Nov. 10, 2021, 10:24 a.m. UTC | #4
> Il giorno 25 ott 2021, alle ore 13:14, Jan Kara <jack@suse.cz> ha scritto:
> 
> On Mon 25-10-21 09:58:11, Paolo Valente wrote:
>>> Il giorno 7 ott 2021, alle ore 18:33, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>>> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@suse.cz> ha scritto:
>>>> 
>>>> Hello!
>>>> 
>>>> Here is the third revision of my patches to fix how bfq weights apply to cgroup
>>>> throughput and to the throughput of processes with different IO priorities. Since
>>>> v2 I've added some more patches so that now IO priorities also result in
>>>> service differentiation (previously, similarly to cgroup weights, they had no
>>>> effect on service differentiation for some workloads). The last patch
>>>> in the series still needs some work as in the current state it causes a
>>>> notable regression (~20-30%) with dbench benchmark for large numbers of
>>>> clients. I've verified that the last patch is indeed necessary for the service
>>>> differentiation with the workload described in its changelog. As we discussed
>>>> with Paolo, I have also found out that if I remove the "waker has enough
>>>> budget" condition from bfq_select_queue(), dbench performance is restored
>>>> and the service differentiation is still good. But we probably need some
>>>> justification or a cleaner solution rather than just removing the condition, so
>>>> that is still open for discussion. But the first seven patches already noticeably
>>>> improve the situation for lots of workloads, so IMO they stand on their own and
>>>> can be merged regardless of how we go about the last patch.
>>>> 
>>> 
>>> Hi Jan,
>>> I have just one more (easy-to-resolve) doubt: you seem to have tested
>>> these patches mostly on the throughput side.  Did you run a
>>> startup-latency test as well?  I can run some for you, if you prefer.
>>> Just give me a few days.
>>> 
>> 
>> We are finally testing your patches a little bit right now, checking for
>> regressions with our typical benchmarks ...
> 
> Hmm, strange, I didn't get your previous email about the benchmarks. You're
> right, I didn't run startup-latency tests AFAIR. Now that you've started them,
> there's probably not much point in me queuing them as well. So thanks for
> the benchmarking :)
> 

Hi Jan,
we have run a lot of benchmarks, on various hardware.  No
regressions found!

So, thank you very much for this important contribution, and:

Acked-by: Paolo Valente <paolo.valente@linaro.org>

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR