mbox series

[0/3,v2] bfq: Limit number of allocated scheduler tags per cgroup

Message ID 20210715132047.20874-1-jack@suse.cz (mailing list archive)
Headers show
Series bfq: Limit number of allocated scheduler tags per cgroup | expand

Message

Jan Kara July 15, 2021, 1:30 p.m. UTC
Hello!

Here is the second revision of my patches to fix how bfq weights apply on
cgroup throughput. This version has only one change fixing how we compute
number of tags that should be available to a cgroup. Previous version didn't
combine weights at several levels correctly for deeper hierarchies. It is
somewhat unfortunate that for really deep cgroup hierarchies we would now do
memory allocation inside bfq_limit_depth(). I have an idea how we could avoid
that if the rest of the approach proves OK so don't concentrate too much on
that detail please.

Changes since v1:
* Fixed computation of appropriate proportion of scheduler tags for a cgroup
  to work with deeper cgroup hierarchies.

Original cover letter:

I was looking into why cgroup weights do not have any measurable impact on
writeback throughput from different cgroups. This actually a regression from
CFQ where things work more or less OK and weights have roughly the impact they
should. The problem can be reproduced e.g. by running the following easy fio
job in two cgroups with different weight:

[writer]
directory=/mnt/repro/
numjobs=1
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

I can observe there's no significat difference in the amount of data written
from different cgroups despite their weights are in say 1:3 ratio.

After some debugging I've understood the dynamics of the system. There are two
issues:

1) The amount of scheduler tags needs to be significantly larger than the
amount of device tags. Otherwise there are not enough requests waiting in BFQ
to be dispatched to the device and thus there's nothing to schedule on.

2) Even with enough scheduler tags, writers from two cgroups eventually start
contending on scheduler tag allocation. These are served on first come first
served basis so writers from both cgroups feed requests into bfq with
approximately the same speed. Since bfq prefers IO from heavier cgroup, that is
submitted and completed faster and eventually we end up in a situation when
there's no IO from the heavier cgroup in bfq and all scheduler tags are
consumed by requests from the lighter cgroup. At that point bfq just dispatches
lots of the IO from the lighter cgroup since there's no contender for disk
throughput. As a result observed throughput for both cgroups are the same.

This series fixes this problem by accounting how many scheduler tags are
allocated for each cgroup and if a cgroup has more tags allocated than its
fair share (based on weights) in its service tree, we heavily limit scheduler
tag bitmap depth for it so that it is not be able to starve other cgroups from
scheduler tags.

What do people think about this?

								Honza

Previous versions:
Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1

Comments

Paolo Valente Aug. 27, 2021, 10:07 a.m. UTC | #1
> Il giorno 15 lug 2021, alle ore 15:30, Jan Kara <jack@suse.cz> ha scritto:
> 
> Hello!
> 

Hi!

> Here is the second revision of my patches to fix how bfq weights apply on
> cgroup throughput.

I don't remember whether I replied to your first version.  Anyway,
thanks for this important contribution.

> This version has only one change fixing how we compute
> number of tags that should be available to a cgroup. Previous version didn't
> combine weights at several levels correctly for deeper hierarchies. It is
> somewhat unfortunate that for really deep cgroup hierarchies we would now do
> memory allocation inside bfq_limit_depth(). I have an idea how we could avoid
> that if the rest of the approach proves OK so don't concentrate too much on
> that detail please.
> 
> Changes since v1:
> * Fixed computation of appropriate proportion of scheduler tags for a cgroup
>  to work with deeper cgroup hierarchies.
> 
> Original cover letter:
> 
> I was looking into why cgroup weights do not have any measurable impact on
> writeback throughput from different cgroups. This actually a regression from
> CFQ where things work more or less OK and weights have roughly the impact they
> should. The problem can be reproduced e.g. by running the following easy fio
> job in two cgroups with different weight:
> 
> [writer]
> directory=/mnt/repro/
> numjobs=1
> rw=write
> size=8g
> time_based
> runtime=30
> ramp_time=10
> blocksize=1m
> direct=0
> ioengine=sync
> 
> I can observe there's no significat difference in the amount of data written
> from different cgroups despite their weights are in say 1:3 ratio.
> 
> After some debugging I've understood the dynamics of the system. There are two
> issues:
> 
> 1) The amount of scheduler tags needs to be significantly larger than the
> amount of device tags. Otherwise there are not enough requests waiting in BFQ
> to be dispatched to the device and thus there's nothing to schedule on.
> 

Before discussing your patches in detail, I need a little help on this
point.  You state that the number of scheduler tags must be larger
than the number of device tags.  So, I expected some of your patches
to address somehow this issue, e.g., by increasing the number of
scheduler tags.  Yet I have not found such a change.  Did I miss
something?

Thanks,
Paolo

> 2) Even with enough scheduler tags, writers from two cgroups eventually start
> contending on scheduler tag allocation. These are served on first come first
> served basis so writers from both cgroups feed requests into bfq with
> approximately the same speed. Since bfq prefers IO from heavier cgroup, that is
> submitted and completed faster and eventually we end up in a situation when
> there's no IO from the heavier cgroup in bfq and all scheduler tags are
> consumed by requests from the lighter cgroup. At that point bfq just dispatches
> lots of the IO from the lighter cgroup since there's no contender for disk
> throughput. As a result observed throughput for both cgroups are the same.
> 
> This series fixes this problem by accounting how many scheduler tags are
> allocated for each cgroup and if a cgroup has more tags allocated than its
> fair share (based on weights) in its service tree, we heavily limit scheduler
> tag bitmap depth for it so that it is not be able to starve other cgroups from
> scheduler tags.
> 
> What do people think about this?
> 
> 								Honza
> 
> Previous versions:
> Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1
Michal Koutný Aug. 31, 2021, 9:59 a.m. UTC | #2
Hello Paolo.

On Fri, Aug 27, 2021 at 12:07:20PM +0200, Paolo Valente <paolo.valente@linaro.org> wrote:
> Before discussing your patches in detail, I need a little help on this
> point.  You state that the number of scheduler tags must be larger
> than the number of device tags.  So, I expected some of your patches
> to address somehow this issue, e.g., by increasing the number of
> scheduler tags.  Yet I have not found such a change.  Did I miss
> something?

I believe Jan's conclusions so far are based on "manual" modifications
of available scheduler tags by /sys/block/$dev/queue/nr_requests.
Finding a good default value may be an additional change.

(Hopefully this clarifies a bit before Jan can reply.)

Michal
Jan Kara Sept. 15, 2021, 1:15 p.m. UTC | #3
On Tue 31-08-21 11:59:30, Michal Koutný wrote:
> Hello Paolo.
> 
> On Fri, Aug 27, 2021 at 12:07:20PM +0200, Paolo Valente <paolo.valente@linaro.org> wrote:
> > Before discussing your patches in detail, I need a little help on this
> > point.  You state that the number of scheduler tags must be larger
> > than the number of device tags.  So, I expected some of your patches
> > to address somehow this issue, e.g., by increasing the number of
> > scheduler tags.  Yet I have not found such a change.  Did I miss
> > something?
> 
> I believe Jan's conclusions so far are based on "manual" modifications
> of available scheduler tags by /sys/block/$dev/queue/nr_requests.
> Finding a good default value may be an additional change.

Exactly. So far I was manually increasing nr_requests. I agree that
improving the default nr_requests value selection would be desirable as
well so that manual tuning is not needed. But for now I've left that aside.

								Honza
Paolo Valente Sept. 18, 2021, 10:58 a.m. UTC | #4
> Il giorno 15 set 2021, alle ore 15:15, Jan Kara <jack@suse.cz> ha scritto:
> 
> On Tue 31-08-21 11:59:30, Michal Koutný wrote:
>> Hello Paolo.
>> 
>> On Fri, Aug 27, 2021 at 12:07:20PM +0200, Paolo Valente <paolo.valente@linaro.org> wrote:
>>> Before discussing your patches in detail, I need a little help on this
>>> point.  You state that the number of scheduler tags must be larger
>>> than the number of device tags.  So, I expected some of your patches
>>> to address somehow this issue, e.g., by increasing the number of
>>> scheduler tags.  Yet I have not found such a change.  Did I miss
>>> something?
>> 
>> I believe Jan's conclusions so far are based on "manual" modifications
>> of available scheduler tags by /sys/block/$dev/queue/nr_requests.
>> Finding a good default value may be an additional change.
> 
> Exactly. So far I was manually increasing nr_requests. I agree that
> improving the default nr_requests value selection would be desirable as
> well so that manual tuning is not needed. But for now I've left that aside.
> 

Ok. So, IIUC, to recover control on bandwidth you need to
(1) increase nr_requests manually
and
(2) apply your patch

If you don't do (1), then (2) is not sufficient, and viceversa. Correct?

Thanks,
Paolo

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Sept. 20, 2021, 9:28 a.m. UTC | #5
On Sat 18-09-21 12:58:34, Paolo Valente wrote:
> > Il giorno 15 set 2021, alle ore 15:15, Jan Kara <jack@suse.cz> ha scritto:
> > 
> > On Tue 31-08-21 11:59:30, Michal Koutný wrote:
> >> Hello Paolo.
> >> 
> >> On Fri, Aug 27, 2021 at 12:07:20PM +0200, Paolo Valente <paolo.valente@linaro.org> wrote:
> >>> Before discussing your patches in detail, I need a little help on this
> >>> point.  You state that the number of scheduler tags must be larger
> >>> than the number of device tags.  So, I expected some of your patches
> >>> to address somehow this issue, e.g., by increasing the number of
> >>> scheduler tags.  Yet I have not found such a change.  Did I miss
> >>> something?
> >> 
> >> I believe Jan's conclusions so far are based on "manual" modifications
> >> of available scheduler tags by /sys/block/$dev/queue/nr_requests.
> >> Finding a good default value may be an additional change.
> > 
> > Exactly. So far I was manually increasing nr_requests. I agree that
> > improving the default nr_requests value selection would be desirable as
> > well so that manual tuning is not needed. But for now I've left that aside.
> > 
> 
> Ok. So, IIUC, to recover control on bandwidth you need to
> (1) increase nr_requests manually
> and
> (2) apply your patch
> 
> If you don't do (1), then (2) is not sufficient, and viceversa. Correct?

Correct, although 1) depends on HW capabilities - e.g. for standard SATA
NCQ drive with queue depth of 32, the current nr_requests setting of 256 is
fine and just 2) is enough to recover control. If you run on top of virtio
device or storage controller card with queue depth of 1024, you need to
bump up the nr_requests setting.

								Honza
Jan Kara Sept. 22, 2021, 2:33 p.m. UTC | #6
On Mon 20-09-21 11:28:15, Jan Kara wrote:
> On Sat 18-09-21 12:58:34, Paolo Valente wrote:
> > > Il giorno 15 set 2021, alle ore 15:15, Jan Kara <jack@suse.cz> ha scritto:
> > > 
> > > On Tue 31-08-21 11:59:30, Michal Koutný wrote:
> > >> Hello Paolo.
> > >> 
> > >> On Fri, Aug 27, 2021 at 12:07:20PM +0200, Paolo Valente <paolo.valente@linaro.org> wrote:
> > >>> Before discussing your patches in detail, I need a little help on this
> > >>> point.  You state that the number of scheduler tags must be larger
> > >>> than the number of device tags.  So, I expected some of your patches
> > >>> to address somehow this issue, e.g., by increasing the number of
> > >>> scheduler tags.  Yet I have not found such a change.  Did I miss
> > >>> something?
> > >> 
> > >> I believe Jan's conclusions so far are based on "manual" modifications
> > >> of available scheduler tags by /sys/block/$dev/queue/nr_requests.
> > >> Finding a good default value may be an additional change.
> > > 
> > > Exactly. So far I was manually increasing nr_requests. I agree that
> > > improving the default nr_requests value selection would be desirable as
> > > well so that manual tuning is not needed. But for now I've left that aside.
> > > 
> > 
> > Ok. So, IIUC, to recover control on bandwidth you need to
> > (1) increase nr_requests manually
> > and
> > (2) apply your patch
> > 
> > If you don't do (1), then (2) is not sufficient, and viceversa. Correct?
> 
> Correct, although 1) depends on HW capabilities - e.g. for standard SATA
> NCQ drive with queue depth of 32, the current nr_requests setting of 256 is
> fine and just 2) is enough to recover control. If you run on top of virtio
> device or storage controller card with queue depth of 1024, you need to
> bump up the nr_requests setting.

Paolo, do you have any thoughts about the patches? Any estimate when you
can have a look into them? BTW I have sligthly updated version locally
which also helps with restoring service differentiation for IO priorities
but in principle there's no fundamental difference.

								Honza