[0/3] bfq: Limit number of allocated scheduler tags per cgroup

Message ID: 20210712171146.12231-1-jack@suse.cz (mailing list archive)

Headers:
Return-Path: <linux-block-owner@kernel.org>
From: Jan Kara <jack@suse.cz>
To: <linux-block@vger.kernel.org>
Cc: Paolo Valente <paolo.valente@linaro.org>,
        Jens Axboe <axboe@kernel.dk>, mkoutny@suse.cz,
        Jan Kara <jack@suse.cz>
Subject: [PATCH 0/3] bfq: Limit number of allocated scheduler tags per cgroup
Date: Mon, 12 Jul 2021 19:27:36 +0200
Message-Id: <20210712171146.12231-1-jack@suse.cz>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series: bfq: Limit number of allocated scheduler tags per cgroup
[0/3] bfq: Limit number of allocated scheduler tags per cgroup
[1/3] block: Provide icq in request allocation data
[2/3] bfq: Track number of allocated requests in bfq_entity
[3/3] bfq: Limit number of requests consumed by each cgroup

Message

Jan Kara July 12, 2021, 5:27 p.m. UTC

Hello!

I was looking into why cgroup weights do not have any measurable impact on
writeback throughput from different cgroups. This is actually a regression from
CFQ, where things work more or less OK and weights have roughly the impact they
should. The problem can be reproduced e.g. by running the following simple fio
job in two cgroups with different weights:

[writer]
directory=/mnt/repro/
numjobs=1
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

I can observe there's no significant difference in the amount of data written
from the different cgroups even though their weights are in, say, a 1:3 ratio.

After some debugging I've understood the dynamics of the system. There are two
issues:

1) The number of scheduler tags needs to be significantly larger than the
number of device tags. Otherwise there are not enough requests waiting in BFQ
to be dispatched to the device and thus there's nothing to schedule on.

2) Even with enough scheduler tags, writers from two cgroups eventually start
contending on scheduler tag allocation. Tags are handed out on a first come,
first served basis, so writers from both cgroups feed requests into bfq at
approximately the same speed. Since bfq prefers IO from the heavier cgroup, that
IO is submitted and completed faster, and eventually we end up in a situation
where there's no IO from the heavier cgroup in bfq and all scheduler tags are
consumed by requests from the lighter cgroup. At that point bfq just dispatches
lots of IO from the lighter cgroup since there's no contender for disk
throughput. As a result, the observed throughput for both cgroups is the same.
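
The dynamics in 2) can be illustrated with a toy model. This is only a sketch
under simplifying assumptions, not how bfq or blk-mq is implemented: time
advances in ticks, the device completes a fixed number of requests per tick,
dispatch is split by weight (with unused budget going to whoever has requests
queued), and freed tags are handed out round-robin to approximate first come,
first served allocation between writers that are always waiting. The function
name and all parameters are hypothetical.

```python
def simulate(total_tags=64, dispatch_per_tick=8, weights=(3, 1), ticks=1000):
    """Toy model of scheduler tag contention; returns completions per cgroup."""
    n = len(weights)
    wsum = sum(weights)
    queued = [total_tags // n] * n   # assumption: tags start evenly split
    done = [0] * n
    for _ in range(ticks):
        # Dispatch: each cgroup gets a weight-proportional slice of the
        # budget, bounded by the number of requests it actually has queued.
        take = [min(queued[i], dispatch_per_tick * weights[i] // wsum)
                for i in range(n)]
        # Work conservation: unused budget goes to cgroups with queued IO.
        spare = dispatch_per_tick - sum(take)
        for i in range(n):
            extra = min(spare, queued[i] - take[i])
            take[i] += extra
            spare -= extra
        for i in range(n):
            queued[i] -= take[i]
            done[i] += take[i]
        # Allocation: freed tags are handed out round-robin, regardless of
        # weight, because every writer is always waiting for a tag.
        free = total_tags - sum(queued)
        i = 0
        while free > 0:
            queued[i % n] += 1
            free -= 1
            i += 1
    return done


if __name__ == "__main__":
    heavy, light = simulate()
    print("heavy:", heavy, "light:", light, "ratio: %.2f" % (heavy / light))
```

In this model the heavier cgroup gets its 3:1 share only for the first few
ticks. Once its queue inside the scheduler drains, it can dispatch only as fast
as it obtains tags, and the long-run ratio ends up close to 1:1.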

This series fixes this problem by accounting how many scheduler tags are
allocated for each cgroup. If a cgroup has more tags allocated than its
fair share (based on weights) in its service tree, we heavily limit the scheduler
tag bitmap depth for it so that it is not able to starve other cgroups of
scheduler tags.
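
As a rough illustration of the idea (not the code in the patches), the fair
share check could look like the sketch below. It flattens the cgroup hierarchy
to a single level, whereas the series works per service tree. The function
names and the choice of a limited depth of 1 are assumptions made for the
example.

```python
def fair_share_tags(weights, total_tags):
    """Split total_tags between cgroups in proportion to their weights.

    Assumption: a flat, single-level hierarchy; every cgroup gets at least
    one tag so that it can always make progress.
    """
    wsum = sum(weights.values())
    return {name: max(1, total_tags * w // wsum)
            for name, w in weights.items()}


def allowed_depth(allocated, share, full_depth, limited_depth=1):
    """Return the tag bitmap depth a cgroup may allocate from.

    A cgroup holding more tags than its fair share is restricted to
    limited_depth (hypothetical value); otherwise it sees the full depth.
    """
    return limited_depth if allocated > share else full_depth


if __name__ == "__main__":
    shares = fair_share_tags({"heavy": 300, "light": 100}, total_tags=64)
    print(shares)
    # The light cgroup holds 20 tags, more than its share, so it is limited.
    print(allowed_depth(allocated=20, share=shares["light"], full_depth=64))
```

With such a limit the lighter cgroup cannot fill the whole tag space, so the
heavier cgroup always finds tags available for its requests.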

What do people think about this?

								Honza