
[-next] blk-throttle: enable io throttle for root in cgroup v2

Message ID 20220114093000.3323470-1-yukuai3@huawei.com (mailing list archive)
State New, archived
Series [-next] blk-throttle: enable io throttle for root in cgroup v2

Commit Message

Yu Kuai Jan. 14, 2022, 9:30 a.m. UTC
RFC patch: https://lkml.org/lkml/2021/9/9/1432

There is a performance problem in our environment:

A host can provide a remote device to different clients. If one client is
under high io pressure, other clients might be affected.

Limiting the overall iops/bps (io.max) from the client can fix the problem;
however, the config files currently do not exist in the root cgroup, which
makes this impossible.

This patch enables io throttle for the root cgroup:
 - enable "io.max" and "io.low" in root
 - don't skip root group in tg_iops_limit() and tg_bps_limit()
 - don't skip root group in tg_conf_updated()

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

Comments

Tejun Heo Jan. 26, 2022, 5:29 p.m. UTC | #1
On Fri, Jan 14, 2022 at 05:30:00PM +0800, Yu Kuai wrote:
> RFC patch: https://lkml.org/lkml/2021/9/9/1432
> 
> There is a performance problem in our environment:
> 
> A host can provide a remote device to different clients. If one client is
> under high io pressure, other clients might be affected.
> 
> Limiting the overall iops/bps (io.max) from the client can fix the problem;
> however, the config files currently do not exist in the root cgroup, which
> makes this impossible.
> 
> This patch enables io throttle for the root cgroup:
>  - enable "io.max" and "io.low" in root
>  - don't skip root group in tg_iops_limit() and tg_bps_limit()
>  - don't skip root group in tg_conf_updated()
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

Yeah, I'm kinda split. It's a simple change with some utility, but it's also
something which doesn't fit with the cgroup feature or interface. It's
regulating the whole system behavior. There's no reason for any of the
control "groups" to be involved here and semantically the interface would
fit a lot better under /proc, /sys or some other system-wide location. Here
are some points to consider:

* As a comparison, it'd be rather absurd to enable memory.max at the system
  root in terms of interface, and it would most likely break a whole lot of
  mm operations.

* Resource control knobs of a cgroup belong to the parent, as the parent is
  responsible for divvying up the available resources to its children. Here
  too, the knobs make sense because there's a higher-level parent
  (whether that's a hypervisor or some network server).

Is your use case VMs or network attached storage?

Thanks.
Yu Kuai Jan. 27, 2022, 2:36 a.m. UTC | #2
On 2022/01/27 1:29, Tejun Heo wrote:
> On Fri, Jan 14, 2022 at 05:30:00PM +0800, Yu Kuai wrote:
>> RFC patch: https://lkml.org/lkml/2021/9/9/1432
>>
>> There is a performance problem in our environment:
>>
>> A host can provide a remote device to different clients. If one client is
>> under high io pressure, other clients might be affected.
>>
>> Limiting the overall iops/bps (io.max) from the client can fix the problem;
>> however, the config files currently do not exist in the root cgroup, which
>> makes this impossible.
>>
>> This patch enables io throttle for the root cgroup:
>>   - enable "io.max" and "io.low" in root
>>   - don't skip root group in tg_iops_limit() and tg_bps_limit()
>>   - don't skip root group in tg_conf_updated()
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> 
> Yeah, I'm kinda split. It's a simple change with some utility, but it's also
> something which doesn't fit with the cgroup feature or interface. It's
> regulating the whole system behavior. There's no reason for any of the
> control "groups" to be involved here and semantically the interface would
> fit a lot better under /proc, /sys or some other system-wide location. Here
> are some points to consider:
> 
> * As a comparison, it'd be rather absurd to enable memory.max at the system
>    root in terms of interface, and it would most likely break a whole lot of
>    mm operations.
> 
> * Resource control knobs of a cgroup belong to the parent, as the parent is
>    responsible for divvying up the available resources to its children. Here
>    too, the knobs make sense because there's a higher-level parent
>    (whether that's a hypervisor or some network server).
> 
> Is your use case VMs or network attached storage?
> 
Hi,

In our case, the disk is provided by a server, and such a disk can be shared
by multiple clients. Thus, for the client side, the server is a higher-level
parent.

Theoretically, limiting the io from the server for each client is feasible;
however, the main reason we don't want to do this is the following
shortcoming:

a client can still send an unlimited amount of io to the server; we can only
limit the amount of io that the server completes, which might cause too much
pressure on the server side.

Thanks,
Kuai
Tejun Heo Feb. 1, 2022, 5:20 p.m. UTC | #3
Hello,

On Thu, Jan 27, 2022 at 10:36:38AM +0800, yukuai (C) wrote:
> In our case, the disk is provided by a server, and such a disk can be shared
> by multiple clients. Thus, for the client side, the server is a higher-level
> parent.
> 
> Theoretically, limiting the io from the server for each client is feasible;
> however, the main reason we don't want to do this is the following
> shortcoming:
> 
> a client can still send an unlimited amount of io to the server; we can only
> limit the amount of io that the server completes, which might cause too much
> pressure on the server side.

I don't quite follow the "send an unlimited amount of io to the server"
part. Doesn't that get limited by the available number of requests? I.e., if
the server throttles, the in-flight requests will take longer to complete,
which exhausts the available requests and thus slows down the client. That's
how it's supposed to work on the local machine too.
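
(A minimal userspace sketch of that back-pressure, for illustration only:
a fixed pool of 64 slots stands in for the request pool; this models the
mechanism and is not kernel code.)

#include <semaphore.h>

static sem_t request_slots;	/* models the shared request pool */

static void submit_io(void)
{
	/* Blocks once all 64 slots are in flight: a slow (throttled)
	 * server keeps slots occupied longer, so submitters stall here. */
	sem_wait(&request_slots);
	/* ... hand the request to the server ... */
}

static void complete_io(void)
{
	/* Completion frees a slot; the later this happens, the slower
	 * the client can submit. */
	sem_post(&request_slots);
}

int main(void)
{
	sem_init(&request_slots, 0, 64);
	submit_io();
	complete_io();
	sem_destroy(&request_slots);
	return 0;
}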

Thanks.
Yu Kuai Feb. 8, 2022, 1:38 a.m. UTC | #4
On 2022/02/02 1:20, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jan 27, 2022 at 10:36:38AM +0800, yukuai (C) wrote:
>> In our case, the disk is provided by a server, and such a disk can be shared
>> by multiple clients. Thus, for the client side, the server is a higher-level
>> parent.
>>
>> Theoretically, limiting the io from the server for each client is feasible;
>> however, the main reason we don't want to do this is the following
>> shortcoming:
>>
>> a client can still send an unlimited amount of io to the server; we can only
>> limit the amount of io that the server completes, which might cause too much
>> pressure on the server side.
> 
> I don't quite follow the "send an unlimited amount of io to the server"
> part. Doesn't that get limited by the available number of requests?

Hi, Tejun

Here I mean that io is not limited through io throttling on the client side.
Of course, io must still be limited by the available number of requests.

> I.e., if the server throttles,
> the in-flight requests will take longer to complete, which exhausts the
> available requests and thus slows down the client.

For example, suppose we have 8 clients and the available number of requests
is 64.

1) If we don't limit iops, and each client sends 64 ios to the server, some
clients might have performance issues.

2) If we limit iops to 8 on each client, then the server can receive at most
64 ios at the same time; both clients and server should work fine.

3) If we limit iops to 8 for each client on the server, the clients should
work fine; however, the server can receive at most 64 x 8 = 512 ios at the
same time, which might cause too much pressure on the server (maybe bps is
more appropriate to explain the high pressure).

Thus I prefer to limit io from the client.
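
(A back-of-the-envelope sketch of the totals above; the queue depth of 64
and the per-client limit of 8 are the illustrative values from this
example, not measured numbers.)

#include <stdio.h>

int main(void)
{
	int clients = 8, queue_depth = 64, per_client_iops = 8;

	/* 2) client-side limit: the server sees at most 8 x 8 = 64 io */
	printf("client-side limit: %d io at the server\n",
	       clients * per_client_iops);
	/* 3) server-side limit: each client can still queue a full queue
	 * depth, so the server may receive 8 x 64 = 512 io */
	printf("server-side limit: %d io at the server\n",
	       clients * queue_depth);
	return 0;
}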

Thanks,
Kuai
> That's how it's supposed
> to work on the local machine too.
> 
> Thanks.
>
Tejun Heo Feb. 8, 2022, 6:49 p.m. UTC | #5
Hello,

On Tue, Feb 08, 2022 at 09:38:33AM +0800, yukuai (C) wrote:
> 3) If we limit iops to 8 for each client on the server, the clients should
> work fine; however, the server can receive at most 64 x 8 = 512 ios at the
> same time, which might cause too much pressure on the server (maybe bps is
> more appropriate to explain the high pressure).

I don't follow this part. Why can't the server control however it wants to
control?

Thanks.
Yu Kuai Feb. 9, 2022, 1:22 a.m. UTC | #6
On 2022/02/09 2:49, Tejun Heo wrote:
> Hello,
> 
> On Tue, Feb 08, 2022 at 09:38:33AM +0800, yukuai (C) wrote:
>> 3) If we limit iops to 8 for each client on the server, the clients should
>> work fine; however, the server can receive at most 64 x 8 = 512 ios at the
>> same time, which might cause too much pressure on the server (maybe bps is
>> more appropriate to explain the high pressure).
> 
> I don't follow this part. Why can't the server control however it wants to
> control?

Hi,

Do you agree that the server can't control how many ios it receives from one
client if we limit on the server side? I think the difference is that
limiting on the client side can control it...

Thanks,
Kuai
> 
> Thanks.
>
Ming Lei Feb. 9, 2022, 3:14 a.m. UTC | #7
Hello Yu Kuai,

On Fri, Jan 14, 2022 at 05:30:00PM +0800, Yu Kuai wrote:
> RFC patch: https://lkml.org/lkml/2021/9/9/1432
> 
> There is a performance problem in our environment:
> 
> A host can provide a remote device to different clients. If one client is
> under high io pressure, other clients might be affected.

Can you use standard Linux kernel storage terms to describe the issue?
For example, I guess here "host" means a target server (iscsi or nvme
target?), and "client" should be a scsi initiator or nvme host. If not, can
you provide one actual example of your storage use case?

With common terms used, it becomes much easier for people to understand and
solve the issue, and avoids misunderstanding.

> 
> Limiting the overall iops/bps (io.max) from the client can fix the problem;

Just curious: how can each client figure out the perfect iops/bps limit,
given that one client doesn't know how many clients are connected to the
target server?

It sounds like the throttling shouldn't be done in a client-side cgroup,
given that the throttling has nothing to do with tasks.

Maybe it should be done on the server side, since the server has enough
information to provide fair iops/bps allocation for each client.


Thanks, 
Ming
Ofir Gal Feb. 5, 2023, 3:55 p.m. UTC | #8
From: Ofir Gal <ofir@gal.software>

Hello Ming Lei,

I am trying to use cgroups v2 to throttle a media disk that is controlled by an NVMe target.
Unfortunately, it cannot be done without setting the limit in the root cgroup.
It can be done via cgroups v1. Yu Kuai's patch allows this to be accomplished.

My setup consists of 3 servers.
Server #1:
    a. SSD media disk (needs to be throttled to 100K IOPS)
    b. NVMe target controlling the SSD (1.a)

Server #2:
    a. NVMe initiator connected to Server #1's NVMe target (1.b)

Server #3:
    a. NVMe initiator connected to Server #1's NVMe target (1.b)

My setup accesses this media from multiple servers using NVMe over TCP,
but the initiator servers' workloads are unknown and can change dynamically.
I need to limit the media disk to 100K IOPS on the target side.

I have tried to limit the SSD on Server #1, but it seems that the NVMe
target kworkers are not affected unless I use Yu Kuai's patch.

Can you elaborate on the issues with this patch or how the scenario described above can be done with cgroups v2?

Best regards, Ofir Gal.
Michal Koutný Feb. 6, 2023, 3 p.m. UTC | #9
On Sun, Feb 05, 2023 at 05:55:41PM +0200, Ofir Gal <ofir.gal@volumez.com> wrote:
> I have tried to limit the SSD on Server #1, but it seems that the NVMe
> target kworkers are not affected unless I use Yu Kuai's patch.
> 
> Can you elaborate on the issues with this patch or how the scenario
> described above can be done with cgroups v2?

The issue is that if there's a client that doesn't implement
self-throttling, you cannot guarantee anything on the server side.
Hence the mechanism must exist on the server side.

The NVMe target should charge IO to the respective blkcgs (just generic
advice, I'm not familiar with that interface; see also the
kthread_associate_blkcg() use in the loop device driver).
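
(A hedged sketch of what that charging could look like, modeled on the loop
driver's use of kthread_associate_blkcg(); the nvmet_io_work struct and its
blkcg_css field are hypothetical, not the actual nvmet interface.)

#include <linux/kthread.h>
#include <linux/blk-cgroup.h>

/* Hypothetical per-work context; a real implementation would capture the
 * submitting bio's blkcg css when the work is queued. */
struct nvmet_io_work {
	struct cgroup_subsys_state *blkcg_css;
};

static void nvmet_process_io(struct nvmet_io_work *work)
{
	/* Charge io issued by this kthread to the originating blkcg so
	 * that blk-throttle limits configured for that cgroup apply. */
	kthread_associate_blkcg(work->blkcg_css);

	/* ... submit_bio() calls on behalf of the client go here ... */

	kthread_associate_blkcg(NULL);	/* detach when done */
}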

HTH,
Michal
Michal Koutný Feb. 6, 2023, 3:10 p.m. UTC | #10
Hello Kuai.

On Wed, Feb 09, 2022 at 09:22:30AM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
> Do you agree that the server can't control how many ios it receives from one
> client if we limit on the server side? I think the difference is that
> limiting on the client side can control it...

(Perhaps it depends on the protocol used for the IO, but) eventually the
client's requests would be noticeably lost/dropped, and that's how the
server propagates the requested control onto the clients without relying
on the clients throttling themselves.

(Maybe a better place to implement this would be a "testing" device mapper
target akin to the 'delay' one.)

Regards,
Michal

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7c462c006b26..ac25bfbbfe7f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -156,9 +156,6 @@  static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 	struct throtl_data *td;
 	uint64_t ret;
 
-	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
-		return U64_MAX;
-
 	td = tg->td;
 	ret = tg->bps[rw][td->limit_index];
 	if (ret == 0 && td->limit_index == LIMIT_LOW) {
@@ -186,9 +183,6 @@  static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	struct throtl_data *td;
 	unsigned int ret;
 
-	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
-		return UINT_MAX;
-
 	td = tg->td;
 	ret = tg->iops[rw][td->limit_index];
 	if (ret == 0 && tg->td->limit_index == LIMIT_LOW) {
@@ -1284,9 +1278,8 @@  static void tg_conf_updated(struct throtl_grp *tg, bool global)
 		struct throtl_grp *parent_tg;
 
 		tg_update_has_rules(this_tg);
-		/* ignore root/second level */
-		if (!cgroup_subsys_on_dfl(io_cgrp_subsys) || !blkg->parent ||
-		    !blkg->parent->parent)
+		/* ignore root level */
+		if (!cgroup_subsys_on_dfl(io_cgrp_subsys) || !blkg->parent)
 			continue;
 		parent_tg = blkg_to_tg(blkg->parent);
 		/*
@@ -1625,7 +1618,6 @@  static struct cftype throtl_files[] = {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	{
 		.name = "low",
-		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = tg_print_limit,
 		.write = tg_set_limit,
 		.private = LIMIT_LOW,
@@ -1633,7 +1625,6 @@  static struct cftype throtl_files[] = {
 #endif
 	{
 		.name = "max",
-		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = tg_print_limit,
 		.write = tg_set_limit,
 		.private = LIMIT_MAX,
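
(With this patch applied, a root-level limit could then be set by writing to
io.max at the cgroup2 mount point. A minimal sketch; the device number 8:0,
the limits, and the mount path are placeholders for the real setup.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Placeholder device (8:0) and limits; adjust for the real disk. */
	const char *limit = "8:0 riops=100000 wiops=100000";
	int fd = open("/sys/fs/cgroup/io.max", O_WRONLY);

	if (fd < 0) {
		perror("open /sys/fs/cgroup/io.max");
		return 1;
	}
	if (write(fd, limit, strlen(limit)) < 0)
		perror("write io.max");
	close(fd);
	return 0;
}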