
[v7,1/9] blk-throttle: fix that io throttle can only work for single bio

Message ID 20220802140415.2960284-2-yukuai1@huaweicloud.com (mailing list archive)
State New, archived
Series bugfix and cleanup for blk-throttle

Commit Message

Yu Kuai Aug. 2, 2022, 2:04 p.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

Test scripts:
cd /sys/fs/cgroup/blkio/
echo "8:0 1024" > blkio.throttle.write_bps_device
echo $$ > cgroup.procs
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &

Test result:
10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s
10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s

The problem is that the second bio is finished after 10s instead of 20s.

Root cause:
1) second bio will be flaged:

__blk_throtl_bio
 while (true) {
  ...
  if (sq->nr_queued[rw]) -> some bio is throttled already
   break
 };
 bio_set_flag(bio, BIO_THROTTLED); -> flag the bio

2) flaged bio will be dispatched without waiting:

throtl_dispatch_tg
 tg_may_dispatch
  tg_with_in_bps_limit
   if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED))
    *wait = 0; -> wait time is zero
    return true;
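For illustration only, the short-circuit above can be modeled in user space. The struct and arithmetic below are a deliberately simplified sketch (the slice logic is collapsed into a one-second allowance), not the kernel's actual tg_with_in_bps_limit():

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the bps wait computation. bytes_disp plays the role of
 * tg->bytes_disp[rw]. */
struct toy_tg {
	unsigned long long bytes_disp; /* bytes already charged */
	unsigned long long bps_limit;  /* bytes per second */
};

static unsigned long long toy_wait_secs(struct toy_tg *tg,
					unsigned long long bio_size,
					bool bio_throttled,
					bool buggy_flag_check)
{
	/* the problematic short-circuit: flagged bios get zero wait */
	if (buggy_flag_check && bio_throttled)
		return 0;

	unsigned long long total = tg->bytes_disp + bio_size;

	if (total <= tg->bps_limit)
		return 0;
	return (total - tg->bps_limit) / tg->bps_limit;
}
```

With bps_limit = 1024 and two 10 KiB bios, the first waits about 9s; the second, already flagged BIO_THROTTLED, gets a zero wait from the buggy check instead of the ~19s it should wait, which matches the two ~10s dd results above.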

commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
support to count splited bios for iops limit, thus it adds flaged bio
checking in tg_with_in_bps_limit() so that splited bios will only count
once for bps limit, however, it introduce a new problem that io throttle
won't work if multiple bios are throttled.

In order to fix the problem, at first, don't skip flaged bio in
tg_with_in_bps_limit(), however, this will break that splited bios should
only count once for bps limit. And this patch tries to avoid
over-accounting by decrementing it first in __blk_throtl_bio(), and
then counting it again while dispatching it.
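The compensate-then-recount idea can be sketched with a toy user-space model (everything with a toy_ prefix is an illustrative stand-in, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* A flagged (already-charged) bio re-enters the throttle path; its byte
 * charge is undone there, because the charge function will add it back
 * when the bio is dispatched. */
struct toy_grp {
	unsigned long long bytes_disp;
};

static void toy_compensate(struct toy_grp *tg, unsigned long long bio_size,
			   bool bio_throttled)
{
	/* the guard mirrors the patch: if a new slice reset the counter,
	 * the compensation is skipped (and can therefore "fail") */
	if (bio_throttled && tg->bytes_disp >= bio_size)
		tg->bytes_disp -= bio_size;
}

/* models throtl_charge_bio() after this patch: charge unconditionally */
static void toy_charge(struct toy_grp *tg, unsigned long long bio_size)
{
	tg->bytes_disp += bio_size;
}
```

A bio charged once at queue time and charged again at dispatch nets out to a single charge.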

Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-throttle.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

Comments

Tejun Heo Aug. 16, 2022, 7:37 p.m. UTC | #1
On Tue, Aug 02, 2022 at 10:04:07PM +0800, Yu Kuai wrote:
...
> commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
> support to count splited bios for iops limit, thus it adds flaged bio
                                                             ^
                                                             flagged

> checking in tg_with_in_bps_limit() so that splited bios will only count
                                             ^
                                             split

> once for bps limit, however, it introduce a new problem that io throttle
> won't work if multiple bios are throttled.
> 
> In order to fix the problem, at first, don't skip flaged bio in
> tg_with_in_bps_limit(), however, this will break that splited bios should
> only count once for bps limit. And this patch tries to avoid
> over-accounting by decrementing it first in __blk_throtl_bio(), and
> then counting it again while dispatching it.
> 
> Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

Please cc stable w/ version tag.

> ---
>  block/blk-throttle.c | 26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 9f5fe62afff9..2957e2c643f4 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -811,7 +811,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>  	unsigned int bio_size = throtl_bio_data_size(bio);
>  
>  	/* no need to throttle if this bio's bytes have been accounted */
> -	if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED)) {
> +	if (bps_limit == U64_MAX) {
>  		if (wait)
>  			*wait = 0;
>  		return true;
> @@ -921,11 +921,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
>  	unsigned int bio_size = throtl_bio_data_size(bio);
>  
>  	/* Charge the bio to the group */
> -	if (!bio_flagged(bio, BIO_THROTTLED)) {
> -		tg->bytes_disp[rw] += bio_size;
> -		tg->last_bytes_disp[rw] += bio_size;
> -	}
> -
> +	tg->bytes_disp[rw] += bio_size;
> +	tg->last_bytes_disp[rw] += bio_size;
>  	tg->io_disp[rw]++;
>  	tg->last_io_disp[rw]++;
>  
> @@ -2121,6 +2118,23 @@ bool __blk_throtl_bio(struct bio *bio)
>  			tg->last_low_overflow_time[rw] = jiffies;
>  		throtl_downgrade_check(tg);
>  		throtl_upgrade_check(tg);
> +
> +		/*
> +		 * Splited bios can be re-entered because iops limit should be
                   ^                ^^^^^^^^^^^^^
                   Split            re-enter

> +		 * counted again, however, bps limit should not. Since bps limit
> +		 * will be counted again while dispatching it, compensate the
> +		 * over-accounting here. Noted that compensation can fail if
> +		 * new slice is started.

I can't really follow the comment. Please improve the explanation.

> +		 */
> +		if (bio_flagged(bio, BIO_THROTTLED)) {
> +			unsigned int bio_size = throtl_bio_data_size(bio);
> +
> +			if (tg->bytes_disp[rw] >= bio_size)
> +				tg->bytes_disp[rw] -= bio_size;
> +			if (tg->last_bytes_disp[rw] >= bio_size)
> +				tg->last_bytes_disp[rw] -= bio_size;
> +		}

So, as a fix for the immediate problem, I guess this might do but this feels
really fragile. How can we be certain that re-entering only happens because
of splitting? What if future core development changes that? It seems to be
solving the problem in the wrong place. Shouldn't we flag the bio indicating
that it's split when we're splitting the bio so that we only limit them for
iops in the first place?

Thanks.
Yu Kuai Aug. 17, 2022, 1:13 a.m. UTC | #2
Hi, Tejun

在 2022/08/17 3:37, Tejun Heo 写道:
> On Tue, Aug 02, 2022 at 10:04:07PM +0800, Yu Kuai wrote:
> ...
>> commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
>> support to count splited bios for iops limit, thus it adds flaged bio
>                                                               ^
>                                                               flagged
> 
>> checking in tg_with_in_bps_limit() so that splited bios will only count
>                                               ^
>                                               split
> 
>> once for bps limit, however, it introduce a new problem that io throttle
>> won't work if multiple bios are throttled.
>>
>> In order to fix the problem, at first, don't skip flaged bio in
>> tg_with_in_bps_limit(), however, this will break that splited bios should
>> only count once for bps limit. And this patch tries to avoid
>> over-accounting by decrementing it first in __blk_throtl_bio(), and
>> then counting it again while dispatching it.
>>
>> Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> 
> Please cc stable w/ version tag.
> 
>> ---
>>   block/blk-throttle.c | 26 ++++++++++++++++++++------
>>   1 file changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index 9f5fe62afff9..2957e2c643f4 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -811,7 +811,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>>   	unsigned int bio_size = throtl_bio_data_size(bio);
>>   
>>   	/* no need to throttle if this bio's bytes have been accounted */
>> -	if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED)) {
>> +	if (bps_limit == U64_MAX) {
>>   		if (wait)
>>   			*wait = 0;
>>   		return true;
>> @@ -921,11 +921,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
>>   	unsigned int bio_size = throtl_bio_data_size(bio);
>>   
>>   	/* Charge the bio to the group */
>> -	if (!bio_flagged(bio, BIO_THROTTLED)) {
>> -		tg->bytes_disp[rw] += bio_size;
>> -		tg->last_bytes_disp[rw] += bio_size;
>> -	}
>> -
>> +	tg->bytes_disp[rw] += bio_size;
>> +	tg->last_bytes_disp[rw] += bio_size;
>>   	tg->io_disp[rw]++;
>>   	tg->last_io_disp[rw]++;
>>   
>> @@ -2121,6 +2118,23 @@ bool __blk_throtl_bio(struct bio *bio)
>>   			tg->last_low_overflow_time[rw] = jiffies;
>>   		throtl_downgrade_check(tg);
>>   		throtl_upgrade_check(tg);
>> +
>> +		/*
>> +		 * Splited bios can be re-entered because iops limit should be
>                     ^                ^^^^^^^^^^^^^
>                     Split            re-enter
> 
>> +		 * counted again, however, bps limit should not. Since bps limit
>> +		 * will be counted again while dispatching it, compensate the
>> +		 * over-accounting here. Noted that compensation can fail if
>> +		 * new slice is started.
> 
> I can't really follow the comment. Please improve the explanation.
> 
>> +		 */
>> +		if (bio_flagged(bio, BIO_THROTTLED)) {
>> +			unsigned int bio_size = throtl_bio_data_size(bio);
>> +
>> +			if (tg->bytes_disp[rw] >= bio_size)
>> +				tg->bytes_disp[rw] -= bio_size;
>> +			if (tg->last_bytes_disp[rw] >= bio_size)
>> +				tg->last_bytes_disp[rw] -= bio_size;
>> +		}
> 
> So, as a fix for the immediate problem, I guess this might do but this feels
> really fragile. How can we be certain that re-entering only happens because
> of splitting? What if future core development changes that? It seems to be
> solving the problem in the wrong place. Shouldn't we flag the bio indicating
> that it's split when we're splitting the bio so that we only limit them for
> iops in the first place?
> 
Splited bio is tracked in __bio_clone:

if (bio_flagged(bio_src, BIO_THROTTLED))
	bio_set_flag(bio, BIO_THROTTLED);

And currently, the iops limit and bps limit are treated differently;
however, there is only one flag, 'BIO_THROTTLED', so they can't be
distinguished.

Perhaps I can use two flags, for example BIO_IOPS_THROTTLED and
BIO_BPS_THROTTLED, this way only iops limit can be handled and bps
limit can be skipped for splited bio.

What do you think?
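As a rough user-space sketch of what the proposed two-flag accounting might look like (the flag names follow the proposal above; everything else is an illustrative stand-in, not kernel code):

```c
#include <assert.h>

/* Proposed split: one flag per limit type. A bio that re-enters keeps
 * the bps flag (its bytes were already charged) but is counted again
 * for iops. */
enum {
	TOY_BIO_BPS_THROTTLED  = 1u << 0,
	TOY_BIO_IOPS_THROTTLED = 1u << 1,
};

struct toy_counters {
	unsigned long long bytes_disp;
	unsigned long long io_disp;
};

static void toy_charge_bio(struct toy_counters *tg, unsigned int *bi_flags,
			   unsigned long long bio_size)
{
	/* skip bps accounting only for bios already charged for bytes */
	if (!(*bi_flags & TOY_BIO_BPS_THROTTLED)) {
		tg->bytes_disp += bio_size;
		*bi_flags |= TOY_BIO_BPS_THROTTLED;
	}
	/* iops accounting happens on every (re)entry */
	tg->io_disp++;
	*bi_flags |= TOY_BIO_IOPS_THROTTLED;
}
```

Charging the same bio twice (once originally, once after a split re-enters the path) counts its bytes once but its iops twice.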

Thanks,
Kuai
> Thanks.
>
Tejun Heo Aug. 17, 2022, 5:50 p.m. UTC | #3
Hello,

On Wed, Aug 17, 2022 at 09:13:38AM +0800, Yu Kuai wrote:
> > So, as a fix for the immediate problem, I guess this might do but this feels
> > really fragile. How can we be certain that re-entering only happens because
> > of splitting? What if future core development changes that? It seems to be
> > solving the problem in the wrong place. Shouldn't we flag the bio indicating
> > that it's split when we're splitting the bio so that we only limit them for
> > iops in the first place?
>
> Splited bio is tracked in __bio_clone:

As the word is used in commit messages and comments, the past perfect form
of the verb "split" is "split". It looks like "splitted" is used in rare
cases but dictionary says it's an archaic form.

> if (bio_flagged(bio_src, BIO_THROTTLED))
> 	bio_set_flag(bio, BIO_THROTTLED);
> 
> And currenty, the iops limit and bps limit are treated differently,
> however there are only one flag 'BIO_THROTTLED' and they can't be
> distinguished.
> 
> Perhaps I can use two flags, for example BIO_IOPS_THROTTLED and
> BIO_BPS_THROTTLED, this way only iops limit can be handled and bps
> limit can be skipped for splited bio.
> 
> What do you think?

I think the code would be a lot more intuitive and less fragile if we used
two flags but the bits in the bi_flags field are a scarce resource
unfortunately. Even then, I think the right thing to do here is using two
flags.

Thanks.
Yu Kuai Aug. 18, 2022, 1:23 a.m. UTC | #4
Hi, Tejun!

在 2022/08/18 1:50, Tejun Heo 写道:
> Hello,
> 
> On Wed, Aug 17, 2022 at 09:13:38AM +0800, Yu Kuai wrote:
>>> So, as a fix for the immediate problem, I guess this might do but this feels
>>> really fragile. How can we be certain that re-entering only happens because
>>> of splitting? What if future core development changes that? It seems to be
>>> solving the problem in the wrong place. Shouldn't we flag the bio indicating
>>> that it's split when we're splitting the bio so that we only limit them for
>>> iops in the first place?
>>
>> Splited bio is tracked in __bio_clone:
> 
> As the word is used in commit messages and comments, the past perfect form
> of the verb "split" is "split". It looks like "splitted" is used in rare
> cases but dictionary says it's an archaic form.

Ok, thanks for pointing it out; I'll change that in the next iteration.
> 
>> if (bio_flagged(bio_src, BIO_THROTTLED))
>> 	bio_set_flag(bio, BIO_THROTTLED);
>>
>> And currenty, the iops limit and bps limit are treated differently,
>> however there are only one flag 'BIO_THROTTLED' and they can't be
>> distinguished.
>>
>> Perhaps I can use two flags, for example BIO_IOPS_THROTTLED and
>> BIO_BPS_THROTTLED, this way only iops limit can be handled and bps
>> limit can be skipped for splited bio.
>>
>> What do you think?
> 
> I think the code would be a lot more intuitive and less fragile if we used
> two flags but the bits in the bi_flags field are a scarce resource
> unfortunately. Even then, I think the right thing to do here is using two
> flags.

Yes, the field 'bio->bi_flags' is unsigned short, and there are only two
bits left. I'll use the new solution, which will require a new bit.

Thanks,
Kuai
> 
> Thanks.
>
Yu Kuai Aug. 22, 2022, 3:06 a.m. UTC | #5
Hi,

在 2022/08/18 9:23, Yu Kuai 写道:
> Hi, Tejun!
> 
> 在 2022/08/18 1:50, Tejun Heo 写道:
>> Hello,
>>
>> On Wed, Aug 17, 2022 at 09:13:38AM +0800, Yu Kuai wrote:
>>>> So, as a fix for the immediate problem, I guess this might do but 
>>>> this feels
>>>> really fragile. How can we be certain that re-entering only happens 
>>>> because
>>>> of splitting? What if future core development changes that? It seems 
>>>> to be
>>>> solving the problem in the wrong place. Shouldn't we flag the bio 
>>>> indicating
>>>> that it's split when we're splitting the bio so that we only limit 
>>>> them for
>>>> iops in the first place?
>>>
>>> Splited bio is tracked in __bio_clone:
>>
>> As the word is used in commit messages and comments, the past perfect 
>> form
>> of the verb "split" is "split". It looks like "splitted" is used in rare
>> cases but dictionary says it's an archaic form.
> 
> Ok, thanks for pointing it out, I'll change that in next iteration.
>>
>>> if (bio_flagged(bio_src, BIO_THROTTLED))
>>>     bio_set_flag(bio, BIO_THROTTLED);

While implementing the new method, I found that there seems to be a
misunderstanding here: the code seems to try to add the flag to the split
bio so that it won't be throttled again for the bps limit, however:

1) for blk-throttle, the split bio is issued directly and will never be
throttled again, while the original bio will go through the throttle path again.
2) if a cloned bio is directed to a new disk, the flag is cleared anyway.
>>>
>>> And currenty, the iops limit and bps limit are treated differently,
>>> however there are only one flag 'BIO_THROTTLED' and they can't be
>>> distinguished.
>>>
>>> Perhaps I can use two flags, for example BIO_IOPS_THROTTLED and
>>> BIO_BPS_THROTTLED, this way only iops limit can be handled and bps
>>> limit can be skipped for splited bio.
>>>
>>> What do you think?
>>
>> I think the code would be a lot more intuitive and less fragile if we 
>> used
>> two flags but the bits in the bi_flags field are a scarce resource
>> unfortunately. Even then, I think the right thing to do here is using two
>> flags.
> 
> Yes, the field 'bio->bi_flags' is unsigned short, and there are only two
> bits left. I'll use the new sulution which will acquire a new bit.
> 
> Thanks,
> Kuai
>>
>> Thanks.
>>
> 
> .
>
Tejun Heo Aug. 22, 2022, 7:25 a.m. UTC | #6
Hello,

On Mon, Aug 22, 2022 at 11:06:44AM +0800, Yu Kuai wrote:
> While implementing the new method, I found that there seems to be a
> misunderstanding here, the code seems to try to add flag to split bio
> so that it won't be throttled again for bps limit, however:
> 
> 1) for blk throttle, split bio is issued directly and will never be
> throttled again, while orignal bio will go through throttle path again.
> 2) if cloned bio is directed to a new disk, the flag is cleared anyway.

Doesn't that make the current code correct then? But you were seeing
incorrect behaviors, right?

Thanks.
Yu Kuai Aug. 22, 2022, 7:44 a.m. UTC | #7
Hi, Tejun

在 2022/08/22 15:25, Tejun Heo 写道:
> Hello,
> 
> On Mon, Aug 22, 2022 at 11:06:44AM +0800, Yu Kuai wrote:
>> While implementing the new method, I found that there seems to be a
>> misunderstanding here, the code seems to try to add flag to split bio
>> so that it won't be throttled again for bps limit, however:
>>
>> 1) for blk throttle, split bio is issued directly and will never be
>> throttled again, while orignal bio will go through throttle path again.
>> 2) if cloned bio is directed to a new disk, the flag is cleared anyway.
> 
> Doesn't that make the current code correct then? But you were seeing
> incorrect behaviors, right?

According to the commit message in commit 111be8839817 ("block-throttle:
avoid double charge"):

If the bio is cloned/split, we copy the flag to new bio too to avoid a
double charge.

This made me think the split bio would be resubmitted, and after
implementing the new solution, I found that the test result was not as
expected. After spending some time figuring out what was wrong, I found
that the split bio is dispatched directly from the caller, while the
original bio is resubmitted.

I guess commit 111be8839817 made a mistake; however, there should be
no problem, because the original bio is flagged already and it's handled
correctly.

Anyway, I removed the code in __bio_clone() and check the flag in
__bio_split_to_limits() in my patch:
--- a/block/bio.c
+++ b/block/bio.c
@@ -760,8 +760,6 @@ EXPORT_SYMBOL(bio_put);
  static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
  {
         bio_set_flag(bio, BIO_CLONED);
-       if (bio_flagged(bio_src, BIO_THROTTLED))
-               bio_set_flag(bio, BIO_THROTTLED);
         bio->bi_ioprio = bio_src->bi_ioprio;
         bio->bi_iter = bio_src->bi_iter;

diff --git a/block/blk-merge.c b/block/blk-merge.c
index ff04e9290715..10330bb038ca 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -358,6 +358,13 @@ struct bio *__bio_split_to_limits(struct bio *bio, struct queue_limits *lim,
                blkcg_bio_issue_init(split);
                bio_chain(split, bio);
                trace_block_split(split, bio->bi_iter.bi_sector);
+
+               /*
+                * The original bio will be resubmitted and throttled again;
+                * clear the iops flag so that it can be counted again for
+                * the iops limit.
+                */
+               if (bio_flagged(bio, BIO_IOPS_THROTTLED))
+                       bio_clear_flag(bio, BIO_IOPS_THROTTLED);
                submit_bio_noacct(bio);
                return split;
        }


> 
> Thanks.
>

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9f5fe62afff9..2957e2c643f4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -811,7 +811,7 @@  static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 	unsigned int bio_size = throtl_bio_data_size(bio);
 
 	/* no need to throttle if this bio's bytes have been accounted */
-	if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED)) {
+	if (bps_limit == U64_MAX) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -921,11 +921,8 @@  static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	unsigned int bio_size = throtl_bio_data_size(bio);
 
 	/* Charge the bio to the group */
-	if (!bio_flagged(bio, BIO_THROTTLED)) {
-		tg->bytes_disp[rw] += bio_size;
-		tg->last_bytes_disp[rw] += bio_size;
-	}
-
+	tg->bytes_disp[rw] += bio_size;
+	tg->last_bytes_disp[rw] += bio_size;
 	tg->io_disp[rw]++;
 	tg->last_io_disp[rw]++;
 
@@ -2121,6 +2118,23 @@  bool __blk_throtl_bio(struct bio *bio)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
 		throtl_upgrade_check(tg);
+
+		/*
+		 * Splited bios can be re-entered because iops limit should be
+		 * counted again, however, bps limit should not. Since bps limit
+		 * will be counted again while dispatching it, compensate the
+		 * over-accounting here. Noted that compensation can fail if
+		 * new slice is started.
+		 */
+		if (bio_flagged(bio, BIO_THROTTLED)) {
+			unsigned int bio_size = throtl_bio_data_size(bio);
+
+			if (tg->bytes_disp[rw] >= bio_size)
+				tg->bytes_disp[rw] -= bio_size;
+			if (tg->last_bytes_disp[rw] >= bio_size)
+				tg->last_bytes_disp[rw] -= bio_size;
+		}
+
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;