diff mbox series

blk-throttle: fix lower bps rate by throtl_trim_slice()

Message ID 20250226011627.242912-1-yukuai1@huaweicloud.com (mailing list archive)
State New
Headers show
Series blk-throttle: fix lower bps rate by throtl_trim_slice() | expand

Commit Message

Yu Kuai Feb. 26, 2025, 1:16 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

The bio submission time may be a few jiffies more than the expected
waiting time, due to 'extra_bytes' can't be divided in
tg_within_bps_limit(), and also due to timer wakeup delay. In this
case, adjust slice_start to jiffies will discard the extra wait time,
causing lower rate than expected.

This problem will cause blktests throtl/001 failure in case of
CONFIG_HZ_100=y, fix it by preserving one finished slice in
throtl_trim_slice() and allowing deviation between [0, 2 slices).

For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes
per slice.

If user issues two 21 bytes IO, then wait time will be 30ms for the
first IO:

bytes_allowed = 20, extra_bytes = 1;
jiffy_wait = 1 + 2 = 3 jiffies

and consider
extra 1 jiffies by timer, throtl_trim_slice() will be called at:

jiffies = 40ms
slice_start = 0ms, slice_end= 40ms
bytes_disp = 21

In this case, before the patch, real rate in the first two slices is
10.5 bytes per slice, and slice will be updated to:

jiffies = 40ms
slice_start = 40ms, slice_end = 60ms,
bytes_disp = 0;

Hence the second IO will have to wait another 30ms;

With the patch, the real rate in the first slice is 20 bytes per slice,
which is the same as expected, and slice will be updated:

jiffies=40ms,
slice_start = 20ms, slice_end = 60ms,
bytes_disp = 1;

And now, there is still 19 bytes allowed in the second slice, and the
second IO will only have to wait 10ms;

Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy")
Reported-by: Ming Lei <ming.lei@redhat.com>
Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Comments

Ming Lei Feb. 26, 2025, 8:34 a.m. UTC | #1
On Wed, Feb 26, 2025 at 09:16:27AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> The bio submission time may be a few jiffies more than the expected
> waiting time, due to 'extra_bytes' can't be divided in
> tg_within_bps_limit(), and also due to timer wakeup delay. In this
> case, adjust slice_start to jiffies will discard the extra wait time,
> causing lower rate than expected.
> 
> This problem will cause blktests throtl/001 failure in case of
> CONFIG_HZ_100=y, fix it by preserving one finished slice in
> throtl_trim_slice() and allowing deviation between [0, 2 slices).

I think it only can cover single default slice deviation, since
throtl_trim_slice() just keeps dispatch data in the previous single
default slice. Or can you add words on how to allow 2 default slices
deviation?

> 
> For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
> slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes
> per slice.
> 
> If user issues two 21 bytes IO, then wait time will be 30ms for the
> first IO:
> 
> bytes_allowed = 20, extra_bytes = 1;
> jiffy_wait = 1 + 2 = 3 jiffies
> 
> and consider
> extra 1 jiffies by timer, throtl_trim_slice() will be called at:
> 
> jiffies = 40ms
> slice_start = 0ms, slice_end= 40ms
> bytes_disp = 21
> 
> In this case, before the patch, real rate in the first two slices is
> 10.5 bytes per slice, and slice will be updated to:
> 
> jiffies = 40ms
> slice_start = 40ms, slice_end = 60ms,
> bytes_disp = 0;
> 
> Hence the second IO will have to wait another 30ms;
> 
> With the patch, the real rate in the first slice is 20 bytes per slice,
> which is the same as expected, and slice will be updated:
> 
> jiffies=40ms,
> slice_start = 20ms, slice_end = 60ms,
> bytes_disp = 1;
> 
> And now, there is still 19 bytes allowed in the second slice, and the
> second IO will only have to wait 10ms;
> 
> Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy")
> Reported-by: Ming Lei <ming.lei@redhat.com>
> Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-throttle.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 8d149aff9fd0..cb472cf7b6b6 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -599,14 +599,23 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
>  	 * sooner, then we need to reduce slice_end. A high bogus slice_end
>  	 * is bad because it does not allow new slice to start.
>  	 */
> -
>  	throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
>  
>  	time_elapsed = rounddown(jiffies - tg->slice_start[rw],
>  				 tg->td->throtl_slice);
> -	if (!time_elapsed)
> +	/* Don't trim slice until at least 2 slices are used */
> +	if (time_elapsed < tg->td->throtl_slice * 2)
>  		return;
>  
> +	/*
> +	 * The bio submission time may be a few jiffies more than the expected
> +	 * waiting time, due to 'extra_bytes' can't be divided in
> +	 * tg_within_bps_limit(), and also due to timer wakeup delay. In this
> +	 * case, adjust slice_start to jiffies will discard the extra wait time,
> +	 * causing lower rate than expected. Therefore, one slice is preserved,
> +	 * allowing deviation that is less than two slices.
> +	 */
> +	time_elapsed -= tg->td->throtl_slice;

Please document that default slice window size is doubled actually in
this way.


Thanks,
Ming
Yu Kuai Feb. 26, 2025, 9:56 a.m. UTC | #2
Hi,

在 2025/02/26 16:34, Ming Lei 写道:
> On Wed, Feb 26, 2025 at 09:16:27AM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> The bio submission time may be a few jiffies more than the expected
>> waiting time, due to 'extra_bytes' can't be divided in
>> tg_within_bps_limit(), and also due to timer wakeup delay. In this
>> case, adjust slice_start to jiffies will discard the extra wait time,
>> causing lower rate than expected.
>>
>> This problem will cause blktests throtl/001 failure in case of
>> CONFIG_HZ_100=y, fix it by preserving one finished slice in
>> throtl_trim_slice() and allowing deviation between [0, 2 slices).
> 
> I think it only can cover single default slice deviation, since
> throtl_trim_slice() just keeps dispatch data in the previous single
> default slice. Or can you add words on how to allow 2 default slices
> deviation?
> 
>>
>> For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
>> slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes
>> per slice.
>>
>> If user issues two 21 bytes IO, then wait time will be 30ms for the
>> first IO:
>>
>> bytes_allowed = 20, extra_bytes = 1;
>> jiffy_wait = 1 + 2 = 3 jiffies
>>
>> and consider
>> extra 1 jiffies by timer, throtl_trim_slice() will be called at:
>>
>> jiffies = 40ms
>> slice_start = 0ms, slice_end= 40ms
>> bytes_disp = 21
>>
>> In this case, before the patch, real rate in the first two slices is
>> 10.5 bytes per slice, and slice will be updated to:
>>
>> jiffies = 40ms
>> slice_start = 40ms, slice_end = 60ms,
>> bytes_disp = 0;
>>
>> Hence the second IO will have to wait another 30ms;
>>
>> With the patch, the real rate in the first slice is 20 bytes per slice,
>> which is the same as expected, and slice will be updated:
>>
>> jiffies=40ms,
>> slice_start = 20ms, slice_end = 60ms,
>> bytes_disp = 1;
>>
>> And now, there is still 19 bytes allowed in the second slice, and the
>> second IO will only have to wait 10ms;
>>
>> Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy")
>> Reported-by: Ming Lei <ming.lei@redhat.com>
>> Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-throttle.c | 13 +++++++++++--
>>   1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index 8d149aff9fd0..cb472cf7b6b6 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -599,14 +599,23 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
>>   	 * sooner, then we need to reduce slice_end. A high bogus slice_end
>>   	 * is bad because it does not allow new slice to start.
>>   	 */
>> -
>>   	throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
>>   
>>   	time_elapsed = rounddown(jiffies - tg->slice_start[rw],
>>   				 tg->td->throtl_slice);
>> -	if (!time_elapsed)
>> +	/* Don't trim slice until at least 2 slices are used */
>> +	if (time_elapsed < tg->td->throtl_slice * 2)
>>   		return;
>>   
>> +	/*
>> +	 * The bio submission time may be a few jiffies more than the expected
>> +	 * waiting time, due to 'extra_bytes' can't be divided in
>> +	 * tg_within_bps_limit(), and also due to timer wakeup delay. In this
>> +	 * case, adjust slice_start to jiffies will discard the extra wait time,
>> +	 * causing lower rate than expected. Therefore, one slice is preserved,
>> +	 * allowing deviation that is less than two slices.
>> +	 */
>> +	time_elapsed -= tg->td->throtl_slice;
> 
> Please document that default slice window size is doubled actually in
> this way.

I said two slices because there is a round down:

 >>   	time_elapsed = rounddown(jiffies - tg->slice_start[rw],
 >>   				 tg->td->throtl_slice);

Hence the deviation is actually between [1 ,2) jiffies, depends on the
time start to wait and how long the delay is.

If start to wait at slice_start + n * throtl_slice - 1, the deviation is
*at most 1 slice*

If start to wait at slice_stat + n * throtl_slice, the max deviation is
*less than 2 slices* (2 slices not included)

Now, I agree allowing deviation at most 1 slice is more appropriate. :)

Thanks,
Kuai

> 
> 
> Thanks,
> Ming
> 
> 
> .
>
Ming Lei Feb. 26, 2025, 10:34 a.m. UTC | #3
On Wed, Feb 26, 2025 at 05:56:03PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/02/26 16:34, Ming Lei 写道:
> > On Wed, Feb 26, 2025 at 09:16:27AM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > The bio submission time may be a few jiffies more than the expected
> > > waiting time, due to 'extra_bytes' can't be divided in
> > > tg_within_bps_limit(), and also due to timer wakeup delay. In this
> > > case, adjust slice_start to jiffies will discard the extra wait time,
> > > causing lower rate than expected.
> > > 
> > > This problem will cause blktests throtl/001 failure in case of
> > > CONFIG_HZ_100=y, fix it by preserving one finished slice in
> > > throtl_trim_slice() and allowing deviation between [0, 2 slices).
> > 
> > I think it only can cover single default slice deviation, since
> > throtl_trim_slice() just keeps dispatch data in the previous single
> > default slice. Or can you add words on how to allow 2 default slices
> > deviation?
> > 
> > > 
> > > For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
> > > slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes
> > > per slice.
> > > 
> > > If user issues two 21 bytes IO, then wait time will be 30ms for the
> > > first IO:
> > > 
> > > bytes_allowed = 20, extra_bytes = 1;
> > > jiffy_wait = 1 + 2 = 3 jiffies
> > > 
> > > and consider
> > > extra 1 jiffies by timer, throtl_trim_slice() will be called at:
> > > 
> > > jiffies = 40ms
> > > slice_start = 0ms, slice_end= 40ms
> > > bytes_disp = 21
> > > 
> > > In this case, before the patch, real rate in the first two slices is
> > > 10.5 bytes per slice, and slice will be updated to:
> > > 
> > > jiffies = 40ms
> > > slice_start = 40ms, slice_end = 60ms,
> > > bytes_disp = 0;
> > > 
> > > Hence the second IO will have to wait another 30ms;
> > > 
> > > With the patch, the real rate in the first slice is 20 bytes per slice,
> > > which is the same as expected, and slice will be updated:
> > > 
> > > jiffies=40ms,
> > > slice_start = 20ms, slice_end = 60ms,
> > > bytes_disp = 1;
> > > 
> > > And now, there is still 19 bytes allowed in the second slice, and the
> > > second IO will only have to wait 10ms;
> > > 
> > > Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy")
> > > Reported-by: Ming Lei <ming.lei@redhat.com>
> > > Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/
> > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > ---
> > >   block/blk-throttle.c | 13 +++++++++++--
> > >   1 file changed, 11 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > > index 8d149aff9fd0..cb472cf7b6b6 100644
> > > --- a/block/blk-throttle.c
> > > +++ b/block/blk-throttle.c
> > > @@ -599,14 +599,23 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
> > >   	 * sooner, then we need to reduce slice_end. A high bogus slice_end
> > >   	 * is bad because it does not allow new slice to start.
> > >   	 */
> > > -
> > >   	throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
> > >   	time_elapsed = rounddown(jiffies - tg->slice_start[rw],
> > >   				 tg->td->throtl_slice);
> > > -	if (!time_elapsed)
> > > +	/* Don't trim slice until at least 2 slices are used */
> > > +	if (time_elapsed < tg->td->throtl_slice * 2)
> > >   		return;
> > > +	/*
> > > +	 * The bio submission time may be a few jiffies more than the expected
> > > +	 * waiting time, due to 'extra_bytes' can't be divided in
> > > +	 * tg_within_bps_limit(), and also due to timer wakeup delay. In this
> > > +	 * case, adjust slice_start to jiffies will discard the extra wait time,
> > > +	 * causing lower rate than expected. Therefore, one slice is preserved,
> > > +	 * allowing deviation that is less than two slices.
> > > +	 */
> > > +	time_elapsed -= tg->td->throtl_slice;
> > 
> > Please document that default slice window size is doubled actually in
> > this way.
> 
> I said two slices because there is a round down:
> 
> >>   	time_elapsed = rounddown(jiffies - tg->slice_start[rw],
> >>   				 tg->td->throtl_slice);
> 
> Hence the deviation is actually between [1 ,2) jiffies, depends on the
> time start to wait and how long the delay is.
> 
> If start to wait at slice_start + n * throtl_slice - 1, the deviation is
> *at most 1 slice*
> 
> If start to wait at slice_stat + n * throtl_slice, the max deviation is
> *less than 2 slices* (2 slices not included)
> 
> Now, I agree allowing deviation at most 1 slice is more appropriate. :)

Actually the in-tree code already covers deviation from the rounddown(),
but turns out it is still not enough, so this patch increases one extra
def slice size. With this thing is documented, feel free to add:

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming
Tejun Heo Feb. 26, 2025, 4:48 p.m. UTC | #4
On Wed, Feb 26, 2025 at 09:16:27AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> The bio submission time may be a few jiffies more than the expected
> waiting time, due to 'extra_bytes' can't be divided in
> tg_within_bps_limit(), and also due to timer wakeup delay. In this
> case, adjust slice_start to jiffies will discard the extra wait time,
> causing lower rate than expected.
> 
> This problem will cause blktests throtl/001 failure in case of
> CONFIG_HZ_100=y, fix it by preserving one finished slice in
> throtl_trim_slice() and allowing deviation between [0, 2 slices).
> 
> For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
> slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes
> per slice.
> 
> If user issues two 21 bytes IO, then wait time will be 30ms for the
> first IO:
> 
> bytes_allowed = 20, extra_bytes = 1;
> jiffy_wait = 1 + 2 = 3 jiffies
> 
> and consider
> extra 1 jiffies by timer, throtl_trim_slice() will be called at:
> 
> jiffies = 40ms
> slice_start = 0ms, slice_end= 40ms
> bytes_disp = 21
> 
> In this case, before the patch, real rate in the first two slices is
> 10.5 bytes per slice, and slice will be updated to:
> 
> jiffies = 40ms
> slice_start = 40ms, slice_end = 60ms,
> bytes_disp = 0;
> 
> Hence the second IO will have to wait another 30ms;
> 
> With the patch, the real rate in the first slice is 20 bytes per slice,
> which is the same as expected, and slice will be updated:
> 
> jiffies=40ms,
> slice_start = 20ms, slice_end = 60ms,
> bytes_disp = 1;
> 
> And now, there is still 19 bytes allowed in the second slice, and the
> second IO will only have to wait 10ms;
> 
> Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy")
> Reported-by: Ming Lei <ming.lei@redhat.com>
> Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
Yu Kuai Feb. 27, 2025, 2:49 a.m. UTC | #5
Hi,

在 2025/02/26 18:34, Ming Lei 写道:
> Actually the in-tree code already covers deviation from the rounddown(),
> but turns out it is still not enough, so this patch increases one extra
> def slice size. With this thing is documented, feel free to add:
> 
> Reviewed-by: Ming Lei<ming.lei@redhat.com>

Thanks for the review, will send a new verion.
Kuai
diff mbox series

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8d149aff9fd0..cb472cf7b6b6 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -599,14 +599,23 @@  static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	 * sooner, then we need to reduce slice_end. A high bogus slice_end
 	 * is bad because it does not allow new slice to start.
 	 */
-
 	throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
 
 	time_elapsed = rounddown(jiffies - tg->slice_start[rw],
 				 tg->td->throtl_slice);
-	if (!time_elapsed)
+	/* Don't trim slice until at least 2 slices are used */
+	if (time_elapsed < tg->td->throtl_slice * 2)
 		return;
 
+	/*
+	 * The bio submission time may be a few jiffies more than the expected
+	 * waiting time, due to 'extra_bytes' can't be divided in
+	 * tg_within_bps_limit(), and also due to timer wakeup delay. In this
+	 * case, adjust slice_start to jiffies will discard the extra wait time,
+	 * causing lower rate than expected. Therefore, one slice is preserved,
+	 * allowing deviation that is less than two slices.
+	 */
+	time_elapsed -= tg->td->throtl_slice;
 	bytes_trim = calculate_bytes_allowed(tg_bps_limit(tg, rw),
 					     time_elapsed) +
 		     tg->carryover_bytes[rw];