diff mbox

[PATCHSET,v5] Make background writeback great again for the first time

Message ID 20160427205915.GC25397@kernel.dk (mailing list archive)
State New, archived
Headers show

Commit Message

Jens Axboe April 27, 2016, 8:59 p.m. UTC
On Wed, Apr 27 2016, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
> > On 04/27/2016 12:01 PM, Jan Kara wrote:
> > >Hi,
> > >
> > >On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> > >>Since the dawn of time, our background buffered writeback has sucked.
> > >>When we do background buffered writeback, it should have little impact
> > >>on foreground activity. That's the definition of background activity...
> > >>But for as long as I can remember, heavy buffered writers have not
> > >>behaved like that. For instance, if I do something like this:
> > >>
> > >>$ dd if=/dev/zero of=foo bs=1M count=10k
> > >>
> > >>on my laptop, and then try and start chrome, it basically won't start
> > >>before the buffered writeback is done. Or, for server oriented
> > >>workloads, where installation of a big RPM (or similar) adversely
> > >>impacts database reads or sync writes. When that happens, I get people
> > >>yelling at me.
> > >>
> > >>I have posted plenty of results previously, I'll keep it shorter
> > >>this time. Here's a run on my laptop, using read-to-pipe-async for
> > >>reading a 5g file, and rewriting it. You can find this test program
> > >>in the fio git repo.
> > >
> > >I have tested your patchset on my test system. Generally I have observed
> > >noticeable drop in average throughput for heavy background writes without
> > >any other disk activity and also somewhat increased variance in the
> > >runtimes. It is most visible on this simple testcases:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > >
> > >and
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >
> > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> > >created before each dd run on a dedicated disk.
> > >
> > >Without your patches I get pretty stable dd runtimes for both cases:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > >Runtimes: 87.9611 87.3279 87.2554
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >Runtimes: 93.3502 93.2086 93.541
> > >
> > >With your patches the numbers look like:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > >Runtimes: 108.183, 97.184, 99.9587
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >Runtimes: 104.9, 102.775, 102.892
> > >
> > >I have checked whether the variance is due to some interaction with CFQ
> > >which is used for the disk. When I switched the disk to deadline, I still
> > >get some variance although, the throughput is still ~10% lower:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > >Runtimes: 100.417 100.643 100.866
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >Runtimes: 104.208 106.341 105.483
> > >
> > >The disk is rotational SATA drive with writeback cache, queue depth of the
> > >disk reported in /sys/block/sdb/device/queue_depth is 1.
> > >
> > >So I think we still need some tweaking on the low end of the storage
> > >spectrum so that we don't lose 10% of throughput for simple cases like
> > >this.
> > 
> > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
> > you are seeing smaller requests, and that is why it both varies and
> > you get lower throughput? I'll try and setup a test here similar to
> > yours.
> 
> Jan, care to try the below patch? I can't fully reproduce your issue on
> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
> a bit of a hack, but the general idea is to allow one more request to
> build up for QD=1 devices. That eliminates wait time between one request
> finishing, and the next being submitted.

That accidentally added a potentially stall, this one is both cleaner
and should have that fixed.

Comments

xiakaixu April 28, 2016, 4:06 a.m. UTC | #1
? 2016/4/28 4:59, Jens Axboe ??:
> On Wed, Apr 27 2016, Jens Axboe wrote:
>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>> On 04/27/2016 12:01 PM, Jan Kara wrote:
>>>> Hi,
>>>>
>>>> On Tue 26-04-16 09:55:23, Jens Axboe wrote:
>>>>> Since the dawn of time, our background buffered writeback has sucked.
>>>>> When we do background buffered writeback, it should have little impact
>>>>> on foreground activity. That's the definition of background activity...
>>>>> But for as long as I can remember, heavy buffered writers have not
>>>>> behaved like that. For instance, if I do something like this:
>>>>>
>>>>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>>>>
>>>>> on my laptop, and then try and start chrome, it basically won't start
>>>>> before the buffered writeback is done. Or, for server oriented
>>>>> workloads, where installation of a big RPM (or similar) adversely
>>>>> impacts database reads or sync writes. When that happens, I get people
>>>>> yelling at me.
>>>>>
>>>>> I have posted plenty of results previously, I'll keep it shorter
>>>>> this time. Here's a run on my laptop, using read-to-pipe-async for
>>>>> reading a 5g file, and rewriting it. You can find this test program
>>>>> in the fio git repo.
>>>>
>>>> I have tested your patchset on my test system. Generally I have observed
>>>> noticeable drop in average throughput for heavy background writes without
>>>> any other disk activity and also somewhat increased variance in the
>>>> runtimes. It is most visible on this simple testcases:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>
>>>> and
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>
>>>> The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
>>>> created before each dd run on a dedicated disk.
>>>>
>>>> Without your patches I get pretty stable dd runtimes for both cases:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 87.9611 87.3279 87.2554
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 93.3502 93.2086 93.541
>>>>
>>>> With your patches the numbers look like:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 108.183, 97.184, 99.9587
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 104.9, 102.775, 102.892
>>>>
>>>> I have checked whether the variance is due to some interaction with CFQ
>>>> which is used for the disk. When I switched the disk to deadline, I still
>>>> get some variance although, the throughput is still ~10% lower:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 100.417 100.643 100.866
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 104.208 106.341 105.483
>>>>
>>>> The disk is rotational SATA drive with writeback cache, queue depth of the
>>>> disk reported in /sys/block/sdb/device/queue_depth is 1.
>>>>
>>>> So I think we still need some tweaking on the low end of the storage
>>>> spectrum so that we don't lose 10% of throughput for simple cases like
>>>> this.
>>>
>>> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
>>> you are seeing smaller requests, and that is why it both varies and
>>> you get lower throughput? I'll try and setup a test here similar to
>>> yours.
>>
>> Jan, care to try the below patch? I can't fully reproduce your issue on
>> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
>> a bit of a hack, but the general idea is to allow one more request to
>> build up for QD=1 devices. That eliminates wait time between one request
>> finishing, and the next being submitted.
> 
> That accidentally added a potentially stall, this one is both cleaner
> and should have that fixed.
> 
> diff --git a/lib/wbt.c b/lib/wbt.c
> index 650da911f24f..322f5e04e994 100644
> --- a/lib/wbt.c
> +++ b/lib/wbt.c
> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>  	else
>  		limit = rwb->wb_normal;
Hi Jens,

This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
enough. It is not a big deal anyway :)


Another question about this if branch:

   if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
	limit = 0;

I can't follow the logic of this if branch. why set limit equal to 0
when the device supports write back caches and there are tasks being
limited in balance_dirty_pages(). Could you pelase give more info
about this ?  Thanks!
>  
> +	inflight = atomic_dec_return(&rwb->inflight);
> +
>  	/*
> -	 * Don't wake anyone up if we are above the normal limit. If
> -	 * throttling got disabled (limit == 0) with waiters, ensure
> -	 * that we wake them up.
> +	 * wbt got disabled with IO in flight. Wake up any potential
> +	 * waiters, we don't have to do more than that.
>  	 */
> -	inflight = atomic_dec_return(&rwb->inflight);
> -	if (limit && inflight >= limit) {
> -		if (!rwb->wb_max)
> -			wake_up_all(&rwb->wait);
> +	if (!rwb_enabled(rwb)) {
> +		wake_up_all(&rwb->wait);
>  		return;
>  	}

Maybe it is better that executing this if branch earlier. So we can wake up
potential waiters in time when wbt got disabled.
>  
> +	/*
> +	 * Don't wake anyone up if we are above the normal limit.
> +	 */
> +	if (inflight && inflight >= limit)
> +		return;
> +
>  	if (waitqueue_active(&rwb->wait)) {
>  		int diff = limit - inflight;
>  
> @@ -150,14 +155,26 @@ static void calc_wb_limits(struct rq_wb *rwb)
>  		return;
>  	}
>  
> -	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> -
>  	/*
> -	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +	 * For QD=1 devices, this is a special case. It's important for those
> +	 * to have one request ready when one completes, so force a depth of
> +	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
> +	 * since the device can't have more than that in flight.
>  	 */
> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	if (rwb->queue_depth == 1) {
> +		rwb->wb_max = rwb->wb_normal = 2;
> +		rwb->wb_background = 1;
> +	} else {
> +		depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> +		/*
> +		 * Reduce max depth by 50%, and re-calculate normal/bg based on
> +		 * that.
> +		 */
> +		rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> +		rwb->wb_normal = (rwb->wb_max + 1) / 2;
> +		rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	}
>  }
>  
>  static bool inline stat_sample_valid(struct blk_rq_stat *stat)
>
Jan Kara April 28, 2016, 11:54 a.m. UTC | #2
On Wed 27-04-16 14:59:15, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
> > On Wed, Apr 27 2016, Jens Axboe wrote:
> > > On 04/27/2016 12:01 PM, Jan Kara wrote:
> > > >Hi,
> > > >
> > > >On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> > > >>Since the dawn of time, our background buffered writeback has sucked.
> > > >>When we do background buffered writeback, it should have little impact
> > > >>on foreground activity. That's the definition of background activity...
> > > >>But for as long as I can remember, heavy buffered writers have not
> > > >>behaved like that. For instance, if I do something like this:
> > > >>
> > > >>$ dd if=/dev/zero of=foo bs=1M count=10k
> > > >>
> > > >>on my laptop, and then try and start chrome, it basically won't start
> > > >>before the buffered writeback is done. Or, for server oriented
> > > >>workloads, where installation of a big RPM (or similar) adversely
> > > >>impacts database reads or sync writes. When that happens, I get people
> > > >>yelling at me.
> > > >>
> > > >>I have posted plenty of results previously, I'll keep it shorter
> > > >>this time. Here's a run on my laptop, using read-to-pipe-async for
> > > >>reading a 5g file, and rewriting it. You can find this test program
> > > >>in the fio git repo.
> > > >
> > > >I have tested your patchset on my test system. Generally I have observed
> > > >noticeable drop in average throughput for heavy background writes without
> > > >any other disk activity and also somewhat increased variance in the
> > > >runtimes. It is most visible on this simple testcases:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > > >
> > > >and
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >
> > > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> > > >created before each dd run on a dedicated disk.
> > > >
> > > >Without your patches I get pretty stable dd runtimes for both cases:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > > >Runtimes: 87.9611 87.3279 87.2554
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >Runtimes: 93.3502 93.2086 93.541
> > > >
> > > >With your patches the numbers look like:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > > >Runtimes: 108.183, 97.184, 99.9587
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >Runtimes: 104.9, 102.775, 102.892
> > > >
> > > >I have checked whether the variance is due to some interaction with CFQ
> > > >which is used for the disk. When I switched the disk to deadline, I still
> > > >get some variance although, the throughput is still ~10% lower:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > > >Runtimes: 100.417 100.643 100.866
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >Runtimes: 104.208 106.341 105.483
> > > >
> > > >The disk is rotational SATA drive with writeback cache, queue depth of the
> > > >disk reported in /sys/block/sdb/device/queue_depth is 1.
> > > >
> > > >So I think we still need some tweaking on the low end of the storage
> > > >spectrum so that we don't lose 10% of throughput for simple cases like
> > > >this.
> > > 
> > > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
> > > you are seeing smaller requests, and that is why it both varies and
> > > you get lower throughput? I'll try and setup a test here similar to
> > > yours.
> > 
> > Jan, care to try the below patch? I can't fully reproduce your issue on
> > a SCSI disk limited to QD=1, but I have a feeling this might help. It's
> > a bit of a hack, but the general idea is to allow one more request to
> > build up for QD=1 devices. That eliminates wait time between one request
> > finishing, and the next being submitted.
> 
> That accidentally added a potentially stall, this one is both cleaner
> and should have that fixed.
> 
..
> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	if (rwb->queue_depth == 1) {
> +		rwb->wb_max = rwb->wb_normal = 2;
> +		rwb->wb_background = 1;

This breaks the detection of too big scale_step in scale_up() where we key
of wb_max == 1 value. However even with that fixed no luck :(:

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtime: 105.126 107.125 105.641

So about the same as before. I'll try to debug this later today...

								Honza
Jens Axboe April 28, 2016, 6:36 p.m. UTC | #3
On 04/27/2016 10:06 PM, xiakaixu wrote:
>> diff --git a/lib/wbt.c b/lib/wbt.c
>> index 650da911f24f..322f5e04e994 100644
>> --- a/lib/wbt.c
>> +++ b/lib/wbt.c
>> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>>   	else
>>   		limit = rwb->wb_normal;
> Hi Jens,
>
> This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
> enough. It is not a big deal anyway :)

I'll clean that up, thanks for noticing. No functional difference.

> Another question about this if branch:
>
>     if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
> 	limit = 0;
>
> I can't follow the logic of this if branch. why set limit equal to 0
> when the device supports write back caches and there are tasks being
> limited in balance_dirty_pages(). Could you pelase give more info
> about this ?  Thanks!

Sure. So for write back caching, we have to try a bit harder to ensure 
that the device doesn't build up long internal queues with a lot of 
dirty data in the cache. So for the case where we have write back 
caching AND we don't have anyone waiting for the IO, allow the queue 
depth to drain to zero before building it back up again.

Does that make sense?

>>
>> +	inflight = atomic_dec_return(&rwb->inflight);
>> +
>>   	/*
>> -	 * Don't wake anyone up if we are above the normal limit. If
>> -	 * throttling got disabled (limit == 0) with waiters, ensure
>> -	 * that we wake them up.
>> +	 * wbt got disabled with IO in flight. Wake up any potential
>> +	 * waiters, we don't have to do more than that.
>>   	 */
>> -	inflight = atomic_dec_return(&rwb->inflight);
>> -	if (limit && inflight >= limit) {
>> -		if (!rwb->wb_max)
>> -			wake_up_all(&rwb->wait);
>> +	if (!rwb_enabled(rwb)) {
>> +		wake_up_all(&rwb->wait);
>>   		return;
>>   	}
>
> Maybe it is better that executing this if branch earlier. So we can wake up
> potential waiters in time when wbt got disabled.

The !rwb_enabled() case will only happen if someone disabled wbt while 
we had tracked IO in flight. We have to it below the 
atomic_dec_return(), so we could reorder that to be at the front. 
Ideally we just want it out-of-line instead, as it's the unexpected 
slower path.
Jens Axboe April 28, 2016, 6:46 p.m. UTC | #4
On 04/28/2016 05:54 AM, Jan Kara wrote:
> On Wed 27-04-16 14:59:15, Jens Axboe wrote:
>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>>> On 04/27/2016 12:01 PM, Jan Kara wrote:
>>>>> Hi,
>>>>>
>>>>> On Tue 26-04-16 09:55:23, Jens Axboe wrote:
>>>>>> Since the dawn of time, our background buffered writeback has sucked.
>>>>>> When we do background buffered writeback, it should have little impact
>>>>>> on foreground activity. That's the definition of background activity...
>>>>>> But for as long as I can remember, heavy buffered writers have not
>>>>>> behaved like that. For instance, if I do something like this:
>>>>>>
>>>>>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>>>>>
>>>>>> on my laptop, and then try and start chrome, it basically won't start
>>>>>> before the buffered writeback is done. Or, for server oriented
>>>>>> workloads, where installation of a big RPM (or similar) adversely
>>>>>> impacts database reads or sync writes. When that happens, I get people
>>>>>> yelling at me.
>>>>>>
>>>>>> I have posted plenty of results previously, I'll keep it shorter
>>>>>> this time. Here's a run on my laptop, using read-to-pipe-async for
>>>>>> reading a 5g file, and rewriting it. You can find this test program
>>>>>> in the fio git repo.
>>>>>
>>>>> I have tested your patchset on my test system. Generally I have observed
>>>>> noticeable drop in average throughput for heavy background writes without
>>>>> any other disk activity and also somewhat increased variance in the
>>>>> runtimes. It is most visible on this simple testcases:
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>>
>>>>> and
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>>
>>>>> The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
>>>>> created before each dd run on a dedicated disk.
>>>>>
>>>>> Without your patches I get pretty stable dd runtimes for both cases:
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>> Runtimes: 87.9611 87.3279 87.2554
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>> Runtimes: 93.3502 93.2086 93.541
>>>>>
>>>>> With your patches the numbers look like:
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>> Runtimes: 108.183, 97.184, 99.9587
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>> Runtimes: 104.9, 102.775, 102.892
>>>>>
>>>>> I have checked whether the variance is due to some interaction with CFQ
>>>>> which is used for the disk. When I switched the disk to deadline, I still
>>>>> get some variance although, the throughput is still ~10% lower:
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>> Runtimes: 100.417 100.643 100.866
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>> Runtimes: 104.208 106.341 105.483
>>>>>
>>>>> The disk is rotational SATA drive with writeback cache, queue depth of the
>>>>> disk reported in /sys/block/sdb/device/queue_depth is 1.
>>>>>
>>>>> So I think we still need some tweaking on the low end of the storage
>>>>> spectrum so that we don't lose 10% of throughput for simple cases like
>>>>> this.
>>>>
>>>> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
>>>> you are seeing smaller requests, and that is why it both varies and
>>>> you get lower throughput? I'll try and setup a test here similar to
>>>> yours.
>>>
>>> Jan, care to try the below patch? I can't fully reproduce your issue on
>>> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
>>> a bit of a hack, but the general idea is to allow one more request to
>>> build up for QD=1 devices. That eliminates wait time between one request
>>> finishing, and the next being submitted.
>>
>> That accidentally added a potentially stall, this one is both cleaner
>> and should have that fixed.
>>
> ..
>> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
>> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
>> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
>> +	if (rwb->queue_depth == 1) {
>> +		rwb->wb_max = rwb->wb_normal = 2;
>> +		rwb->wb_background = 1;
>
> This breaks the detection of too big scale_step in scale_up() where we key
> of wb_max == 1 value. However even with that fixed no luck :(:

Yeah, I need to look at that. For QD=1, I think the only sensible values 
for max/normal/bg is 2/2/1 and 1/1/1 if we step down.

> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> Runtime: 105.126 107.125 105.641
>
> So about the same as before. I'll try to debug this later today...

Thanks, I'm very interested in what you find!
Jan Kara May 3, 2016, 12:17 p.m. UTC | #5
On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> >>+	if (rwb->queue_depth == 1) {
> >>+		rwb->wb_max = rwb->wb_normal = 2;
> >>+		rwb->wb_background = 1;
> >
> >This breaks the detection of too big scale_step in scale_up() where we key
> >of wb_max == 1 value. However even with that fixed no luck :(:
> 
> Yeah, I need to look at that. For QD=1, I think the only sensible values for
> max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> 
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> >Runtime: 105.126 107.125 105.641
> >
> >So about the same as before. I'll try to debug this later today...
> 
> Thanks, I'm very interested in what you find!

OK, so the reason was relatively standard in the end. I was using ext3 (or
more exactly ext4 without delayed allocation) for the test. The throttling
of background writes gave more priority to writes from the journalling
thread which happen with WRITE_SYNC and thus are not throttled. Thus the
journalling thread ended up having to do more data writeback to be able to
commit a transaction (due to requirements of data=ordered mode) and it is
less efficient at that than the normal flusher thread.

So this is an example where throttling background writeback effectively
just pushes more work into another context which does it less efficiently
and indirectly makes everyone wait for it. ext3 has been always sensitive to
issues like this. ext4 is using delayed allocation and thus only data
writes into holes end up being part of a transaction -> simple dd test case
doesn't hit that path. And indeed when I repeat the same test with ext4,
the numbers with and without your patch are exactly the same.

The question remains how common a pattern where throttling of background
writeback delays also something else is. I'll schedule a couple of
benchmarks to measure impact of your patches for a wider range of workloads
(but sadly pretty limited set of hw). If ext3 is the only one seeing
issues, I would be willing to accept that ext3 takes the hit since it is
doing something rather stupid (but inherent in its journal design) and we
have a way to deal with this either by enabling delayed allocation or by
turning off the writeback throttling...

								Honza
Chris Mason May 3, 2016, 12:40 p.m. UTC | #6
On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > >>+	if (rwb->queue_depth == 1) {
> > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > >>+		rwb->wb_background = 1;
> > >
> > >This breaks the detection of too big scale_step in scale_up() where we key
> > >of wb_max == 1 value. However even with that fixed no luck :(:
> > 
> > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > 
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >Runtime: 105.126 107.125 105.641
> > >
> > >So about the same as before. I'll try to debug this later today...
> > 
> > Thanks, I'm very interested in what you find!
> 
> OK, so the reason was relatively standard in the end. I was using ext3 (or
> more exactly ext4 without delayed allocation) for the test. The throttling
> of background writes gave more priority to writes from the journalling
> thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> journalling thread ended up having to do more data writeback to be able to
> commit a transaction (due to requirements of data=ordered mode) and it is
> less efficient at that than the normal flusher thread.
> 
> So this is an example where throttling background writeback effectively
> just pushes more work into another context which does it less efficiently
> and indirectly makes everyone wait for it. ext3 has been always sensitive to
> issues like this. ext4 is using delayed allocation and thus only data
> writes into holes end up being part of a transaction -> simple dd test case
> doesn't hit that path. And indeed when I repeat the same test with ext4,
> the numbers with and without your patch are exactly the same.
> 
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

At least in the case of io that we know is going to be data=ordered, we
can bump the prio of those pages?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 3, 2016, 1:06 p.m. UTC | #7
On Tue 03-05-16 08:40:11, Chris Mason wrote:
> On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > >>+	if (rwb->queue_depth == 1) {
> > > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > > >>+		rwb->wb_background = 1;
> > > >
> > > >This breaks the detection of too big scale_step in scale_up() where we key
> > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > 
> > > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > 
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >Runtime: 105.126 107.125 105.641
> > > >
> > > >So about the same as before. I'll try to debug this later today...
> > > 
> > > Thanks, I'm very interested in what you find!
> > 
> > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > more exactly ext4 without delayed allocation) for the test. The throttling
> > of background writes gave more priority to writes from the journalling
> > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > journalling thread ended up having to do more data writeback to be able to
> > commit a transaction (due to requirements of data=ordered mode) and it is
> > less efficient at that than the normal flusher thread.
> > 
> > So this is an example where throttling background writeback effectively
> > just pushes more work into another context which does it less efficiently
> > and indirectly makes everyone wait for it. ext3 has been always sensitive to
> > issues like this. ext4 is using delayed allocation and thus only data
> > writes into holes end up being part of a transaction -> simple dd test case
> > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > the numbers with and without your patch are exactly the same.
> > 
> > The question remains how common a pattern where throttling of background
> > writeback delays also something else is. I'll schedule a couple of
> > benchmarks to measure impact of your patches for a wider range of workloads
> > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > issues, I would be willing to accept that ext3 takes the hit since it is
> > doing something rather stupid (but inherent in its journal design) and we
> > have a way to deal with this either by enabling delayed allocation or by
> > turning off the writeback throttling...
> 
> At least in the case of io that we know is going to be data=ordered, we
> can bump the prio of those pages?

But how would flusher thread, which is submitting IO, know that? We would
have to somehow mark inodes that are part of the running transaction and
flusher thread could give more priority to such writeback - e.g. by using
WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
it could be doable.

								Honza
Chris Mason May 3, 2016, 1:42 p.m. UTC | #8
On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote:
> On Tue 03-05-16 08:40:11, Chris Mason wrote:
> > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > > >>+	if (rwb->queue_depth == 1) {
> > > > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > > > >>+		rwb->wb_background = 1;
> > > > >
> > > > >This breaks the detection of too big scale_step in scale_up() where we key
> > > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > > 
> > > > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > > 
> > > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > > >Runtime: 105.126 107.125 105.641
> > > > >
> > > > >So about the same as before. I'll try to debug this later today...
> > > > 
> > > > Thanks, I'm very interested in what you find!
> > > 
> > > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > > more exactly ext4 without delayed allocation) for the test. The throttling
> > > of background writes gave more priority to writes from the journalling
> > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > > journalling thread ended up having to do more data writeback to be able to
> > > commit a transaction (due to requirements of data=ordered mode) and it is
> > > less efficient at that than the normal flusher thread.
> > > 
> > > So this is an example where throttling background writeback effectively
> > > just pushes more work into another context which does it less efficiently
> > > and indirectly makes everyone wait for it. ext3 has been always sensitive to
> > > issues like this. ext4 is using delayed allocation and thus only data
> > > writes into holes end up being part of a transaction -> simple dd test case
> > > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > > the numbers with and without your patch are exactly the same.
> > > 
> > > The question remains how common a pattern where throttling of background
> > > writeback delays also something else is. I'll schedule a couple of
> > > benchmarks to measure impact of your patches for a wider range of workloads
> > > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > > issues, I would be willing to accept that ext3 takes the hit since it is
> > > doing something rather stupid (but inherent in its journal design) and we
> > > have a way to deal with this either by enabling delayed allocation or by
> > > turning off the writeback throttling...
> > 
> > At least in the case of io that we know is going to be data=ordered, we
> > can bump the prio of those pages?
> 
> But how would flusher thread, which is submitting IO, know that? We would
> have to somehow mark inodes that are part of the running transaction and
> flusher thread could give more priority to such writeback - e.g. by using
> WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
> it could be doable.

This would be specific to the data=ordered code in the FS.  If there's
some way to test for an inode or a page's status in the data=ordered
list, the FS writepages call could flag the IO as higher prio?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 3, 2016, 1:57 p.m. UTC | #9
On Tue 03-05-16 09:42:40, Chris Mason wrote:
> On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote:
> > On Tue 03-05-16 08:40:11, Chris Mason wrote:
> > > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > > > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > > > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > > > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > > > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > > > >>+	if (rwb->queue_depth == 1) {
> > > > > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > > > > >>+		rwb->wb_background = 1;
> > > > > >
> > > > > >This breaks the detection of too big scale_step in scale_up() where we key
> > > > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > > > 
> > > > > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > > > 
> > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > > > >Runtime: 105.126 107.125 105.641
> > > > > >
> > > > > >So about the same as before. I'll try to debug this later today...
> > > > > 
> > > > > Thanks, I'm very interested in what you find!
> > > > 
> > > > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > > > more exactly ext4 without delayed allocation) for the test. The throttling
> > > > of background writes gave more priority to writes from the journalling
> > > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > > > journalling thread ended up having to do more data writeback to be able to
> > > > commit a transaction (due to requirements of data=ordered mode) and it is
> > > > less efficient at that than the normal flusher thread.
> > > > 
> > > > So this is an example where throttling background writeback effectively
> > > > just pushes more work into another context which does it less efficiently
> > > > and indirectly makes everyone wait for it. ext3 has been always sensitive to
> > > > issues like this. ext4 is using delayed allocation and thus only data
> > > > writes into holes end up being part of a transaction -> simple dd test case
> > > > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > > > the numbers with and without your patch are exactly the same.
> > > > 
> > > > The question remains how common a pattern where throttling of background
> > > > writeback delays also something else is. I'll schedule a couple of
> > > > benchmarks to measure impact of your patches for a wider range of workloads
> > > > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > > > issues, I would be willing to accept that ext3 takes the hit since it is
> > > > doing something rather stupid (but inherent in its journal design) and we
> > > > have a way to deal with this either by enabling delayed allocation or by
> > > > turning off the writeback throttling...
> > > 
> > > At least in the case of io that we know is going to be data=ordered, we
> > > can bump the prio of those pages?
> > 
> > But how would flusher thread, which is submitting IO, know that? We would
> > have to somehow mark inodes that are part of the running transaction and
> > flusher thread could give more priority to such writeback - e.g. by using
> > WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
> > it could be doable.
> 
> This would be specific to the data=ordered code in the FS.  If there's
> some way to test for an inode or a page's status in the data=ordered
> list, the FS writepages call could flag the IO as higher prio?

Oh, right, we could do that. I can experiment with that later.

								Honza
Jan Kara May 11, 2016, 4:36 p.m. UTC | #10
On Tue 03-05-16 14:17:19, Jan Kara wrote:
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

So I've run some benchmarks on a machine with 6 GB of RAM and SSD with
queue depth 32. The filesystem on the disk was XFS this time. I've found
couple of regressions. A clear one is with dbench (version 4). The average
throughput numbers look like:

			Baseline		WBT
Hmean    mb/sec-1         30.26 (  0.00%)       18.67 (-38.28%)
Hmean    mb/sec-2         40.71 (  0.00%)       31.25 (-23.23%)
Hmean    mb/sec-4         52.67 (  0.00%)       46.83 (-11.09%)
Hmean    mb/sec-8         69.51 (  0.00%)       64.35 ( -7.42%)
Hmean    mb/sec-16        91.07 (  0.00%)       86.46 ( -5.07%)
Hmean    mb/sec-32       115.10 (  0.00%)      110.29 ( -4.18%)
Hmean    mb/sec-64       145.14 (  0.00%)      134.97 ( -7.00%)
Hmean    mb/sec-512       93.99 (  0.00%)      133.85 ( 42.41%)

There were also some losses in a filebench webproxy workload (I can give
you exact details of the settings if you want to reproduce it).

Also, and this really puzzles me, I've seen higher read latencies in some
cases (I've verified they are not just noise by rerunning the test for
kernel with writeback throttling patches). For example with the following
fio job file:

[global]
direct=0
ioengine=sync
runtime=300
time_based
invalidate=1
blocksize=4096
size=10g        # Just random value, we are running time based workload
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=1g
fdatasync=256
readwrite=randwrite
numjobs=4

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

I get the following results:

Throughput			Baseline		WBT
Hmean    kb/sec-writer-write      591.60 (  0.00%)      507.00 (-14.30%)
Hmean    kb/sec-reader-read       211.81 (  0.00%)      137.53 (-35.07%)

So both read and write throughput have suffered. And latencies don't offset
for the loss either:

FIO read latency
Min         latency-read     1383.00 (  0.00%)     1519.00 ( -9.83%)
1st-qrtle   latency-read     3485.00 (  0.00%)     5235.00 (-50.22%)
2nd-qrtle   latency-read     4708.00 (  0.00%)    15028.00 (-219.20%)
3rd-qrtle   latency-read    10286.00 (  0.00%)    57622.00 (-460.20%)
Max-90%     latency-read   195834.00 (  0.00%)   167149.00 ( 14.65%)
Max-93%     latency-read   273145.00 (  0.00%)   200319.00 ( 26.66%)
Max-95%     latency-read   335434.00 (  0.00%)   220695.00 ( 34.21%)
Max-99%     latency-read   537017.00 (  0.00%)   347174.00 ( 35.35%)
Max         latency-read   991101.00 (  0.00%)   485835.00 ( 50.98%)
Mean        latency-read    51282.79 (  0.00%)    49953.95 (  2.59%)

So we have reduced the extra high read latencies which is nice but on
average there is no change.

And another fio jobfile which doesn't look great:

[global]
direct=0
ioengine=sync
runtime=300
blocksize=4096
invalidate=1
time_based
ramp_time=5     # Let the flusher thread start before taking measurements
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=$((MEMTOTAL_BYTES*2))
readwrite=randwrite

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

The throughput numbers look like:
Hmean    kb/sec-writer-write    24707.22 (  0.00%)    19912.23 (-19.41%)
Hmean    kb/sec-reader-read       886.65 (  0.00%)      905.71 (  2.15%)

So we've got significant hit in writes not really offset by a big increase
in reads. Read latency numbers look like (I show the WBT numbers for two runs
just so that one can see how variable the latency numbers are because I was
puzzled by very high max latency for WBT kernels - quartiles seem rather
stable higher percentiles and min/max are rather variable):

			   Baseline		WBT			WBT
Min         latency-read     1230.00 (  0.00%)     1560.00 (-26.83%)	1100.00 ( 10.57%)
1st-qrtle   latency-read     3357.00 (  0.00%)     3351.00 (  0.18%)	3351.00 (  0.18%)
2nd-qrtle   latency-read     4074.00 (  0.00%)     4056.00 (  0.44%)	4022.00 (  1.28%)
3rd-qrtle   latency-read     5198.00 (  0.00%)     5145.00 (  1.02%)	5095.00 (  1.98%)
Max-90%     latency-read     6594.00 (  0.00%)     6370.00 (  3.40%)	6130.00 (  7.04%)
Max-93%     latency-read    11251.00 (  0.00%)     9410.00 ( 16.36%)	6654.00 ( 40.86%)
Max-95%     latency-read    14769.00 (  0.00%)    13231.00 ( 10.41%)	10306.00 ( 30.22%)
Max-99%     latency-read    27826.00 (  0.00%)    28728.00 ( -3.24%)	25077.00 (  9.88%)
Max         latency-read    80202.00 (  0.00%)   186491.00 (-132.53%)	141346.00 (-76.24%)
Mean        latency-read     5356.12 (  0.00%)     5229.00 (  2.37%)	4927.23 (  8.01%)

I have run also other tests but they have mostly shown no significant
difference.

								Honza
Jens Axboe May 13, 2016, 6:29 p.m. UTC | #11
On 05/11/2016 10:36 AM, Jan Kara wrote:
> On Tue 03-05-16 14:17:19, Jan Kara wrote:
>> The question remains how common a pattern where throttling of background
>> writeback delays also something else is. I'll schedule a couple of
>> benchmarks to measure impact of your patches for a wider range of workloads
>> (but sadly pretty limited set of hw). If ext3 is the only one seeing
>> issues, I would be willing to accept that ext3 takes the hit since it is
>> doing something rather stupid (but inherent in its journal design) and we
>> have a way to deal with this either by enabling delayed allocation or by
>> turning off the writeback throttling...
>
> So I've run some benchmarks on a machine with 6 GB of RAM and SSD with
> queue depth 32. The filesystem on the disk was XFS this time. I've found
> couple of regressions. A clear one is with dbench (version 4). The average
> throughput numbers look like:
>
> 			Baseline		WBT
> Hmean    mb/sec-1         30.26 (  0.00%)       18.67 (-38.28%)
> Hmean    mb/sec-2         40.71 (  0.00%)       31.25 (-23.23%)
> Hmean    mb/sec-4         52.67 (  0.00%)       46.83 (-11.09%)
> Hmean    mb/sec-8         69.51 (  0.00%)       64.35 ( -7.42%)
> Hmean    mb/sec-16        91.07 (  0.00%)       86.46 ( -5.07%)
> Hmean    mb/sec-32       115.10 (  0.00%)      110.29 ( -4.18%)
> Hmean    mb/sec-64       145.14 (  0.00%)      134.97 ( -7.00%)
> Hmean    mb/sec-512       93.99 (  0.00%)      133.85 ( 42.41%)
>
> There were also some losses in a filebench webproxy workload (I can give
> you exact details of the settings if you want to reproduce it).
>
> Also, and this really puzzles me, I've seen higher read latencies in some
> cases (I've verified they are not just noise by rerunning the test for
> kernel with writeback throttling patches). For example with the following
> fio job file:
>
> [global]
> direct=0
> ioengine=sync
> runtime=300
> time_based
> invalidate=1
> blocksize=4096
> size=10g        # Just random value, we are running time based workload
> log_avg_msec=10
> group_reporting=1
>
> [writer]
> nrfiles=1
> filesize=1g
> fdatasync=256
> readwrite=randwrite
> numjobs=4
>
> [reader]
> # Simulate random reading from different files, switching to different file
> # after 16 ios. This somewhat simulates application startup.
> new_group
> filesize=100m
> nrfiles=20
> file_service_type=random:16
> readwrite=randread
>
> I get the following results:
>
> Throughput			Baseline		WBT
> Hmean    kb/sec-writer-write      591.60 (  0.00%)      507.00 (-14.30%)
> Hmean    kb/sec-reader-read       211.81 (  0.00%)      137.53 (-35.07%)
>
> So both read and write throughput have suffered. And latencies don't offset
> for the loss either:
>
> FIO read latency
> Min         latency-read     1383.00 (  0.00%)     1519.00 ( -9.83%)
> 1st-qrtle   latency-read     3485.00 (  0.00%)     5235.00 (-50.22%)
> 2nd-qrtle   latency-read     4708.00 (  0.00%)    15028.00 (-219.20%)
> 3rd-qrtle   latency-read    10286.00 (  0.00%)    57622.00 (-460.20%)
> Max-90%     latency-read   195834.00 (  0.00%)   167149.00 ( 14.65%)
> Max-93%     latency-read   273145.00 (  0.00%)   200319.00 ( 26.66%)
> Max-95%     latency-read   335434.00 (  0.00%)   220695.00 ( 34.21%)
> Max-99%     latency-read   537017.00 (  0.00%)   347174.00 ( 35.35%)
> Max         latency-read   991101.00 (  0.00%)   485835.00 ( 50.98%)
> Mean        latency-read    51282.79 (  0.00%)    49953.95 (  2.59%)
>
> So we have reduced the extra high read latencies which is nice but on
> average there is no change.
>
> And another fio jobfile which doesn't look great:
>
> [global]
> direct=0
> ioengine=sync
> runtime=300
> blocksize=4096
> invalidate=1
> time_based
> ramp_time=5     # Let the flusher thread start before taking measurements
> log_avg_msec=10
> group_reporting=1
>
> [writer]
> nrfiles=1
> filesize=$((MEMTOTAL_BYTES*2))
> readwrite=randwrite
>
> [reader]
> # Simulate random reading from different files, switching to different file
> # after 16 ios. This somewhat simulates application startup.
> new_group
> filesize=100m
> nrfiles=20
> file_service_type=random:16
> readwrite=randread
>
> The throughput numbers look like:
> Hmean    kb/sec-writer-write    24707.22 (  0.00%)    19912.23 (-19.41%)
> Hmean    kb/sec-reader-read       886.65 (  0.00%)      905.71 (  2.15%)
>
> So we've got significant hit in writes not really offset by a big increase
> in reads. Read latency numbers look like (I show the WBT numbers for two runs
> just so that one can see how variable the latency numbers are because I was
> puzzled by very high max latency for WBT kernels - quartiles seem rather
> stable higher percentiles and min/max are rather variable):
>
> 			   Baseline		WBT			WBT
> Min         latency-read     1230.00 (  0.00%)     1560.00 (-26.83%)	1100.00 ( 10.57%)
> 1st-qrtle   latency-read     3357.00 (  0.00%)     3351.00 (  0.18%)	3351.00 (  0.18%)
> 2nd-qrtle   latency-read     4074.00 (  0.00%)     4056.00 (  0.44%)	4022.00 (  1.28%)
> 3rd-qrtle   latency-read     5198.00 (  0.00%)     5145.00 (  1.02%)	5095.00 (  1.98%)
> Max-90%     latency-read     6594.00 (  0.00%)     6370.00 (  3.40%)	6130.00 (  7.04%)
> Max-93%     latency-read    11251.00 (  0.00%)     9410.00 ( 16.36%)	6654.00 ( 40.86%)
> Max-95%     latency-read    14769.00 (  0.00%)    13231.00 ( 10.41%)	10306.00 ( 30.22%)
> Max-99%     latency-read    27826.00 (  0.00%)    28728.00 ( -3.24%)	25077.00 (  9.88%)
> Max         latency-read    80202.00 (  0.00%)   186491.00 (-132.53%)	141346.00 (-76.24%)
> Mean        latency-read     5356.12 (  0.00%)     5229.00 (  2.37%)	4927.23 (  8.01%)
>
> I have run also other tests but they have mostly shown no significant
> difference.

Thanks Jan, this is great and super useful! I'm revamping certain parts 
of it to deal with write back caching better, and I'll take a look at 
the regressions that you reported.

What kind of SSD is this? I'm assuming it's SATA (QD=32), and then it 
would probably be a safe assumption that it's flagging itself as having 
a volatile write back cache, would that be a correct assumption?

Are you using scsi-mq, or do you have an IO scheduler attached to it?
Jan Kara May 16, 2016, 7:47 a.m. UTC | #12
On Fri 13-05-16 12:29:10, Jens Axboe wrote:
> Thanks Jan, this is great and super useful! I'm revamping certain parts of
> it to deal with write back caching better, and I'll take a look at the
> regressions that you reported.
> 
> What kind of SSD is this? I'm assuming it's SATA (QD=32), and then it would
> probably be a safe assumption that it's flagging itself as having a volatile
> write back cache, would that be a correct assumption?

Yes, it is SATA with writeback cache.

> Are you using scsi-mq, or do you have an IO scheduler attached to it?

The disk was using IO scheduler, however at this point I'm not 100% sure
which scheduler (deadline or cfq) was the default one for the distro that
was installed. The machine is currently testing something else so I cannot
reinstall it and check. Maybe I can rerun some tests later in the week when
the machine gets freed with scsi-mq or deadline IO scheduler so that we
have 100% certain config.

								Honza
diff mbox

Patch

diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..322f5e04e994 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -98,18 +98,23 @@  void __wbt_done(struct rq_wb *rwb)
 	else
 		limit = rwb->wb_normal;
 
+	inflight = atomic_dec_return(&rwb->inflight);
+
 	/*
-	 * Don't wake anyone up if we are above the normal limit. If
-	 * throttling got disabled (limit == 0) with waiters, ensure
-	 * that we wake them up.
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
 	 */
-	inflight = atomic_dec_return(&rwb->inflight);
-	if (limit && inflight >= limit) {
-		if (!rwb->wb_max)
-			wake_up_all(&rwb->wait);
+	if (!rwb_enabled(rwb)) {
+		wake_up_all(&rwb->wait);
 		return;
 	}
 
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight && inflight >= limit)
+		return;
+
 	if (waitqueue_active(&rwb->wait)) {
 		int diff = limit - inflight;
 
@@ -150,14 +155,26 @@  static void calc_wb_limits(struct rq_wb *rwb)
 		return;
 	}
 
-	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
-
 	/*
-	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
+	 * For QD=1 devices, this is a special case. It's important for those
+	 * to have one request ready when one completes, so force a depth of
+	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+	 * since the device can't have more than that in flight.
 	 */
-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
-	rwb->wb_background = (rwb->wb_max + 3) / 4;
+	if (rwb->queue_depth == 1) {
+		rwb->wb_max = rwb->wb_normal = 2;
+		rwb->wb_background = 1;
+	} else {
+		depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
+
+		/*
+		 * Reduce max depth by 50%, and re-calculate normal/bg based on
+		 * that.
+		 */
+		rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
+		rwb->wb_normal = (rwb->wb_max + 1) / 2;
+		rwb->wb_background = (rwb->wb_max + 3) / 4;
+	}
 }
 
 static bool inline stat_sample_valid(struct blk_rq_stat *stat)