Message ID | 20160427205915.GC25397@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
On 2016/4/28 4:59, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>> On 04/27/2016 12:01 PM, Jan Kara wrote:
>>>> Hi,
>>>>
>>>> On Tue 26-04-16 09:55:23, Jens Axboe wrote:
>>>>> Since the dawn of time, our background buffered writeback has sucked.
>>>>> When we do background buffered writeback, it should have little impact
>>>>> on foreground activity. That's the definition of background activity...
>>>>> But for as long as I can remember, heavy buffered writers have not
>>>>> behaved like that. For instance, if I do something like this:
>>>>>
>>>>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>>>>
>>>>> on my laptop, and then try and start chrome, it basically won't start
>>>>> before the buffered writeback is done. Or, for server oriented
>>>>> workloads, where installation of a big RPM (or similar) adversely
>>>>> impacts database reads or sync writes. When that happens, I get people
>>>>> yelling at me.
>>>>>
>>>>> I have posted plenty of results previously, I'll keep it shorter
>>>>> this time. Here's a run on my laptop, using read-to-pipe-async for
>>>>> reading a 5g file, and rewriting it. You can find this test program
>>>>> in the fio git repo.
>>>>
>>>> I have tested your patchset on my test system. Generally I have observed
>>>> a noticeable drop in average throughput for heavy background writes
>>>> without any other disk activity and also somewhat increased variance in
>>>> the runtimes. It is most visible on these simple testcases:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>
>>>> and
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>
>>>> The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
>>>> created before each dd run on a dedicated disk.
>>>>
>>>> Without your patches I get pretty stable dd runtimes for both cases:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 87.9611 87.3279 87.2554
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 93.3502 93.2086 93.541
>>>>
>>>> With your patches the numbers look like:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 108.183, 97.184, 99.9587
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 104.9, 102.775, 102.892
>>>>
>>>> I have checked whether the variance is due to some interaction with CFQ
>>>> which is used for the disk. When I switched the disk to deadline, I still
>>>> get some variance although, the throughput is still ~10% lower:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 100.417 100.643 100.866
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 104.208 106.341 105.483
>>>>
>>>> The disk is rotational SATA drive with writeback cache, queue depth of the
>>>> disk reported in /sys/block/sdb/device/queue_depth is 1.
>>>>
>>>> So I think we still need some tweaking on the low end of the storage
>>>> spectrum so that we don't lose 10% of throughput for simple cases like
>>>> this.
>>>
>>> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
>>> you are seeing smaller requests, and that is why it both varies and
>>> you get lower throughput? I'll try and setup a test here similar to
>>> yours.
>>
>> Jan, care to try the below patch? I can't fully reproduce your issue on
>> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
>> a bit of a hack, but the general idea is to allow one more request to
>> build up for QD=1 devices. That eliminates wait time between one request
>> finishing, and the next being submitted.
>
> That accidentally added a potential stall, this one is both cleaner
> and should have that fixed.
>
> diff --git a/lib/wbt.c b/lib/wbt.c
> index 650da911f24f..322f5e04e994 100644
> --- a/lib/wbt.c
> +++ b/lib/wbt.c
> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>  	else
>  		limit = rwb->wb_normal;

Hi Jens,

This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
enough. It is not a big deal anyway :)

Another question about this if branch:

	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
		limit = 0;

I can't follow the logic of this if branch. Why set limit equal to 0
when the device supports write back caches and there are tasks being
limited in balance_dirty_pages()? Could you please give more info
about this? Thanks!

>
> +	inflight = atomic_dec_return(&rwb->inflight);
> +
>  	/*
> -	 * Don't wake anyone up if we are above the normal limit. If
> -	 * throttling got disabled (limit == 0) with waiters, ensure
> -	 * that we wake them up.
> +	 * wbt got disabled with IO in flight. Wake up any potential
> +	 * waiters, we don't have to do more than that.
>  	 */
> -	inflight = atomic_dec_return(&rwb->inflight);
> -	if (limit && inflight >= limit) {
> -		if (!rwb->wb_max)
> -			wake_up_all(&rwb->wait);
> +	if (!rwb_enabled(rwb)) {
> +		wake_up_all(&rwb->wait);
>  		return;
>  	}

Maybe it is better to execute this if branch earlier, so we can wake up
potential waiters in time when wbt got disabled.

>
> +	/*
> +	 * Don't wake anyone up if we are above the normal limit.
> +	 */
> +	if (inflight && inflight >= limit)
> +		return;
> +
>  	if (waitqueue_active(&rwb->wait)) {
>  		int diff = limit - inflight;
>
> @@ -150,14 +155,26 @@ static void calc_wb_limits(struct rq_wb *rwb)
>  		return;
>  	}
>
> -	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> -
>  	/*
> -	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +	 * For QD=1 devices, this is a special case. It's important for those
> +	 * to have one request ready when one completes, so force a depth of
> +	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
> +	 * since the device can't have more than that in flight.
>  	 */
> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	if (rwb->queue_depth == 1) {
> +		rwb->wb_max = rwb->wb_normal = 2;
> +		rwb->wb_background = 1;
> +	} else {
> +		depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> +		/*
> +		 * Reduce max depth by 50%, and re-calculate normal/bg based on
> +		 * that.
> +		 */
> +		rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> +		rwb->wb_normal = (rwb->wb_max + 1) / 2;
> +		rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	}
>  }
>
>  static bool inline stat_sample_valid(struct blk_rq_stat *stat)
>
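A minimal Python sketch of the arithmetic in the calc_wb_limits() hunk above, to make the effect of the QD=1 special case concrete. This is only an illustrative model, not kernel code: the value 64 for RWB_MAX_DEPTH is an assumption (the constant's value is not shown in this thread), and the function name and the `qd1_special` flag are hypothetical.

```python
RWB_MAX_DEPTH = 64  # assumption: the real constant's value is not shown in the thread


def calc_wb_limits(queue_depth, scale_step, qd1_special=True):
    """Return (wb_max, wb_normal, wb_background) as computed in the quoted hunk.

    qd1_special=True models the patched code, False the code it replaces.
    """
    if qd1_special and queue_depth == 1:
        # Patched behaviour: keep one request ready while the other completes.
        return 2, 2, 1

    depth = min(RWB_MAX_DEPTH, queue_depth)
    # Each scale step halves the usable depth; the result never drops below 1.
    wb_max = 1 + ((depth - 1) >> min(31, scale_step))
    wb_normal = (wb_max + 1) // 2
    wb_background = (wb_max + 3) // 4
    return wb_max, wb_normal, wb_background


if __name__ == "__main__":
    for qd, step in [(1, 0), (32, 0), (32, 1)]:
        print(qd, step,
              "old:", calc_wb_limits(qd, step, qd1_special=False),
              "new:", calc_wb_limits(qd, step, qd1_special=True))
```

For a QD=1 device the old formula gives 1/1/1, while the patched branch forces 2/2/1; deeper queues are unaffected by the change.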
On Wed 27-04-16 14:59:15, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
> > Jan, care to try the below patch? I can't fully reproduce your issue on
> > a SCSI disk limited to QD=1, but I have a feeling this might help. It's
> > a bit of a hack, but the general idea is to allow one more request to
> > build up for QD=1 devices. That eliminates wait time between one request
> > finishing, and the next being submitted.
>
> That accidentally added a potential stall, this one is both cleaner
> and should have that fixed.
>
..
> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	if (rwb->queue_depth == 1) {
> +		rwb->wb_max = rwb->wb_normal = 2;
> +		rwb->wb_background = 1;

This breaks the detection of too big scale_step in scale_up() where we key
off wb_max == 1 value. However even with that fixed no luck :(:

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtime: 105.126 107.125 105.641

So about the same as before. I'll try to debug this later today...

								Honza
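To illustrate the point about keying off wb_max == 1, here is a small Python sketch that searches for the first scale_step at which the quoted formula yields wb_max == 1. It only models the arithmetic quoted in this thread; how scale_up() actually uses that value is not shown here, and the helper names, the `qd1_special` flag and the assumed maximum depth of 64 are hypothetical.

```python
def wb_max(queue_depth, scale_step, qd1_special, max_depth=64):
    """wb_max per the quoted hunk; max_depth=64 is an assumed RWB_MAX_DEPTH."""
    if qd1_special and queue_depth == 1:
        return 2  # patched QD=1 branch: independent of scale_step
    depth = min(max_depth, queue_depth)
    return 1 + ((depth - 1) >> min(31, scale_step))


def first_step_with_wb_max_1(queue_depth, qd1_special, max_steps=32):
    """Smallest scale_step at which wb_max == 1, or None if never reached."""
    for step in range(max_steps):
        if wb_max(queue_depth, step, qd1_special) == 1:
            return step
    return None


if __name__ == "__main__":
    print(first_step_with_wb_max_1(32, qd1_special=True))   # deeper queues still reach 1
    print(first_step_with_wb_max_1(1, qd1_special=False))   # old code: already 1 at step 0
    print(first_step_with_wb_max_1(1, qd1_special=True))    # patched QD=1: never 1
```

Under these assumptions a patched QD=1 device never reports wb_max == 1, so any check that relies on that value as a floor marker would not trigger for it.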
On 04/27/2016 10:06 PM, xiakaixu wrote:
>> diff --git a/lib/wbt.c b/lib/wbt.c
>> index 650da911f24f..322f5e04e994 100644
>> --- a/lib/wbt.c
>> +++ b/lib/wbt.c
>> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>>  	else
>>  		limit = rwb->wb_normal;
> Hi Jens,
>
> This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
> enough. It is not a big deal anyway :)

I'll clean that up, thanks for noticing. No functional difference.

> Another question about this if branch:
>
>	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
>		limit = 0;
>
> I can't follow the logic of this if branch. Why set limit equal to 0
> when the device supports write back caches and there are tasks being
> limited in balance_dirty_pages()? Could you please give more info
> about this? Thanks!

Sure. So for write back caching, we have to try a bit harder to ensure
that the device doesn't build up long internal queues with a lot of
dirty data in the cache. So for the case where we have write back
caching AND we don't have anyone waiting for the IO, allow the queue
depth to drain to zero before building it back up again. Does that make
sense?

>>
>> +	inflight = atomic_dec_return(&rwb->inflight);
>> +
>>  	/*
>> -	 * Don't wake anyone up if we are above the normal limit. If
>> -	 * throttling got disabled (limit == 0) with waiters, ensure
>> -	 * that we wake them up.
>> +	 * wbt got disabled with IO in flight. Wake up any potential
>> +	 * waiters, we don't have to do more than that.
>>  	 */
>> -	inflight = atomic_dec_return(&rwb->inflight);
>> -	if (limit && inflight >= limit) {
>> -		if (!rwb->wb_max)
>> -			wake_up_all(&rwb->wait);
>> +	if (!rwb_enabled(rwb)) {
>> +		wake_up_all(&rwb->wait);
>>  		return;
>>  	}
>
> Maybe it is better to execute this if branch earlier, so we can wake up
> potential waiters in time when wbt got disabled.

The !rwb_enabled() case will only happen if someone disabled wbt while we
had tracked IO in flight. We have to do it below the atomic_dec_return(),
so we could reorder that to be at the front. Ideally we just want it
out-of-line instead, as it's the unexpected slower path.
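The "drain to zero" behaviour described above can be sketched in a few lines of Python. This is a simplified model built only from the two fragments quoted in the thread (the `limit = 0` branch and the `if (inflight && inflight >= limit) return;` check); the function names are hypothetical, the surrounding limit selection in __wbt_done() is reduced to a single else branch, and the real code additionally checks waitqueue_active() before waking anyone.

```python
def completion_limit(write_cache: bool, dirty_sleepers: int, wb_normal: int) -> int:
    """Limit consulted when a tracked request completes (simplified)."""
    if write_cache and dirty_sleepers == 0:
        # Write back cache and nobody blocked in balance_dirty_pages():
        # let the queue drain completely before refilling it.
        return 0
    return wb_normal


def may_wake_waiters(inflight_after_completion: int, limit: int) -> bool:
    """Mirror of `if (inflight && inflight >= limit) return;` from the patch."""
    if inflight_after_completion and inflight_after_completion >= limit:
        return False
    return True


if __name__ == "__main__":
    drain = completion_limit(write_cache=True, dirty_sleepers=0, wb_normal=8)
    normal = completion_limit(write_cache=True, dirty_sleepers=2, wb_normal=8)
    for inflight in (3, 0):
        print("limit", drain, "inflight", inflight, may_wake_waiters(inflight, drain))
    for inflight in (8, 7):
        print("limit", normal, "inflight", inflight, may_wake_waiters(inflight, normal))
```

With limit 0, every nonzero inflight count satisfies `inflight >= limit`, so waiters can only be woken once the last tracked request has completed; with the normal limit they can be woken as soon as inflight drops below it.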
On 04/28/2016 05:54 AM, Jan Kara wrote:
> On Wed 27-04-16 14:59:15, Jens Axboe wrote:
> ..
>> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
>> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
>> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
>> +	if (rwb->queue_depth == 1) {
>> +		rwb->wb_max = rwb->wb_normal = 2;
>> +		rwb->wb_background = 1;
>
> This breaks the detection of too big scale_step in scale_up() where we key
> off wb_max == 1 value. However even with that fixed no luck :(:

Yeah, I need to look at that. For QD=1, I think the only sensible values
for max/normal/bg is 2/2/1 and 1/1/1 if we step down.

> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> Runtime: 105.126 107.125 105.641
>
> So about the same as before. I'll try to debug this later today...

Thanks, I'm very interested in what you find!
On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> >>+	if (rwb->queue_depth == 1) {
> >>+		rwb->wb_max = rwb->wb_normal = 2;
> >>+		rwb->wb_background = 1;
> >
> >This breaks the detection of too big scale_step in scale_up() where we key
> >off wb_max == 1 value. However even with that fixed no luck :(:
>
> Yeah, I need to look at that. For QD=1, I think the only sensible values for
> max/normal/bg is 2/2/1 and 1/1/1 if we step down.
>
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> >Runtime: 105.126 107.125 105.641
> >
> >So about the same as before. I'll try to debug this later today...
>
> Thanks, I'm very interested in what you find!

OK, so the reason was relatively standard in the end. I was using ext3 (or
more exactly ext4 without delayed allocation) for the test. The throttling
of background writes gave more priority to writes from the journalling
thread which happen with WRITE_SYNC and thus are not throttled. Thus the
journalling thread ended up having to do more data writeback to be able to
commit a transaction (due to requirements of data=ordered mode) and it is
less efficient at that than the normal flusher thread.

So this is an example where throttling background writeback effectively
just pushes more work into another context which does it less efficiently
and indirectly makes everyone wait for it. ext3 has always been sensitive to
issues like this. ext4 is using delayed allocation and thus only data
writes into holes end up being part of a transaction -> simple dd test case
doesn't hit that path. And indeed when I repeat the same test with ext4,
the numbers with and without your patch are exactly the same.

The question remains how common a pattern where throttling of background
writeback delays also something else is. I'll schedule a couple of
benchmarks to measure impact of your patches for a wider range of workloads
(but sadly pretty limited set of hw). If ext3 is the only one seeing
issues, I would be willing to accept that ext3 takes the hit since it is
doing something rather stupid (but inherent in its journal design) and we
have a way to deal with this either by enabling delayed allocation or by
turning off the writeback throttling...

								Honza
On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

At least in the case of io that we know is going to be data=ordered, we
can bump the prio of those pages?

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue 03-05-16 08:40:11, Chris Mason wrote:
> At least in the case of io that we know is going to be data=ordered, we
> can bump the prio of those pages?

But how would flusher thread, which is submitting IO, know that? We would
have to somehow mark inodes that are part of the running transaction and
flusher thread could give more priority to such writeback - e.g. by using
WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
it could be doable.

								Honza
On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote:
> On Tue 03-05-16 08:40:11, Chris Mason wrote:
> > At least in the case of io that we know is going to be data=ordered, we
> > can bump the prio of those pages?
>
> But how would flusher thread, which is submitting IO, know that? We would
> have to somehow mark inodes that are part of the running transaction and
> flusher thread could give more priority to such writeback - e.g. by using
> WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
> it could be doable.

This would be specific to the data=ordered code in the FS. If there's
some way to test for an inode or a page's status in the data=ordered
list, the FS writepages call could flag the IO as higher prio?

-chris
On Tue 03-05-16 09:42:40, Chris Mason wrote:
> This would be specific to the data=ordered code in the FS. If there's
> some way to test for an inode or a page's status in the data=ordered
> list, the FS writepages call could flag the IO as higher prio?

Oh, right, we could do that. I can experiment with that later.

								Honza
On Tue 03-05-16 14:17:19, Jan Kara wrote:
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

So I've run some benchmarks on a machine with 6 GB of RAM and SSD with
queue depth 32. The filesystem on the disk was XFS this time. I've found
couple of regressions. A clear one is with dbench (version 4). The average
throughput numbers look like:

                          Baseline                WBT
Hmean mb/sec-1         30.26 (  0.00%)      18.67 (-38.28%)
Hmean mb/sec-2         40.71 (  0.00%)      31.25 (-23.23%)
Hmean mb/sec-4         52.67 (  0.00%)      46.83 (-11.09%)
Hmean mb/sec-8         69.51 (  0.00%)      64.35 ( -7.42%)
Hmean mb/sec-16        91.07 (  0.00%)      86.46 ( -5.07%)
Hmean mb/sec-32       115.10 (  0.00%)     110.29 ( -4.18%)
Hmean mb/sec-64       145.14 (  0.00%)     134.97 ( -7.00%)
Hmean mb/sec-512       93.99 (  0.00%)     133.85 ( 42.41%)

There were also some losses in a filebench webproxy workload (I can give
you exact details of the settings if you want to reproduce it).

Also, and this really puzzles me, I've seen higher read latencies in some
cases (I've verified they are not just noise by rerunning the test for
kernel with writeback throttling patches). For example with the following
fio job file:

[global]
direct=0
ioengine=sync
runtime=300
time_based
invalidate=1
blocksize=4096
size=10g # Just random value, we are running time based workload
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=1g
fdatasync=256
readwrite=randwrite
numjobs=4

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

I get the following results:

Throughput                      Baseline                WBT
Hmean kb/sec-writer-write    591.60 (  0.00%)     507.00 (-14.30%)
Hmean kb/sec-reader-read     211.81 (  0.00%)     137.53 (-35.07%)

So both read and write throughput have suffered. And latencies don't offset
for the loss either:

FIO read latency
Min         latency-read      1383.00 (  0.00%)     1519.00 (  -9.83%)
1st-qrtle   latency-read      3485.00 (  0.00%)     5235.00 ( -50.22%)
2nd-qrtle   latency-read      4708.00 (  0.00%)    15028.00 (-219.20%)
3rd-qrtle   latency-read     10286.00 (  0.00%)    57622.00 (-460.20%)
Max-90%     latency-read    195834.00 (  0.00%)   167149.00 (  14.65%)
Max-93%     latency-read    273145.00 (  0.00%)   200319.00 (  26.66%)
Max-95%     latency-read    335434.00 (  0.00%)   220695.00 (  34.21%)
Max-99%     latency-read    537017.00 (  0.00%)   347174.00 (  35.35%)
Max         latency-read    991101.00 (  0.00%)   485835.00 (  50.98%)
Mean        latency-read     51282.79 (  0.00%)    49953.95 (   2.59%)

So we have reduced the extra high read latencies which is nice but on
average there is no change.

And another fio jobfile which doesn't look great:

[global]
direct=0
ioengine=sync
runtime=300
blocksize=4096
invalidate=1
time_based
ramp_time=5 # Let the flusher thread start before taking measurements
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=$((MEMTOTAL_BYTES*2))
readwrite=randwrite

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

The throughput numbers look like:

Hmean kb/sec-writer-write  24707.22 (  0.00%)   19912.23 (-19.41%)
Hmean kb/sec-reader-read     886.65 (  0.00%)     905.71 (  2.15%)

So we've got significant hit in writes not really offset by a big increase
in reads. Read latency numbers look like (I show the WBT numbers for two runs
just so that one can see how variable the latency numbers are because I was
puzzled by very high max latency for WBT kernels - quartiles seem rather
stable higher percentiles and min/max are rather variable):

                                Baseline                 WBT                    WBT
Min         latency-read     1230.00 (  0.00%)     1560.00 ( -26.83%)     1100.00 ( 10.57%)
1st-qrtle   latency-read     3357.00 (  0.00%)     3351.00 (   0.18%)     3351.00 (  0.18%)
2nd-qrtle   latency-read     4074.00 (  0.00%)     4056.00 (   0.44%)     4022.00 (  1.28%)
3rd-qrtle   latency-read     5198.00 (  0.00%)     5145.00 (   1.02%)     5095.00 (  1.98%)
Max-90%     latency-read     6594.00 (  0.00%)     6370.00 (   3.40%)     6130.00 (  7.04%)
Max-93%     latency-read    11251.00 (  0.00%)     9410.00 (  16.36%)     6654.00 ( 40.86%)
Max-95%     latency-read    14769.00 (  0.00%)    13231.00 (  10.41%)    10306.00 ( 30.22%)
Max-99%     latency-read    27826.00 (  0.00%)    28728.00 (  -3.24%)    25077.00 (  9.88%)
Max         latency-read    80202.00 (  0.00%)   186491.00 (-132.53%)   141346.00 (-76.24%)
Mean        latency-read     5356.12 (  0.00%)     5229.00 (   2.37%)     4927.23 (  8.01%)

I have run also other tests but they have mostly shown no significant
difference.

Honza
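For reading the tables above: the bracketed percentages appear to be the change relative to the baseline column, with the sign arranged so that a positive value always means "better than baseline" (more throughput, less latency). That convention is inferred from the numbers themselves, not stated in the mail, and the helper names below are made up. A minimal sketch:

```c
#include <assert.h>

/*
 * Relative change in percent for metrics where higher is better
 * (e.g. the throughput rows). Positive means value beats baseline.
 */
static double pct_gain_higher_better(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}

/*
 * Relative change in percent for metrics where lower is better
 * (e.g. the read latency rows). The sign is flipped so that a
 * positive result still means an improvement over baseline.
 */
static double pct_gain_lower_better(double baseline, double value)
{
	assert(baseline != 0.0);
	return (baseline - value) / baseline * 100.0;
}
```

For instance, the writer throughput row (591.60 vs 507.00) comes out at roughly -14.30 and the 2nd-quartile read latency row (4708.00 vs 15028.00) at roughly -219.20, in line with the table. Rows can differ in the last digit because the table prints rounded values.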
On 05/11/2016 10:36 AM, Jan Kara wrote:
> On Tue 03-05-16 14:17:19, Jan Kara wrote:
>> The question remains how common a pattern where throttling of background
>> writeback delays also something else is. I'll schedule a couple of
>> benchmarks to measure impact of your patches for a wider range of workloads
>> (but sadly pretty limited set of hw). If ext3 is the only one seeing
>> issues, I would be willing to accept that ext3 takes the hit since it is
>> doing something rather stupid (but inherent in its journal design) and we
>> have a way to deal with this either by enabling delayed allocation or by
>> turning off the writeback throttling...
>
> So I've run some benchmarks on a machine with 6 GB of RAM and SSD with
> queue depth 32. The filesystem on the disk was XFS this time. I've found
> couple of regressions. A clear one is with dbench (version 4). The average
> throughput numbers look like:
>
>                           Baseline                WBT
> Hmean mb/sec-1         30.26 (  0.00%)      18.67 (-38.28%)
> Hmean mb/sec-2         40.71 (  0.00%)      31.25 (-23.23%)
> Hmean mb/sec-4         52.67 (  0.00%)      46.83 (-11.09%)
> Hmean mb/sec-8         69.51 (  0.00%)      64.35 ( -7.42%)
> Hmean mb/sec-16        91.07 (  0.00%)      86.46 ( -5.07%)
> Hmean mb/sec-32       115.10 (  0.00%)     110.29 ( -4.18%)
> Hmean mb/sec-64       145.14 (  0.00%)     134.97 ( -7.00%)
> Hmean mb/sec-512       93.99 (  0.00%)     133.85 ( 42.41%)
>
> There were also some losses in a filebench webproxy workload (I can give
> you exact details of the settings if you want to reproduce it).
>
> Also, and this really puzzles me, I've seen higher read latencies in some
> cases (I've verified they are not just noise by rerunning the test for
> kernel with writeback throttling patches). For example with the following
> fio job file:
>
> [global]
> direct=0
> ioengine=sync
> runtime=300
> time_based
> invalidate=1
> blocksize=4096
> size=10g # Just random value, we are running time based workload
> log_avg_msec=10
> group_reporting=1
>
> [writer]
> nrfiles=1
> filesize=1g
> fdatasync=256
> readwrite=randwrite
> numjobs=4
>
> [reader]
> # Simulate random reading from different files, switching to different file
> # after 16 ios. This somewhat simulates application startup.
> new_group
> filesize=100m
> nrfiles=20
> file_service_type=random:16
> readwrite=randread
>
> I get the following results:
>
> Throughput                      Baseline                WBT
> Hmean kb/sec-writer-write    591.60 (  0.00%)     507.00 (-14.30%)
> Hmean kb/sec-reader-read     211.81 (  0.00%)     137.53 (-35.07%)
>
> So both read and write throughput have suffered. And latencies don't offset
> for the loss either:
>
> FIO read latency
> Min         latency-read      1383.00 (  0.00%)     1519.00 (  -9.83%)
> 1st-qrtle   latency-read      3485.00 (  0.00%)     5235.00 ( -50.22%)
> 2nd-qrtle   latency-read      4708.00 (  0.00%)    15028.00 (-219.20%)
> 3rd-qrtle   latency-read     10286.00 (  0.00%)    57622.00 (-460.20%)
> Max-90%     latency-read    195834.00 (  0.00%)   167149.00 (  14.65%)
> Max-93%     latency-read    273145.00 (  0.00%)   200319.00 (  26.66%)
> Max-95%     latency-read    335434.00 (  0.00%)   220695.00 (  34.21%)
> Max-99%     latency-read    537017.00 (  0.00%)   347174.00 (  35.35%)
> Max         latency-read    991101.00 (  0.00%)   485835.00 (  50.98%)
> Mean        latency-read     51282.79 (  0.00%)    49953.95 (   2.59%)
>
> So we have reduced the extra high read latencies which is nice but on
> average there is no change.
>
> And another fio jobfile which doesn't look great:
>
> [global]
> direct=0
> ioengine=sync
> runtime=300
> blocksize=4096
> invalidate=1
> time_based
> ramp_time=5 # Let the flusher thread start before taking measurements
> log_avg_msec=10
> group_reporting=1
>
> [writer]
> nrfiles=1
> filesize=$((MEMTOTAL_BYTES*2))
> readwrite=randwrite
>
> [reader]
> # Simulate random reading from different files, switching to different file
> # after 16 ios. This somewhat simulates application startup.
> new_group
> filesize=100m
> nrfiles=20
> file_service_type=random:16
> readwrite=randread
>
> The throughput numbers look like:
> Hmean kb/sec-writer-write  24707.22 (  0.00%)   19912.23 (-19.41%)
> Hmean kb/sec-reader-read     886.65 (  0.00%)     905.71 (  2.15%)
>
> So we've got significant hit in writes not really offset by a big increase
> in reads. Read latency numbers look like (I show the WBT numbers for two runs
> just so that one can see how variable the latency numbers are because I was
> puzzled by very high max latency for WBT kernels - quartiles seem rather
> stable higher percentiles and min/max are rather variable):
>
>                                 Baseline                 WBT                    WBT
> Min         latency-read     1230.00 (  0.00%)     1560.00 ( -26.83%)     1100.00 ( 10.57%)
> 1st-qrtle   latency-read     3357.00 (  0.00%)     3351.00 (   0.18%)     3351.00 (  0.18%)
> 2nd-qrtle   latency-read     4074.00 (  0.00%)     4056.00 (   0.44%)     4022.00 (  1.28%)
> 3rd-qrtle   latency-read     5198.00 (  0.00%)     5145.00 (   1.02%)     5095.00 (  1.98%)
> Max-90%     latency-read     6594.00 (  0.00%)     6370.00 (   3.40%)     6130.00 (  7.04%)
> Max-93%     latency-read    11251.00 (  0.00%)     9410.00 (  16.36%)     6654.00 ( 40.86%)
> Max-95%     latency-read    14769.00 (  0.00%)    13231.00 (  10.41%)    10306.00 ( 30.22%)
> Max-99%     latency-read    27826.00 (  0.00%)    28728.00 (  -3.24%)    25077.00 (  9.88%)
> Max         latency-read    80202.00 (  0.00%)   186491.00 (-132.53%)   141346.00 (-76.24%)
> Mean        latency-read     5356.12 (  0.00%)     5229.00 (   2.37%)     4927.23 (  8.01%)
>
> I have run also other tests but they have mostly shown no significant
> difference.

Thanks Jan, this is great and super useful! I'm revamping certain parts of
it to deal with write back caching better, and I'll take a look at the
regressions that you reported.

What kind of SSD is this? I'm assuming it's SATA (QD=32), and then it would
probably be a safe assumption that it's flagging itself as having a volatile
write back cache, would that be a correct assumption?

Are you using scsi-mq, or do you have an IO scheduler attached to it?
On Fri 13-05-16 12:29:10, Jens Axboe wrote:
> Thanks Jan, this is great and super useful! I'm revamping certain parts of
> it to deal with write back caching better, and I'll take a look at the
> regressions that you reported.
>
> What kind of SSD is this? I'm assuming it's SATA (QD=32), and then it would
> probably be a safe assumption that it's flagging itself as having a volatile
> write back cache, would that be a correct assumption?

Yes, it is SATA with writeback cache.

> Are you using scsi-mq, or do you have an IO scheduler attached to it?

The disk was using IO scheduler, however at this point I'm not 100% sure
which scheduler (deadline or cfq) was the default one for the distro that
was installed. The machine is currently testing something else so I cannot
reinstall it and check. Maybe I can rerun some tests later in the week when
the machine gets freed with scsi-mq or deadline IO scheduler so that we
have 100% certain config.

Honza
diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..322f5e04e994 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
 	else
 		limit = rwb->wb_normal;
 
+	inflight = atomic_dec_return(&rwb->inflight);
+
 	/*
-	 * Don't wake anyone up if we are above the normal limit. If
-	 * throttling got disabled (limit == 0) with waiters, ensure
-	 * that we wake them up.
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
 	 */
-	inflight = atomic_dec_return(&rwb->inflight);
-	if (limit && inflight >= limit) {
-		if (!rwb->wb_max)
-			wake_up_all(&rwb->wait);
+	if (!rwb_enabled(rwb)) {
+		wake_up_all(&rwb->wait);
 		return;
 	}
 
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight && inflight >= limit)
+		return;
+
 	if (waitqueue_active(&rwb->wait)) {
 		int diff = limit - inflight;
 
@@ -150,14 +155,26 @@ static void calc_wb_limits(struct rq_wb *rwb)
 		return;
 	}
 
-	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
-
 	/*
-	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
+	 * For QD=1 devices, this is a special case. It's important for those
+	 * to have one request ready when one completes, so force a depth of
+	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+	 * since the device can't have more than that in flight.
 	 */
-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
-	rwb->wb_background = (rwb->wb_max + 3) / 4;
+	if (rwb->queue_depth == 1) {
+		rwb->wb_max = rwb->wb_normal = 2;
+		rwb->wb_background = 1;
+	} else {
+		depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
+
+		/*
+		 * Reduce max depth by 50%, and re-calculate normal/bg based on
+		 * that.
+		 */
+		rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
+		rwb->wb_normal = (rwb->wb_max + 1) / 2;
+		rwb->wb_background = (rwb->wb_max + 3) / 4;
+	}
 }
 
 static bool inline stat_sample_valid(struct blk_rq_stat *stat)
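
To see what the calc_wb_limits() hunk above produces, the arithmetic can be lifted into a standalone userspace sketch. This is not the kernel function: `struct wb_limits`, `sketch_calc_limits()` and `min_uint()` are names invented here, `SKETCH_MAX_DEPTH` is a placeholder for `RWB_MAX_DEPTH` (whose value the patch does not show), and the early-return path before the hunk is left out.

```c
#include <assert.h>

/* Placeholder for RWB_MAX_DEPTH; the real value is not visible in the patch. */
#define SKETCH_MAX_DEPTH 64u

struct wb_limits {
	unsigned int wb_max;
	unsigned int wb_normal;
	unsigned int wb_background;
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Same arithmetic as the patched hunk, for a given queue depth and scale step. */
static struct wb_limits sketch_calc_limits(unsigned int queue_depth,
					   unsigned int scale_step)
{
	struct wb_limits l;

	assert(queue_depth >= 1);

	if (queue_depth == 1) {
		/* QD=1 special case: keep one request ready behind the active one. */
		l.wb_max = l.wb_normal = 2;
		l.wb_background = 1;
	} else {
		unsigned int depth = min_uint(SKETCH_MAX_DEPTH, queue_depth);

		/* Each scale step halves the max depth; normal/bg derive from it. */
		l.wb_max = 1 + ((depth - 1) >> min_uint(31u, scale_step));
		l.wb_normal = (l.wb_max + 1) / 2;
		l.wb_background = (l.wb_max + 3) / 4;
	}
	return l;
}
```

With QD=32 this gives max/normal/bg of 32/16/8 at scale_step 0, 16/8/4 at step 1, and 1/1/1 once the step reaches 5. With QD=1 the result stays 2/2/1 for any scale_step, so wb_max never reaches 1, which is the case the scale_up() check discussed earlier in the thread keys off.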