Message ID | 20210705162328.28366-3-jack@suse.cz (mailing list archive)
---|---
State | New
Series | writeback: Fix bandwidth estimates
On Mon, 5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>
>Michael Stapelberg has reported that for workload with short big spikes
>of writes (GCC linker seem to trigger this frequently) the write
>throughput is heavily underestimated and tends to steadily sink until it
>reaches zero. This has rather bad impact on writeback throttling
>(causing stalls). The problem is that writeback throughput estimate gets
>updated at most once per 200 ms. One update happens early after we
>submit pages for writeback (at that point writeout of only small
>fraction of pages is completed and thus observed throughput is tiny).
>Next update happens only during the next write spike (updates happen
>only from inode writeback and dirty throttling code) and if that is
>more than 1s after previous spike, we decide system was idle and just
>ignore whatever was written until this moment.
>
>Fix the problem by making sure writeback throughput estimate is also
>updated shortly after writeback completes to get reasonable estimate of
>throughput for spiky workloads.
>
>Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>Signed-off-by: Jan Kara <jack@suse.cz>
>---
> include/linux/backing-dev-defs.h |  1 +
> include/linux/writeback.h        |  1 +
> mm/backing-dev.c                 | 10 ++++++++++
> mm/page-writeback.c              | 32 ++++++++++++++++----------------
> 4 files changed, 28 insertions(+), 16 deletions(-)
>
>diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
>index 148d889f2f7f..57395f7bb192 100644
>--- a/include/linux/backing-dev-defs.h
>+++ b/include/linux/backing-dev-defs.h
>@@ -143,6 +143,7 @@ struct bdi_writeback {
> 	spinlock_t work_lock;		/* protects work_list & dwork scheduling */
> 	struct list_head work_list;
> 	struct delayed_work dwork;	/* work item used for writeback */
>+	struct delayed_work bw_dwork;	/* work item used for bandwidth estimate */
>
> 	unsigned long dirty_sleep;	/* last wait */
>
>diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>index 47cd732e012e..a45e09ed0711 100644
>--- a/include/linux/writeback.h
>+++ b/include/linux/writeback.h
>@@ -379,6 +379,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
> void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
> unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
>
>+void wb_update_bandwidth(struct bdi_writeback *wb);
> void balance_dirty_pages_ratelimited(struct address_space *mapping);
> bool wb_over_bg_thresh(struct bdi_writeback *wb);
>
>diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>index 342394ef1e02..9baa59d68110 100644
>--- a/mm/backing-dev.c
>+++ b/mm/backing-dev.c
>@@ -271,6 +271,14 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
> 	spin_unlock_bh(&wb->work_lock);
> }
>
>+static void wb_update_bandwidth_workfn(struct work_struct *work)
>+{
>+	struct bdi_writeback *wb = container_of(to_delayed_work(work),
>+						struct bdi_writeback, bw_dwork);
>+
>+	wb_update_bandwidth(wb);
>+}
>+
> /*
>  * Initial write bandwidth: 100 MB/s
>  */
>@@ -303,6 +311,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
> 	spin_lock_init(&wb->work_lock);
> 	INIT_LIST_HEAD(&wb->work_list);
> 	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
>+	INIT_DELAYED_WORK(&wb->bw_dwork, wb_update_bandwidth_workfn);
> 	wb->dirty_sleep = jiffies;
>
> 	err = fprop_local_init_percpu(&wb->completions, gfp);
>@@ -351,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
> 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
> 	flush_delayed_work(&wb->dwork);
> 	WARN_ON(!list_empty(&wb->work_list));
>+	flush_delayed_work(&wb->bw_dwork);
> }
>
> static void wb_exit(struct bdi_writeback *wb)
>diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>index 1fecf8ebadb0..6a99ddca95c0 100644
>--- a/mm/page-writeback.c
>+++ b/mm/page-writeback.c
>@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> 	unsigned long dirtied;
> 	unsigned long written;
>
>-	lockdep_assert_held(&wb->list_lock);
>-
>-	/*
>-	 * rate-limit, only update once every 200ms.
>-	 */
>-	if (elapsed < BANDWIDTH_INTERVAL)
>-		return;

Please leave it as it is if you are not dumping the 200ms rule.

>-
>+	spin_lock(&wb->list_lock);
> 	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
> 	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
>
>@@ -1375,15 +1368,14 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> 	wb->dirtied_stamp = dirtied;
> 	wb->written_stamp = written;
> 	wb->bw_time_stamp = now;
>+	spin_unlock(&wb->list_lock);
> }
>
>-static void wb_update_bandwidth(struct bdi_writeback *wb)
>+void wb_update_bandwidth(struct bdi_writeback *wb)
> {
> 	struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
>
>-	spin_lock(&wb->list_lock);
> 	__wb_update_bandwidth(&gdtc, NULL, false);
>-	spin_unlock(&wb->list_lock);
> }
>
> /* Interval after which we consider wb idle and don't estimate bandwidth */
>@@ -1728,11 +1720,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
> 		wb->dirty_exceeded = 1;
>
> 	if (time_is_before_jiffies(wb->bw_time_stamp +
>-				   BANDWIDTH_INTERVAL)) {
>-		spin_lock(&wb->list_lock);
>+				   BANDWIDTH_INTERVAL))
> 		__wb_update_bandwidth(gdtc, mdtc, true);
>-		spin_unlock(&wb->list_lock);
>-	}
>
> 	/* throttle according to the chosen dtc */
> 	dirty_ratelimit = wb->dirty_ratelimit;
>@@ -2371,7 +2360,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> 		cond_resched();
> 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> 	}
>-	wb_update_bandwidth(wb);
>+	/*
>+	 * Usually few pages are written by now from those we've just submitted
>+	 * but if there's constant writeback being submitted, this makes sure
>+	 * writeback bandwidth is updated once in a while.
>+	 */
>+	if (time_is_before_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
>+		wb_update_bandwidth(wb);
> 	return ret;
> }
>
>@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
> static void wb_inode_writeback_end(struct bdi_writeback *wb)
> {
> 	atomic_dec(&wb->writeback_inodes);
>+	/*
>+	 * Make sure estimate of writeback throughput gets
>+	 * updated after writeback completed.
>+	 */
>+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
> }

This is a bogus estimate - it does not break the 200ms rule but walks
around it without specifying why 300ms is not good.
On Wed 07-07-21 15:40:17, Hillf Danton wrote:
> On Mon, 5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
> >
> >Michael Stapelberg has reported that for workload with short big spikes
> >of writes (GCC linker seem to trigger this frequently) the write
> >throughput is heavily underestimated and tends to steadily sink until it
> >reaches zero. This has rather bad impact on writeback throttling
> >(causing stalls). The problem is that writeback throughput estimate gets
> >updated at most once per 200 ms. One update happens early after we
> >submit pages for writeback (at that point writeout of only small
> >fraction of pages is completed and thus observed throughput is tiny).
> >Next update happens only during the next write spike (updates happen
> >only from inode writeback and dirty throttling code) and if that is
> >more than 1s after previous spike, we decide system was idle and just
> >ignore whatever was written until this moment.
> >
> >Fix the problem by making sure writeback throughput estimate is also
> >updated shortly after writeback completes to get reasonable estimate of
> >throughput for spiky workloads.
> >
> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
> >Signed-off-by: Jan Kara <jack@suse.cz>
...
> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >index 1fecf8ebadb0..6a99ddca95c0 100644
> >--- a/mm/page-writeback.c
> >+++ b/mm/page-writeback.c
> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> > 	unsigned long dirtied;
> > 	unsigned long written;
> >
> >-	lockdep_assert_held(&wb->list_lock);
> >-
> >-	/*
> >-	 * rate-limit, only update once every 200ms.
> >-	 */
> >-	if (elapsed < BANDWIDTH_INTERVAL)
> >-		return;
>
> Please leave it as it is if you are not dumping the 200ms rule.

Well, that could break the delayed update scheduled after the end of
writeback and for no good reason. The problematic ordering is like:

end writeback on inode1
  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL

__wb_update_bandwidth() called e.g. from balance_dirty_pages()
  wb->bw_time_stamp = now;

end writeback on inode2
  queue_delayed_work() - does nothing since work is already queued

delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
our estimates.

> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
> > static void wb_inode_writeback_end(struct bdi_writeback *wb)
> > {
> > 	atomic_dec(&wb->writeback_inodes);
> >+	/*
> >+	 * Make sure estimate of writeback throughput gets
> >+	 * updated after writeback completed.
> >+	 */
> >+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
> > }
>
> This is a bogus estimate - it does not break the 200ms rule but walks
> around it without specifying why 300ms is not good.

Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We
do want some batching of bandwidth updates after writeback completes for
the cases where lots of inodes end their writeback in a quick succession.
I've picked BANDWIDTH_INTERVAL here as that's the batching of other
bandwidth updates as well so it kind of makes sense. I'll add a comment why
BANDWIDTH_INTERVAL is picked here.

								Honza
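The ordering Jan describes can be sketched as a toy Python model (all names here are simplified stand-ins for the kernel structures, not kernel code): if the old `elapsed < BANDWIDTH_INTERVAL` early return were kept inside `__wb_update_bandwidth()`, an update from `balance_dirty_pages()` between two inode-writeback completions refreshes the timestamp, so the delayed work queued for inode2 sees a short elapsed time and drops the sample.

```python
BANDWIDTH_INTERVAL = 200  # ms

class WB:
    def __init__(self):
        self.bw_time_stamp = -10_000   # last update long ago
        self.updates_applied = []

    def update_bandwidth(self, now, keep_ratelimit):
        # keep_ratelimit=True models the old in-function 200ms early return
        if keep_ratelimit and now - self.bw_time_stamp < BANDWIDTH_INTERVAL:
            return                     # sample silently dropped
        self.updates_applied.append(now)
        self.bw_time_stamp = now

# t=0:   inode1 writeback ends -> delayed work queued for t=200
# t=150: balance_dirty_pages() updates bandwidth, stamps bw_time_stamp
# t=180: inode2 writeback ends -> work already queued, nothing re-armed
# t=200: delayed work fires, but only 50ms elapsed since the last update
with_rule = WB()
with_rule.update_bandwidth(150, keep_ratelimit=True)
with_rule.update_bandwidth(200, keep_ratelimit=True)
print(with_rule.updates_applied)     # [150] - inode2 is never reflected

without_rule = WB()
without_rule.update_bandwidth(150, keep_ratelimit=False)
without_rule.update_bandwidth(200, keep_ratelimit=False)
print(without_rule.updates_applied)  # [150, 200] - the delayed update lands
```

With the check removed from `__wb_update_bandwidth()` and moved to the callers, the delayed work itself always completes its update.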
On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> On Mon, 5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >
>> >Michael Stapelberg has reported that for workload with short big spikes
>> >of writes (GCC linker seem to trigger this frequently) the write
>> >throughput is heavily underestimated and tends to steadily sink until it
>> >reaches zero. This has rather bad impact on writeback throttling
>> >(causing stalls). The problem is that writeback throughput estimate gets
>> >updated at most once per 200 ms. One update happens early after we
>> >submit pages for writeback (at that point writeout of only small
>> >fraction of pages is completed and thus observed throughput is tiny).
>> >Next update happens only during the next write spike (updates happen
>> >only from inode writeback and dirty throttling code) and if that is
>> >more than 1s after previous spike, we decide system was idle and just
>> >ignore whatever was written until this moment.
>> >
>> >Fix the problem by making sure writeback throughput estimate is also
>> >updated shortly after writeback completes to get reasonable estimate of
>> >throughput for spiky workloads.
>> >
>> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>> >Signed-off-by: Jan Kara <jack@suse.cz>
>...
>> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >--- a/mm/page-writeback.c
>> >+++ b/mm/page-writeback.c
>> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
>> > 	unsigned long dirtied;
>> > 	unsigned long written;
>> >
>> >-	lockdep_assert_held(&wb->list_lock);
>> >-
>> >-	/*
>> >-	 * rate-limit, only update once every 200ms.
>> >-	 */
>> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >-		return;
>>
>> Please leave it as it is if you are not dumping the 200ms rule.
>
>Well, that could break the delayed update scheduled after the end of
>writeback and for no good reason. The problematic ordering is like:

After another look at 2/5, you are cutting the rule, which is worth a
separate patch.

>
>end writeback on inode1
>  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>
>__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>  wb->bw_time_stamp = now;
>
>end writeback on inode2
>  queue_delayed_work() - does nothing since work is already queued
>
>delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>our estimates.

Your example says the estimate based on inode2 is torpedoed by a random
update, and you are looking to make that estimate meaningful at the cost
of breaking the rule - how different is it from the current one if the
estimate is derived from a 20ms-elapsed interval at inode2? Is it likely to
see another palpably different result at inode3 from a 50ms-elapsed
interval?

>
>> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
>> > static void wb_inode_writeback_end(struct bdi_writeback *wb)
>> > {
>> > 	atomic_dec(&wb->writeback_inodes);
>> >+	/*
>> >+	 * Make sure estimate of writeback throughput gets
>> >+	 * updated after writeback completed.
>> >+	 */
>> >+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
>> > }
>>
>> This is a bogus estimate - it does not break the 200ms rule but walks
>> around it without specifying why 300ms is not good.
>
>Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We
>do want some batching of bandwidth updates after writeback completes for
>the cases where lots of inodes end their writeback in a quick succession.
>I've picked BANDWIDTH_INTERVAL here as that's the batching of other
>bandwidth updates as well so it kind of makes sense. I'll add a comment why
>BANDWIDTH_INTERVAL is picked here.
>
>							Honza
>--
>Jan Kara <jack@suse.com>
>SUSE Labs, CR
On Thu 08-07-21 20:17:51, Hillf Danton wrote:
> On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
> >On Wed 07-07-21 15:40:17, Hillf Danton wrote:
> >> On Mon, 5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
> >> >
> >> >Michael Stapelberg has reported that for workload with short big spikes
> >> >of writes (GCC linker seem to trigger this frequently) the write
> >> >throughput is heavily underestimated and tends to steadily sink until it
> >> >reaches zero. This has rather bad impact on writeback throttling
> >> >(causing stalls). The problem is that writeback throughput estimate gets
> >> >updated at most once per 200 ms. One update happens early after we
> >> >submit pages for writeback (at that point writeout of only small
> >> >fraction of pages is completed and thus observed throughput is tiny).
> >> >Next update happens only during the next write spike (updates happen
> >> >only from inode writeback and dirty throttling code) and if that is
> >> >more than 1s after previous spike, we decide system was idle and just
> >> >ignore whatever was written until this moment.
> >> >
> >> >Fix the problem by making sure writeback throughput estimate is also
> >> >updated shortly after writeback completes to get reasonable estimate of
> >> >throughput for spiky workloads.
> >> >
> >> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
> >> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
> >> >Signed-off-by: Jan Kara <jack@suse.cz>
> >...
> >> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> >index 1fecf8ebadb0..6a99ddca95c0 100644
> >> >--- a/mm/page-writeback.c
> >> >+++ b/mm/page-writeback.c
> >> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> >> > 	unsigned long dirtied;
> >> > 	unsigned long written;
> >> >
> >> >-	lockdep_assert_held(&wb->list_lock);
> >> >-
> >> >-	/*
> >> >-	 * rate-limit, only update once every 200ms.
> >> >-	 */
> >> >-	if (elapsed < BANDWIDTH_INTERVAL)
> >> >-		return;
> >>
> >> Please leave it as it is if you are not dumping the 200ms rule.
> >
> >Well, that could break the delayed update scheduled after the end of
> >writeback and for no good reason. The problematic ordering is like:
>
> After another look at 2/5, you are cutting the rule, which is worth a
> separate patch.

The only updates that can break the 200ms rule are the ones added in this
patch. I don't think separating the removal of the 200ms check for that one
case really brings much clarity. It would rather bring "what if" questions
to this patch...

> >end writeback on inode1
> >  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
> >
> >__wb_update_bandwidth() called e.g. from balance_dirty_pages()
> >  wb->bw_time_stamp = now;
> >
> >end writeback on inode2
> >  queue_delayed_work() - does nothing since work is already queued
> >
> >delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
> >< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
> >our estimates.
>
> Your example says the estimate based on inode2 is torpedoed by a random
> update, and you are looking to make that estimate meaningful at the cost
> of breaking the rule - how different is it from the current one if the
> estimate is derived from a 20ms-elapsed interval at inode2? Is it likely to
> see another palpably different result at inode3 from a 50ms-elapsed
> interval?

I'm not sure I understand your question correctly, but updates after an
interval shorter than 200ms should not disturb the estimates much.
wb_update_write_bandwidth() effectively uses the formula:

	bandwidth = (written + bandwidth * (period - elapsed)) / period

where 'period' is 3 seconds. So we compute the average bandwidth over the
last 3 seconds, where the amount written in the 'elapsed' interval is
'written' pages. If 'elapsed' is small, the influence of the current sample
on reducing the estimated bandwidth is going to be small as well.

								Honza
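Jan's point can be spelled out numerically. A minimal sketch of the quoted formula (illustrative arithmetic only, not the kernel's fixed-point math or its additional filtering) shows that a short `elapsed` interval weights the current sample lightly:

```python
# bandwidth = (written + bandwidth * (period - elapsed)) / period
# with period = 3s: the shorter 'elapsed' is, the less the current
# sample can move the running estimate.

PERIOD = 3000  # ms, the averaging window

def update(bandwidth, written, elapsed):
    """One bandwidth-estimate update; units are arbitrary (pages/PERIOD)."""
    return (written + bandwidth * (PERIOD - elapsed)) / PERIOD

bw = 1000.0  # current estimate

# Worst-case samples in which *nothing* was written:
after_20ms = update(bw, written=0, elapsed=20)
after_200ms = update(bw, written=0, elapsed=200)

print(round(after_20ms, 1))   # 993.3 - a 20ms idle sample barely dents it
print(round(after_200ms, 1))  # 933.3 - a 200ms idle sample costs more
```

Even a completely idle 20ms sample reduces the estimate by well under 1%, which is why sub-200ms updates cannot "torpedo" the estimate in the way the question suggests.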
On Thu, 8 Jul 2021 18:43:01 +0200 Jan Kara wrote:
>On Thu 08-07-21 20:17:51, Hillf Danton wrote:
>> On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote:
>> >On Wed 07-07-21 15:40:17, Hillf Danton wrote:
>> >> On Mon, 5 Jul 2021 18:23:17 +0200 Jan Kara wrote:
>> >> >
>> >> >Michael Stapelberg has reported that for workload with short big spikes
>> >> >of writes (GCC linker seem to trigger this frequently) the write
>> >> >throughput is heavily underestimated and tends to steadily sink until it
>> >> >reaches zero. This has rather bad impact on writeback throttling
>> >> >(causing stalls). The problem is that writeback throughput estimate gets
>> >> >updated at most once per 200 ms. One update happens early after we
>> >> >submit pages for writeback (at that point writeout of only small
>> >> >fraction of pages is completed and thus observed throughput is tiny).
>> >> >Next update happens only during the next write spike (updates happen
>> >> >only from inode writeback and dirty throttling code) and if that is
>> >> >more than 1s after previous spike, we decide system was idle and just
>> >> >ignore whatever was written until this moment.
>> >> >
>> >> >Fix the problem by making sure writeback throughput estimate is also
>> >> >updated shortly after writeback completes to get reasonable estimate of
>> >> >throughput for spiky workloads.
>> >> >
>> >> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
>> >> >Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
>> >> >Signed-off-by: Jan Kara <jack@suse.cz>
>> >...
>> >> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> >> >index 1fecf8ebadb0..6a99ddca95c0 100644
>> >> >--- a/mm/page-writeback.c
>> >> >+++ b/mm/page-writeback.c
>> >> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
>> >> > 	unsigned long dirtied;
>> >> > 	unsigned long written;
>> >> >
>> >> >-	lockdep_assert_held(&wb->list_lock);
>> >> >-
>> >> >-	/*
>> >> >-	 * rate-limit, only update once every 200ms.
>> >> >-	 */
>> >> >-	if (elapsed < BANDWIDTH_INTERVAL)
>> >> >-		return;
>> >>
>> >> Please leave it as it is if you are not dumping the 200ms rule.
>> >
>> >Well, that could break the delayed update scheduled after the end of
>> >writeback and for no good reason. The problematic ordering is like:
>>
>> After another look at 2/5, you are cutting the rule, which is worth a
>> separate patch.
>
>The only updates that can break the 200ms rule are the ones added in this
>patch. I don't think separating the removal of the 200ms check for that one
>case really brings much clarity. It would rather bring "what if" questions
>to this patch...
>
>> >end writeback on inode1
>> >  queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL
>> >
>> >__wb_update_bandwidth() called e.g. from balance_dirty_pages()
>> >  wb->bw_time_stamp = now;
>> >
>> >end writeback on inode2
>> >  queue_delayed_work() - does nothing since work is already queued
>> >
>> >delayed work calls __wb_update_bandwidth() - nothing is done since elapsed
>> >< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in
>> >our estimates.
>>
>> Your example says the estimate based on inode2 is torpedoed by a random
>> update, and you are looking to make that estimate meaningful at the cost
>> of breaking the rule - how different is it from the current one if the
>> estimate is derived from a 20ms-elapsed interval at inode2? Is it likely to
>> see another palpably different result at inode3 from a 50ms-elapsed
>> interval?
>
>I'm not sure I understand your question correctly, but updates after an
>interval shorter than 200ms should not disturb the estimates much.
>wb_update_write_bandwidth() effectively uses the formula:
>
>	bandwidth = (written + bandwidth * (period - elapsed)) / period
>
>where 'period' is 3 seconds. So we compute the average bandwidth over the
>last 3 seconds, where the amount written in the 'elapsed' interval is
>'written' pages. If 'elapsed' is small, the influence of the current sample
>on reducing the estimated bandwidth is going to be small as well.

Correct. Without the 3s period that, in combination with the filters there,
helps prevent any sample in either a 20ms or a 200ms interval from
overturning the estimated bandwidth, we would find that what is estimated in
an hour is a mile from what disk vendors claim. And over a one-day interval I
am inclined to ignore the difference between the estimated BW and the
bare-metal BW, on the assumption that no disk will be found idle while dirty
pages go above the background threshold.

Hillf
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 148d889f2f7f..57395f7bb192 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -143,6 +143,7 @@ struct bdi_writeback {
 	spinlock_t work_lock;		/* protects work_list & dwork scheduling */
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
+	struct delayed_work bw_dwork;	/* work item used for bandwidth estimate */

 	unsigned long dirty_sleep;	/* last wait */

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 47cd732e012e..a45e09ed0711 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -379,6 +379,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);

+void wb_update_bandwidth(struct bdi_writeback *wb);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 342394ef1e02..9baa59d68110 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -271,6 +271,14 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
 	spin_unlock_bh(&wb->work_lock);
 }

+static void wb_update_bandwidth_workfn(struct work_struct *work)
+{
+	struct bdi_writeback *wb = container_of(to_delayed_work(work),
+						struct bdi_writeback, bw_dwork);
+
+	wb_update_bandwidth(wb);
+}
+
 /*
  * Initial write bandwidth: 100 MB/s
  */
@@ -303,6 +311,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	spin_lock_init(&wb->work_lock);
 	INIT_LIST_HEAD(&wb->work_list);
 	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+	INIT_DELAYED_WORK(&wb->bw_dwork, wb_update_bandwidth_workfn);
 	wb->dirty_sleep = jiffies;

 	err = fprop_local_init_percpu(&wb->completions, gfp);
@@ -351,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	flush_delayed_work(&wb->bw_dwork);
 }

 static void wb_exit(struct bdi_writeback *wb)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1fecf8ebadb0..6a99ddca95c0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	unsigned long dirtied;
 	unsigned long written;

-	lockdep_assert_held(&wb->list_lock);
-
-	/*
-	 * rate-limit, only update once every 200ms.
-	 */
-	if (elapsed < BANDWIDTH_INTERVAL)
-		return;
-
+	spin_lock(&wb->list_lock);
 	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
 	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);

@@ -1375,15 +1368,14 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
 	wb->dirtied_stamp = dirtied;
 	wb->written_stamp = written;
 	wb->bw_time_stamp = now;
+	spin_unlock(&wb->list_lock);
 }

-static void wb_update_bandwidth(struct bdi_writeback *wb)
+void wb_update_bandwidth(struct bdi_writeback *wb)
 {
 	struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };

-	spin_lock(&wb->list_lock);
 	__wb_update_bandwidth(&gdtc, NULL, false);
-	spin_unlock(&wb->list_lock);
 }

 /* Interval after which we consider wb idle and don't estimate bandwidth */
@@ -1728,11 +1720,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		wb->dirty_exceeded = 1;

 	if (time_is_before_jiffies(wb->bw_time_stamp +
-				   BANDWIDTH_INTERVAL)) {
-		spin_lock(&wb->list_lock);
+				   BANDWIDTH_INTERVAL))
 		__wb_update_bandwidth(gdtc, mdtc, true);
-		spin_unlock(&wb->list_lock);
-	}

 	/* throttle according to the chosen dtc */
 	dirty_ratelimit = wb->dirty_ratelimit;
@@ -2371,7 +2360,13 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		cond_resched();
 		congestion_wait(BLK_RW_ASYNC, HZ/50);
 	}
-	wb_update_bandwidth(wb);
+	/*
+	 * Usually few pages are written by now from those we've just submitted
+	 * but if there's constant writeback being submitted, this makes sure
+	 * writeback bandwidth is updated once in a while.
+	 */
+	if (time_is_before_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
+		wb_update_bandwidth(wb);
 	return ret;
 }

@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_writeback *wb)
 static void wb_inode_writeback_end(struct bdi_writeback *wb)
 {
 	atomic_dec(&wb->writeback_inodes);
+	/*
+	 * Make sure estimate of writeback throughput gets
+	 * updated after writeback completed.
+	 */
+	queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL);
 }

 int test_clear_page_writeback(struct page *page)
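The `queue_delayed_work()` call in `wb_inode_writeback_end()` acts as a debounce: queueing while the work item is already pending is a no-op, so a burst of inodes finishing writeback in quick succession collapses into a single bandwidth update after BANDWIDTH_INTERVAL. A rough Python analogue of that behaviour (using `threading.Timer` as a stand-in for the kernel workqueue; not kernel code):

```python
import threading
import time

class DebouncedUpdate:
    """At most one pending callback; arming while one is pending is a
    no-op, mirroring queue_delayed_work() on an already-queued item."""

    def __init__(self, interval_s, fn):
        self.interval_s = interval_s
        self.fn = fn
        self.timer = None
        self.lock = threading.Lock()

    def arm(self):
        with self.lock:
            if self.timer is not None and self.timer.is_alive():
                return False  # already queued, like queue_delayed_work()
            self.timer = threading.Timer(self.interval_s, self.fn)
            self.timer.start()
            return True

runs = []
d = DebouncedUpdate(0.05, lambda: runs.append("update"))

# Five inodes end writeback in a tight burst:
armed = [d.arm() for _ in range(5)]
print(armed)          # [True, False, False, False, False]

time.sleep(0.2)       # let the delayed callback fire
print(runs)           # ['update'] - one batched bandwidth update
```

This is why picking BANDWIDTH_INTERVAL as the delay gives the batching Jan describes: the first completion arms the update, and every further completion inside the window rides along for free.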
Michael Stapelberg has reported that for workload with short big spikes
of writes (GCC linker seem to trigger this frequently) the write
throughput is heavily underestimated and tends to steadily sink until it
reaches zero. This has rather bad impact on writeback throttling
(causing stalls). The problem is that writeback throughput estimate gets
updated at most once per 200 ms. One update happens early after we
submit pages for writeback (at that point writeout of only small
fraction of pages is completed and thus observed throughput is tiny).
Next update happens only during the next write spike (updates happen
only from inode writeback and dirty throttling code) and if that is
more than 1s after previous spike, we decide system was idle and just
ignore whatever was written until this moment.

Fix the problem by making sure writeback throughput estimate is also
updated shortly after writeback completes to get reasonable estimate of
throughput for spiky workloads.

Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com
Reported-by: Michael Stapelberg <stapelberg+linux@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  1 +
 include/linux/writeback.h        |  1 +
 mm/backing-dev.c                 | 10 ++++++++++
 mm/page-writeback.c              | 32 ++++++++++++++++----------------
 4 files changed, 28 insertions(+), 16 deletions(-)