From patchwork Tue Nov 12 03:42:27 2019
X-Patchwork-Submitter: Hillf Danton
X-Patchwork-Id: 11238513
From: Hillf Danton <hdanton@sina.com>
To: linux-mm <linux-mm@kvack.org>
Cc: Andrew Morton, Rong Chen, Fengguang Wu, Tejun Heo, Shakeel Butt,
    Minchan Kim, Mel Gorman, linux-kernel, Hillf Danton
Subject: [RFC v3] writeback: add elastic bdi in cgwb bdp
Date: Tue, 12 Nov 2019 11:42:27 +0800
Message-Id: <20191112034227.3112-1-hdanton@sina.com>

The elastic bdi (ebdi), a mirror of the bdi backing a spinning disk,
SSD or USB key on the market, is introduced for balancing dirty pages
(bdp). The risk arises that the system runs out of free memory when
too many dirty pages are produced too soon, so bdp is needed in the
field.

Ebdi facilitates bdp in elastic time intervals, e.g. from one jiffy to
one HZ, depending on how long it would take to increase the number of
dirty pages by the amount defined by the variable ratelimit_pages.

During cgroup writeback (cgwb) bdp, ebdi helps observe the changes
both in the cgwb's dirty pages (dirty speed) and in its written-out
pages (laundry speed) in elastic time intervals, until a balance is
established between the two parties, i.e. the two speeds become
statistically equal.

This mechanism of elastic equilibrium effectively prevents dirty page
hogs, as dirty pages get no chance to pile up, and thus cuts the risk
that system free memory falls to an unsafe level.

Thanks to Rong Chen for testing.

V3 is based on next-20191108.

Changes since v2
- add code documentation and comments
- adapt balance_dirty_pages_ratelimited() to ebdi

Changes since v1
- drop CGWB_BDP_WITH_EBDI

Changes since v0
- add CGWB_BDP_WITH_EBDI in mm/Kconfig
- drop wakeup in wbc_detach_inode()
- add wakeup in wb_workfn()

Signed-off-by: Hillf Danton <hdanton@sina.com>
---
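To make the elastic interval concrete, the snippet below is a minimal
userspace sketch (not part of the patch) of the same wait/wake pattern:
a dirtier blocks on a condition for at most one second (standing in for
HZ) and is woken earlier whenever the flusher catches up, so the actual
wait lasts anywhere between almost nothing and the full timeout. All
identifiers (flusher_thread, balance_dirty, BATCH, TIMEOUT_SEC) are
invented for illustration and are not kernel symbols.

/*
 * Userspace illustration of the "elastic interval" idea: the dirtier
 * sleeps for at most TIMEOUT_SEC but is woken as soon as the flusher
 * catches up, mirroring cgwb_bdp() waiting on wb->bdp_waitq with a HZ
 * timeout and wb_workfn() doing the wake_up_all().
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BATCH		1024	/* stands in for ratelimit_pages */
#define TIMEOUT_SEC	1	/* stands in for HZ: wait at most one second */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t laundry_done = PTHREAD_COND_INITIALIZER;
static long dirtied, written;

/* flusher: "writes back" a batch, then wakes any throttled dirtier */
static void *flusher_thread(void *arg)
{
	(void)arg;
	for (;;) {
		usleep(10000);		/* pretend the I/O takes a while */
		pthread_mutex_lock(&lock);
		if (written < dirtied)
			written += BATCH;
		pthread_cond_broadcast(&laundry_done);
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* dirtier: throttle itself whenever dirtying runs ahead of writeback */
static void balance_dirty(void)
{
	struct timespec deadline;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += TIMEOUT_SEC;

	pthread_mutex_lock(&lock);
	while (dirtied > written) {
		/* elastic wait: returns early if the flusher catches up */
		if (pthread_cond_timedwait(&laundry_done, &lock, &deadline))
			break;		/* timed out: give up for this round */
	}
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t flusher;
	int i;

	pthread_create(&flusher, NULL, flusher_thread, NULL);
	for (i = 0; i < 100; i++) {
		pthread_mutex_lock(&lock);
		dirtied += BATCH;	/* produce a batch of dirty pages */
		pthread_mutex_unlock(&lock);
		balance_dirty();	/* throttle if we ran ahead */
	}
	printf("dirtied=%ld written=%ld\n", dirtied, written);
	return 0;
}

Build with "gcc -pthread"; the point is only the bounded, early-wakeable
wait, not the bookkeeping.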
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -170,6 +170,8 @@ struct bdi_writeback {
 
 	struct list_head bdi_node;	/* anchored at bdi->wb_list */
 
+	struct wait_queue_head bdp_waitq; /* used for bdp, balancing dirty pages */
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct percpu_ref refcnt;	/* used only for !root wb's */
 	struct fprop_local_percpu memcg_completions;
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -324,6 +324,8 @@ static int wb_init(struct bdi_writeback
 			goto out_destroy_stat;
 	}
 
+	init_waitqueue_head(&wb->bdp_waitq);
+
 	return 0;
 
 out_destroy_stat:
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
 	if (nr_pages == LONG_MAX)
 		return LONG_MAX;
 
+	return nr_pages;
+
 	/*
 	 * This may be called on clean wb's and proportional distribution
 	 * may not make sense, just use the original @nr_pages in those
@@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
 		pages = min(pages, work->nr_pages);
 		pages = round_down(pages + MIN_WRITEBACK_PAGES,
 				   MIN_WRITEBACK_PAGES);
+		pages = work->nr_pages;
 	}
 
 	return pages;
@@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
 		wb_wakeup_delayed(wb);
 
 	current->flags &= ~PF_SWAPWRITE;
+
+	if (waitqueue_active(&wb->bdp_waitq))
+		wake_up_all(&wb->bdp_waitq);
 }
 
 /*
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1830,6 +1830,67 @@ pause:
 		wb_start_background_writeback(wb);
 }
 
+/**
+ * cgwb_bdp_should_throttle - tell whether a wb should be throttled
+ * @wb: bdi_writeback to throttle
+ *
+ * To avoid the risk of exhausting the system's free memory, we check
+ * and try hard to prevent too many dirty pages from being produced
+ * too soon.
+ *
+ * For cgroup writeback, it is essential to keep an equilibrium
+ * between its dirty speed and laundry speed, i.e. in the ideal state
+ * dirty pages are written out as fast as they are produced.
+ */
+static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
+{
+	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
+
+	if (fatal_signal_pending(current))
+		return false;
+
+	gdtc.avail = global_dirtyable_memory();
+
+	domain_dirty_limits(&gdtc);
+
+	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
+		     global_node_page_state(NR_UNSTABLE_NFS) +
+		     global_node_page_state(NR_WRITEBACK);
+
+	if (gdtc.dirty < gdtc.bg_thresh)
+		return false;
+
+	if (!writeback_in_progress(wb))
+		wb_start_background_writeback(wb);
+
+	if (gdtc.dirty < gdtc.thresh)
+		return false;
+
+	/*
+	 * Throttle wb if there is a risk that its dirty speed is running
+	 * away from its laundry speed, with the statistical error taken
+	 * into account.
+	 */
+	return wb_stat(wb, WB_DIRTIED) >
+		wb_stat(wb, WB_WRITTEN) + wb_stat_error();
+}
+
+/**
+ * cgwb_bdp - cgroup writeback tries to balance dirty pages
+ * @wb: bdi_writeback in question
+ *
+ * If no balance exists at the moment, @wb is throttled until it
+ * is established.
+ */
+static inline void cgwb_bdp(struct bdi_writeback *wb)
+{
+	wait_event_interruptible_timeout(wb->bdp_waitq,
+			!cgwb_bdp_should_throttle(wb), HZ);
+}
+
+/* for detecting dirty page hogs */
+static DEFINE_PER_CPU(int, bdp_in_flight);
+
 static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /*
@@ -1866,8 +1927,8 @@ void balance_dirty_pages_ratelimited(str
 	struct inode *inode = mapping->host;
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = NULL;
-	int ratelimit;
-	int *p;
+	bool try_bdp;
+	int *dirty, *leak, *flights;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
@@ -1877,10 +1938,6 @@ void balance_dirty_pages_ratelimited(str
 	if (!wb)
 		wb = &bdi->wb;
 
-	ratelimit = current->nr_dirtied_pause;
-	if (wb->dirty_exceeded)
-		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
-
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1888,29 +1945,38 @@ void balance_dirty_pages_ratelimited(str
 	 * 1000+ tasks, all of them start dirtying pages at exactly the same
 	 * time, hence all honoured too large initial task->nr_dirtied_pause.
 	 */
-	p = this_cpu_ptr(&bdp_ratelimits);
-	if (unlikely(current->nr_dirtied >= ratelimit))
-		*p = 0;
-	else if (unlikely(*p >= ratelimit_pages)) {
-		*p = 0;
-		ratelimit = 0;
-	}
+	dirty = this_cpu_ptr(&bdp_ratelimits);
+
 	/*
 	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
 	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
 	 * the dirty throttling and livelock other long-run dirtiers.
 	 */
-	p = this_cpu_ptr(&dirty_throttle_leaks);
-	if (*p > 0 && current->nr_dirtied < ratelimit) {
-		unsigned long nr_pages_dirtied;
-		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
-		*p -= nr_pages_dirtied;
-		current->nr_dirtied += nr_pages_dirtied;
+	leak = this_cpu_ptr(&dirty_throttle_leaks);
+
+	if (*dirty + *leak < ratelimit_pages) {
+		/*
+		 * Nothing to do, as it would take some more time to
+		 * use up ratelimit_pages.
+		 */
+		try_bdp = false;
+	} else {
+		try_bdp = true;

+		/*
+		 * bdp in flight helps detect dirty page hogs early.
+		 */
+		flights = this_cpu_ptr(&bdp_in_flight);
+
+		if ((*flights)++ & 1) {
+			*dirty = *dirty + *leak - ratelimit_pages;
+			*leak = 0;
+		}
 	}
 	preempt_enable();
 
-	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(wb, current->nr_dirtied);
+	if (try_bdp)
+		cgwb_bdp(wb);
 
 	wb_put(wb);
 }
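
Not part of the patch: a trivial dirty-page hog like the sketch below is
one way to exercise balance_dirty_pages_ratelimited() and the new
throttling path. The file name and sizes are arbitrary choices for
illustration, not the workload used in Rong Chen's testing.

/*
 * Hypothetical dirty-page hog: write a large file without syncing so
 * that balance_dirty_pages_ratelimited() has to throttle the writer.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const size_t chunk = 1 << 20;		/* dirty 1 MiB per write() */
	const long nr_chunks = 4096;		/* 4 GiB of dirty data in total */
	char *buf = malloc(chunk);
	long i;
	int fd;

	if (!buf)
		return 1;
	memset(buf, 0x5a, chunk);

	fd = open("dirty-hog.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* each write dirties page cache; the kernel decides when to block us */
	for (i = 0; i < nr_chunks; i++) {
		if (write(fd, buf, chunk) != (ssize_t)chunk) {
			perror("write");
			break;
		}
	}

	close(fd);
	free(buf);
	return 0;
}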