From patchwork Mon Aug 26 16:06:52 2019
From: Tejun Heo <tj@kernel.org>
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
 vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
 akpm@linux-foundation.org, Tejun Heo <tj@kernel.org>
Subject: [PATCH 1/5] writeback: Generalize and expose wb_completion
Date: Mon, 26 Aug 2019 09:06:52 -0700
Message-Id: <20190826160656.870307-2-tj@kernel.org>
In-Reply-To: <20190826160656.870307-1-tj@kernel.org>
References: <20190826160656.870307-1-tj@kernel.org>

wb_completion is used to track writeback completions.  We want to use it
from the memcg side for foreign inode flushes.  This patch updates it to
remember the target waitq instead of assuming bdi->wb_waitq, and exposes it
outside of fs-writeback.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 47 ++++++++++----------------------
 include/linux/backing-dev-defs.h | 20 ++++++++++++++
 include/linux/backing-dev.h      |  2 ++
 3 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fddd8abd839a..9442f1fd6460 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -36,10 +36,6 @@
  */
 #define MIN_WRITEBACK_PAGES     (4096UL >> (PAGE_SHIFT - 10))
 
-struct wb_completion {
-        atomic_t                cnt;
-};
-
 /*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
@@ -60,19 +56,6 @@ struct wb_writeback_work {
         struct wb_completion *done;     /* set if the caller waits */
 };
 
-/*
- * If one wants to wait for one or more wb_writeback_works, each work's
- * ->done should be set to a wb_completion defined using the following
- * macro.  Once all work items are issued with wb_queue_work(), the caller
- * can wait for the completion of all using wb_wait_for_completion().  Work
- * items which are waited upon aren't freed automatically on completion.
- */
-#define DEFINE_WB_COMPLETION_ONSTACK(cmpl)                              \
-        struct wb_completion cmpl = {                                   \
-                .cnt            = ATOMIC_INIT(1),                       \
-        }
-
-
 /*
  * If an inode is constantly having its pages dirtied, but then the
  * updates stop dirtytime_expire_interval seconds in the past, it's
@@ -182,7 +165,7 @@ static void finish_writeback_work(struct bdi_writeback *wb,
         if (work->auto_free)
                 kfree(work);
         if (done && atomic_dec_and_test(&done->cnt))
-                wake_up_all(&wb->bdi->wb_waitq);
+                wake_up_all(done->waitq);
 }
 
 static void wb_queue_work(struct bdi_writeback *wb,
@@ -206,20 +189,18 @@ static void wb_queue_work(struct bdi_writeback *wb,
 
 /**
  * wb_wait_for_completion - wait for completion of bdi_writeback_works
- * @bdi: bdi work items were issued to
  * @done: target wb_completion
  *
  * Wait for one or more work items issued to @bdi with their ->done field
- * set to @done, which should have been defined with
- * DEFINE_WB_COMPLETION_ONSTACK().  This function returns after all such
- * work items are completed.  Work items which are waited upon aren't freed
+ * set to @done, which should have been initialized with
+ * DEFINE_WB_COMPLETION().  This function returns after all such work items
+ * are completed.  Work items which are waited upon aren't freed
  * automatically on completion.
  */
-static void wb_wait_for_completion(struct backing_dev_info *bdi,
-                                   struct wb_completion *done)
+void wb_wait_for_completion(struct wb_completion *done)
 {
         atomic_dec(&done->cnt);         /* put down the initial count */
-        wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
+        wait_event(*done->waitq, !atomic_read(&done->cnt));
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -854,7 +835,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 restart:
         rcu_read_lock();
         list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
-                DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
+                DEFINE_WB_COMPLETION(fallback_work_done, bdi);
                 struct wb_writeback_work fallback_work;
                 struct wb_writeback_work *work;
                 long nr_pages;
@@ -901,7 +882,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
                         last_wb = wb;
 
                 rcu_read_unlock();
-                wb_wait_for_completion(bdi, &fallback_work_done);
+                wb_wait_for_completion(&fallback_work_done);
                 goto restart;
         }
         rcu_read_unlock();
@@ -2373,7 +2354,8 @@ static void wait_sb_inodes(struct super_block *sb)
 static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
                                      enum wb_reason reason, bool skip_if_busy)
 {
-        DEFINE_WB_COMPLETION_ONSTACK(done);
+        struct backing_dev_info *bdi = sb->s_bdi;
+        DEFINE_WB_COMPLETION(done, bdi);
         struct wb_writeback_work work = {
                 .sb                     = sb,
                 .sync_mode              = WB_SYNC_NONE,
@@ -2382,14 +2364,13 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
                 .nr_pages               = nr,
                 .reason                 = reason,
         };
-        struct backing_dev_info *bdi = sb->s_bdi;
 
         if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
                 return;
         WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
         bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
-        wb_wait_for_completion(bdi, &done);
+        wb_wait_for_completion(&done);
 }
 
 /**
@@ -2451,7 +2432,8 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
  */
 void sync_inodes_sb(struct super_block *sb)
 {
-        DEFINE_WB_COMPLETION_ONSTACK(done);
+        struct backing_dev_info *bdi = sb->s_bdi;
+        DEFINE_WB_COMPLETION(done, bdi);
         struct wb_writeback_work work = {
                 .sb             = sb,
                 .sync_mode      = WB_SYNC_ALL,
@@ -2461,7 +2443,6 @@ void sync_inodes_sb(struct super_block *sb)
                 .reason         = WB_REASON_SYNC,
                 .for_sync       = 1,
         };
-        struct backing_dev_info *bdi = sb->s_bdi;
 
         /*
          * Can't skip on !bdi_has_dirty() because we should wait for !dirty
@@ -2475,7 +2456,7 @@ void sync_inodes_sb(struct super_block *sb)
         /* protect against inode wb switch, see inode_switch_wbs_work_fn() */
         bdi_down_write_wb_switch_rwsem(bdi);
         bdi_split_work_to_wbs(bdi, &work, false);
-        wb_wait_for_completion(bdi, &done);
+        wb_wait_for_completion(&done);
         bdi_up_write_wb_switch_rwsem(bdi);
 
         wait_sb_inodes(sb);
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 6a1a8a314d85..8fb740178d5d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -67,6 +67,26 @@ enum wb_reason {
         WB_REASON_MAX,
 };
 
+struct wb_completion {
+        atomic_t                cnt;
+        wait_queue_head_t       *waitq;
+};
+
+#define __WB_COMPLETION_INIT(_waitq)    \
+        (struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) }
+
+/*
+ * If one wants to wait for one or more wb_writeback_works, each work's
+ * ->done should be set to a wb_completion defined using the following
+ * macro.  Once all work items are issued with wb_queue_work(), the caller
+ * can wait for the completion of all using wb_wait_for_completion().  Work
+ * items which are waited upon aren't freed automatically on completion.
+ */
+#define WB_COMPLETION_INIT(bdi)         __WB_COMPLETION_INIT(&(bdi)->wb_waitq)
+
+#define DEFINE_WB_COMPLETION(cmpl, bdi) \
+        struct wb_completion cmpl = WB_COMPLETION_INIT(bdi)
+
 /*
  * For cgroup writeback, multiple wb's may map to the same blkcg.  Those
  * wb's can operate mostly independently but should share the congested
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b31d176f74..02650b1253a2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -44,6 +44,8 @@ void wb_start_background_writeback(struct bdi_writeback *wb);
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
 
+void wb_wait_for_completion(struct wb_completion *done);
+
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
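
For illustration, a minimal sketch of the waiting pattern after this
patch, modeled on bdi_split_work_to_wbs() above.  It assumes
fs-writeback.c context (wb_queue_work() is file-local), and
flush_one_wb() is a hypothetical helper, not part of the patch:

  static void flush_one_wb(struct bdi_writeback *wb)
  {
          struct backing_dev_info *bdi = wb->bdi;
          /* cnt = 1, waitq = &bdi->wb_waitq */
          DEFINE_WB_COMPLETION(done, bdi);
          struct wb_writeback_work work = {
                  .sync_mode      = WB_SYNC_NONE,
                  .nr_pages       = LONG_MAX,
                  .reason         = WB_REASON_SYNC,
                  .done           = &done,
          };

          /* wb_queue_work() takes an extra ref on done.cnt */
          wb_queue_work(wb, &work);
          /* drops the initial ref and sleeps on *done.waitq */
          wb_wait_for_completion(&done);
  }

Since the completion now carries its own waitq, the waiter no longer
needs the bdi pointer; that is what lets patch 5 embed completions in
struct mem_cgroup with a global waitq.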
From patchwork Mon Aug 26 16:06:53 2019
From: Tejun Heo <tj@kernel.org>
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
 vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
 akpm@linux-foundation.org, Tejun Heo <tj@kernel.org>
Subject: [PATCH 2/5] bdi: Add bdi->id
Date: Mon, 26 Aug 2019 09:06:53 -0700
Message-Id: <20190826160656.870307-3-tj@kernel.org>
In-Reply-To: <20190826160656.870307-1-tj@kernel.org>
References: <20190826160656.870307-1-tj@kernel.org>

There currently is no way to universally identify and look up a bdi
without holding a reference and pointer to it.  This patch adds a
non-recycling bdi->id and implements bdi_get_by_id() which looks up bdis
by their ids.  This will be used by memcg foreign inode flushing.

I left bdi_list alone for simplicity and because, while an rb_tree does
support rcu assignment, it doesn't seem to guarantee a lossless walk when
the walk races against tree rebalance operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  2 +
 include/linux/backing-dev.h      |  1 +
 mm/backing-dev.c                 | 65 +++++++++++++++++++++++++++++++-
 3 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb740178d5d..1075f2552cfc 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -185,6 +185,8 @@ struct bdi_writeback {
 };
 
 struct backing_dev_info {
+        u64 id;
+        struct rb_node rb_node; /* keyed by ->id */
         struct list_head bdi_list;
         unsigned long ra_pages; /* max readahead in PAGE_SIZE units */
         unsigned long io_pages; /* max allowed IO size */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 02650b1253a2..84cdcfbc763f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -24,6 +24,7 @@ static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
         return bdi;
 }
 
+struct backing_dev_info *bdi_get_by_id(u64 id);
 void bdi_put(struct backing_dev_info *bdi);
 
 __printf(2, 3)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e8e89158adec..612aa7c5ddbd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <linux/wait.h>
+#include <linux/rbtree.h>
 #include <linux/backing-dev.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 static struct class *bdi_class;
 
 /*
- * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side
- * locking.
+ * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU
+ * reader side locking.
  */
 DEFINE_SPINLOCK(bdi_lock);
+static u64 bdi_id_cursor;
+static struct rb_root bdi_tree = RB_ROOT;
 LIST_HEAD(bdi_list);
 
 /* bdi_wq serves all asynchronous writeback tasks */
@@ -859,9 +862,58 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
 }
 EXPORT_SYMBOL(bdi_alloc_node);
 
+static struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp)
+{
+        struct rb_node **p = &bdi_tree.rb_node;
+        struct rb_node *parent = NULL;
+        struct backing_dev_info *bdi;
+
+        lockdep_assert_held(&bdi_lock);
+
+        while (*p) {
+                parent = *p;
+                bdi = rb_entry(parent, struct backing_dev_info, rb_node);
+
+                if (bdi->id > id)
+                        p = &(*p)->rb_left;
+                else if (bdi->id < id)
+                        p = &(*p)->rb_right;
+                else
+                        break;
+        }
+
+        if (parentp)
+                *parentp = parent;
+        return p;
+}
+
+/**
+ * bdi_get_by_id - lookup and get bdi from its id
+ * @id: bdi id to lookup
+ *
+ * Find bdi matching @id and get it.
+ * Returns NULL if the matching bdi
+ * doesn't exist or is already unregistered.
+ */
+struct backing_dev_info *bdi_get_by_id(u64 id)
+{
+        struct backing_dev_info *bdi = NULL;
+        struct rb_node **p;
+
+        spin_lock_bh(&bdi_lock);
+        p = bdi_lookup_rb_node(id, NULL);
+        if (*p) {
+                bdi = rb_entry(*p, struct backing_dev_info, rb_node);
+                bdi_get(bdi);
+        }
+        spin_unlock_bh(&bdi_lock);
+
+        return bdi;
+}
+
 int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 {
         struct device *dev;
+        struct rb_node *parent, **p;
 
         if (bdi->dev)   /* The driver needs to use separate queues per device */
                 return 0;
@@ -877,7 +929,15 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
         set_bit(WB_registered, &bdi->wb.state);
         spin_lock_bh(&bdi_lock);
+
+        bdi->id = ++bdi_id_cursor;
+
+        p = bdi_lookup_rb_node(bdi->id, &parent);
+        rb_link_node(&bdi->rb_node, parent, p);
+        rb_insert_color(&bdi->rb_node, &bdi_tree);
+
         list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+
         spin_unlock_bh(&bdi_lock);
 
         trace_writeback_bdi_register(bdi);
@@ -918,6 +978,7 @@ EXPORT_SYMBOL(bdi_register_owner);
 static void bdi_remove_from_list(struct backing_dev_info *bdi)
 {
         spin_lock_bh(&bdi_lock);
+        rb_erase(&bdi->rb_node, &bdi_tree);
         list_del_rcu(&bdi->bdi_list);
         spin_unlock_bh(&bdi_lock);
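
As a usage sketch (hypothetical caller, kernel context assumed; not part
of the patch), the id allows referring to a bdi without holding a pointer
to it across its lifetime:

  static void report_bdi(u64 bdi_id)
  {
          struct backing_dev_info *bdi;

          /* takes a reference via bdi_get(), or returns NULL */
          bdi = bdi_get_by_id(bdi_id);
          if (!bdi)
                  return; /* never registered or already unregistered */

          pr_info("bdi %llu: %s dirty io\n",
                  (unsigned long long)bdi->id,
                  bdi_has_dirty_io(bdi) ? "has" : "no");

          bdi_put(bdi);   /* drop the lookup reference */
  }

Because bdi_id_cursor only ever increments, an id is never reused, so a
stale id held somewhere simply fails the lookup instead of resolving to
the wrong device.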
From patchwork Mon Aug 26 16:06:54 2019
From: Tejun Heo <tj@kernel.org>
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
 vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
 akpm@linux-foundation.org, Tejun Heo <tj@kernel.org>
Subject: [PATCH 3/5] writeback: Separate out wb_get_lookup() from wb_get_create()
Date: Mon, 26 Aug 2019 09:06:54 -0700
Message-Id: <20190826160656.870307-4-tj@kernel.org>
In-Reply-To: <20190826160656.870307-1-tj@kernel.org>
References: <20190826160656.870307-1-tj@kernel.org>

Separate out wb_get_lookup(), which only looks up an existing wb and
doesn't try to create one, from wb_get_create().  This will be used by
later patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev.h |  2 ++
 mm/backing-dev.c            | 55 +++++++++++++++++++++++++------------
 2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 84cdcfbc763f..97967ce06de3 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -230,6 +230,8 @@ static inline int bdi_sched_wait(void *word)
 struct bdi_writeback_congested *
 wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp);
 void wb_congested_put(struct bdi_writeback_congested *congested);
+struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
+                                    struct cgroup_subsys_state *memcg_css);
 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
                                     struct cgroup_subsys_state *memcg_css,
                                     gfp_t gfp);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 612aa7c5ddbd..d9daa3e422d0 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -618,13 +618,12 @@ static int cgwb_create(struct backing_dev_info *bdi,
 }
 
 /**
- * wb_get_create - get wb for a given memcg, create if necessary
+ * wb_get_lookup - get wb for a given memcg
  * @bdi: target bdi
  * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
- * @gfp: allocation mask to use
  *
- * Try to get the wb for @memcg_css on @bdi.  If it doesn't exist, try to
- * create one.  The returned wb has its refcount incremented.
+ * Try to get the wb for @memcg_css on @bdi.  The returned wb has its
+ * refcount incremented.
  *
  * This function uses css_get() on @memcg_css and thus expects its refcnt
  * to be positive on invocation.  IOW, rcu_read_lock() protection on
@@ -641,6 +640,39 @@
  * each lookup.  On mismatch, the existing wb is discarded and a new one is
  * created.
  */
+struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
+                                    struct cgroup_subsys_state *memcg_css)
+{
+        struct bdi_writeback *wb;
+
+        if (!memcg_css->parent)
+                return &bdi->wb;
+
+        rcu_read_lock();
+        wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+        if (wb) {
+                struct cgroup_subsys_state *blkcg_css;
+
+                /* see whether the blkcg association has changed */
+                blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys);
+                if (unlikely(wb->blkcg_css != blkcg_css || !wb_tryget(wb)))
+                        wb = NULL;
+                css_put(blkcg_css);
+        }
+        rcu_read_unlock();
+
+        return wb;
+}
+
+/**
+ * wb_get_create - get wb for a given memcg, create if necessary
+ * @bdi: target bdi
+ * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
+ * @gfp: allocation mask to use
+ *
+ * Try to get the wb for @memcg_css on @bdi.  If it doesn't exist, try to
+ * create one.  See wb_get_lookup() for more details.
+ */
 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
                                     struct cgroup_subsys_state *memcg_css,
                                     gfp_t gfp)
@@ -653,20 +685,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
                 return &bdi->wb;
 
         do {
-                rcu_read_lock();
-                wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
-                if (wb) {
-                        struct cgroup_subsys_state *blkcg_css;
-
-                        /* see whether the blkcg association has changed */
-                        blkcg_css = cgroup_get_e_css(memcg_css->cgroup,
-                                                     &io_cgrp_subsys);
-                        if (unlikely(wb->blkcg_css != blkcg_css ||
-                                     !wb_tryget(wb)))
-                                wb = NULL;
-                        css_put(blkcg_css);
-                }
-                rcu_read_unlock();
+                wb = wb_get_lookup(bdi, memcg_css);
         } while (!wb && !cgwb_create(bdi, memcg_css, gfp));
 
         return wb;
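
A condensed sketch of the resulting split (kernel context assumed;
cgwb_create() is local to mm/backing-dev.c and get_wb() is a hypothetical
wrapper, not part of the patch):

  static struct bdi_writeback *
  get_wb(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css,
         gfp_t gfp, bool may_create)
  {
          struct bdi_writeback *wb;

          if (!may_create)
                  /* pure lookup: ref'd wb or NULL, never allocates */
                  return wb_get_lookup(bdi, memcg_css);

          /* wb_get_create(): retry the lookup until creation
           * succeeds or fails for good */
          do {
                  wb = wb_get_lookup(bdi, memcg_css);
          } while (!wb && !cgwb_create(bdi, memcg_css, gfp));

          return wb;      /* caller puts it with wb_put() */
  }

The lookup-only path matters for the next patch: a flusher asking "is
there anything to flush for this (bdi, memcg) pair?" must not allocate a
wb as a side effect.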
From patchwork Mon Aug 26 16:06:55 2019
From: Tejun Heo <tj@kernel.org>
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
 vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
 akpm@linux-foundation.org, Tejun Heo <tj@kernel.org>
Subject: [PATCH 4/5] writeback, memcg: Implement cgroup_writeback_by_id()
Date: Mon, 26 Aug 2019 09:06:55 -0700
Message-Id: <20190826160656.870307-5-tj@kernel.org>
In-Reply-To: <20190826160656.870307-1-tj@kernel.org>
References: <20190826160656.870307-1-tj@kernel.org>

Implement cgroup_writeback_by_id() which initiates cgroup writeback from
bdi and memcg IDs.  This will be used by memcg foreign inode flushing.

v2: Use wb_get_lookup() instead of wb_get_create() to avoid creating
    spurious wbs.

v3: Interpret 0 @nr as 1.25 * nr_dirty to implement best-effort flushing
    while avoiding possible livelocks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c         | 83 +++++++++++++++++++++++++++++++++++++++
 include/linux/writeback.h |  2 +
 2 files changed, 85 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9442f1fd6460..658dc16c9e6d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -891,6 +891,89 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
                 wb_put(last_wb);
 }
 
+/**
+ * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs
+ * @bdi_id: target bdi id
+ * @memcg_id: target memcg css id
+ * @nr: number of pages to write, 0 for best-effort dirty flushing
+ * @reason: reason why some writeback work is initiated
+ * @done: target wb_completion
+ *
+ * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id
+ * with the specified parameters.
+ */
+int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr,
+                           enum wb_reason reason, struct wb_completion *done)
+{
+        struct backing_dev_info *bdi;
+        struct cgroup_subsys_state *memcg_css;
+        struct bdi_writeback *wb;
+        struct wb_writeback_work *work;
+        int ret;
+
+        /* lookup bdi and memcg */
+        bdi = bdi_get_by_id(bdi_id);
+        if (!bdi)
+                return -ENOENT;
+
+        rcu_read_lock();
+        memcg_css = css_from_id(memcg_id, &memory_cgrp_subsys);
+        if (memcg_css && !css_tryget(memcg_css))
+                memcg_css = NULL;
+        rcu_read_unlock();
+        if (!memcg_css) {
+                ret = -ENOENT;
+                goto out_bdi_put;
+        }
+
+        /*
+         * And find the associated wb.  If the wb isn't there already
+         * there's nothing to flush, don't create one.
+         */
+        wb = wb_get_lookup(bdi, memcg_css);
+        if (!wb) {
+                ret = -ENOENT;
+                goto out_css_put;
+        }
+
+        /*
+         * If @nr is zero, the caller is attempting to write out most of
+         * the currently dirty pages.  Let's take the current dirty page
+         * count and inflate it by 25% which should be large enough to
+         * flush out most dirty pages while avoiding getting livelocked by
+         * concurrent dirtiers.
+         */
+        if (!nr) {
+                unsigned long filepages, headroom, dirty, writeback;
+
+                mem_cgroup_wb_stats(wb, &filepages, &headroom, &dirty,
+                                    &writeback);
+                nr = dirty * 10 / 8;
+        }
+
+        /* issue the writeback work */
+        work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
+        if (work) {
+                work->nr_pages = nr;
+                work->sync_mode = WB_SYNC_NONE;
+                work->range_cyclic = 1;
+                work->reason = reason;
+                work->done = done;
+                work->auto_free = 1;
+                wb_queue_work(wb, work);
+                ret = 0;
+        } else {
+                ret = -ENOMEM;
+        }
+
+        wb_put(wb);
+out_css_put:
+        css_put(memcg_css);
+out_bdi_put:
+        bdi_put(bdi);
+        return ret;
+}
+
 /**
  * cgroup_writeback_umount - flush inode wb switches for umount
  *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8945aac31392..a19d845dd7eb 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -217,6 +217,8 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 void wbc_detach_inode(struct writeback_control *wbc);
 void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
                               size_t bytes);
+int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
+                           enum wb_reason reason, struct wb_completion *done);
 void cgroup_writeback_umount(void);
 
 /**
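
A minimal caller sketch (kernel context assumed; flush_foreign_pair() and
its waitq are hypothetical, and WB_REASON_FOREIGN_FLUSH is only added by
the next patch):

  static DECLARE_WAIT_QUEUE_HEAD(flush_waitq);

  static int flush_foreign_pair(u64 bdi_id, int memcg_id)
  {
          struct wb_completion done = __WB_COMPLETION_INIT(&flush_waitq);
          int ret;

          /* nr == 0: write ~1.25x the wb's current dirty count */
          ret = cgroup_writeback_by_id(bdi_id, memcg_id, 0,
                                       WB_REASON_FOREIGN_FLUSH, &done);
          if (!ret)
                  wb_wait_for_completion(&done);
          return ret;
  }

Note that the interface is ID-based end to end: if either the bdi or the
memcg is gone by the time the flush is requested, the lookups simply
return -ENOENT instead of chasing stale pointers.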
From patchwork Mon Aug 26 16:06:56 2019
From: Tejun Heo <tj@kernel.org>
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
 vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
 akpm@linux-foundation.org, Tejun Heo <tj@kernel.org>
Subject: [PATCH 5/5] writeback, memcg: Implement foreign dirty flushing
Date: Mon, 26 Aug 2019 09:06:56 -0700
Message-Id: <20190826160656.870307-6-tj@kernel.org>
In-Reply-To: <20190826160656.870307-1-tj@kernel.org>
References: <20190826160656.870307-1-tj@kernel.org>

There's an inherent mismatch between memcg and writeback.  The former
tracks ownership per-page while the latter per-inode.  This was a
deliberate design decision because honoring per-page ownership in the
writeback path is complicated, may lead to higher CPU and IO overheads,
and was deemed unnecessary given that write-sharing an inode across
different cgroups isn't a common use-case.

Combined with inode majority-writer ownership switching, this works well
enough in most cases but there are some pathological cases.  For example,
let's say there are two cgroups A and B which keep writing to different
but confined parts of the same inode.  B owns the inode and A's memory is
limited far below B's.  A's dirty ratio can rise enough to trigger
balance_dirty_pages() sleeps but B's can be low enough to avoid
triggering background writeback.  A will be slowed down without a way to
make writeback of the dirty pages happen.

This patch implements foreign dirty recording and a foreign flush
mechanism so that when a memcg encounters a condition as above it can
trigger flushes on bdi_writebacks which can clean its pages.  Please see
the comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
details.

A reproducer follows.  (The reproducer's include directives were lost in
archiving; the standard headers its calls require are restored here.)

write-range.c::

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <errno.h>

  static const char *usage = "write-range FILE START SIZE\n";

  int main(int argc, char **argv)
  {
          int fd;
          unsigned long start, size, end, pos;
          char *endp;
          char buf[4096];

          if (argc < 4) {
                  fprintf(stderr, usage);
                  return 1;
          }

          fd = open(argv[1], O_WRONLY);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          start = strtoul(argv[2], &endp, 0);
          if (*endp != '\0') {
                  fprintf(stderr, usage);
                  return 1;
          }

          size = strtoul(argv[3], &endp, 0);
          if (*endp != '\0') {
                  fprintf(stderr, usage);
                  return 1;
          }

          end = start + size;

          while (1) {
                  for (pos = start; pos < end; ) {
                          long bread, bwritten = 0;

                          if (lseek(fd, pos, SEEK_SET) < 0) {
                                  perror("lseek");
                                  return 1;
                          }

                          bread = read(0, buf, sizeof(buf) < end - pos ?
                                               sizeof(buf) : end - pos);
                          if (bread < 0) {
                                  perror("read");
                                  return 1;
                          }
                          if (bread == 0)
                                  return 0;

                          while (bwritten < bread) {
                                  long this;

                                  this = write(fd, buf + bwritten,
                                               bread - bwritten);
                                  if (this < 0) {
                                          perror("write");
                                          return 1;
                                  }

                                  bwritten += this;
                                  pos += bwritten;
                          }
                  }
          }
  }

repro.sh::

  #!/bin/bash

  set -e
  set -x

  sysctl -w vm.dirty_expire_centisecs=300000
  sysctl -w vm.dirty_writeback_centisecs=300000
  sysctl -w vm.dirtytime_expire_seconds=300000

  echo 3 > /proc/sys/vm/drop_caches

  TEST=/sys/fs/cgroup/test
  A=$TEST/A
  B=$TEST/B

  mkdir -p $A $B
  echo "+memory +io" > $TEST/cgroup.subtree_control
  echo $((1<<30)) > $A/memory.high
  echo $((32<<30)) > $B/memory.high

  rm -f testfile
  touch testfile
  fallocate -l 4G testfile

  echo "Starting B"

  (echo $BASHPID > $B/cgroup.procs
   pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &

  echo "Waiting 10s to ensure B claims the testfile inode"
  sleep 5
  sync
  sleep 5
  sync

  echo "Starting A"

  (echo $BASHPID > $A/cgroup.procs
   pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))

v2: Added comments explaining why the specific intervals are being used.

v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
    flushing while avoiding possible livelocks.

v4: Use get_jiffies_64() and time_before/after64() instead of raw
    jiffies_64 and arithmetic comparisons as suggested by Jan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |   1 +
 include/linux/memcontrol.h       |  39 +++++++++
 mm/memcontrol.c                  | 134 +++++++++++++++++++++++++++++++
 mm/page-writeback.c              |   4 +
 4 files changed, 178 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1075f2552cfc..4fc87dee005a 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -63,6 +63,7 @@ enum wb_reason {
          * so it has a mismatch name.
          */
         WB_REASON_FORKER_THREAD,
+        WB_REASON_FOREIGN_FLUSH,
 
         WB_REASON_MAX,
 };
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2cd4359cb38c..ad8f1a397ae4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -183,6 +183,23 @@ struct memcg_padding {
 #define MEMCG_PADDING(name)
 #endif
 
+/*
+ * Remember four most recent foreign writebacks with dirty pages in this
+ * cgroup.  Inode sharing is expected to be uncommon and, even if we miss
+ * one in a given round, we're likely to catch it later if it keeps
+ * foreign-dirtying, so a fairly low count should be enough.
+ *
+ * See mem_cgroup_track_foreign_dirty_slowpath() for details.
+ */
+#define MEMCG_CGWB_FRN_CNT      4
+
+struct memcg_cgwb_frn {
+        u64 bdi_id;                     /* bdi->id of the foreign inode */
+        int memcg_id;                   /* memcg->css.id of foreign inode */
+        u64 at;                         /* jiffies_64 at the time of dirtying */
+        struct wb_completion done;      /* tracks in-flight foreign writebacks */
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -307,6 +324,7 @@ struct mem_cgroup {
 #ifdef CONFIG_CGROUP_WRITEBACK
         struct list_head cgwb_list;
         struct wb_domain cgwb_domain;
+        struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT];
 #endif
 
         /* List of events which userspace want to receive */
@@ -1237,6 +1255,18 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
                          unsigned long *pheadroom, unsigned long *pdirty,
                          unsigned long *pwriteback);
 
+void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
+                                             struct bdi_writeback *wb);
+
+static inline void mem_cgroup_track_foreign_dirty(struct page *page,
+                                                  struct bdi_writeback *wb)
+{
+        if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
+                mem_cgroup_track_foreign_dirty_slowpath(page, wb);
+}
+
+void mem_cgroup_flush_foreign(struct bdi_writeback *wb);
+
 #else   /* CONFIG_CGROUP_WRITEBACK */
 
 static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
@@ -1252,6 +1282,15 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 {
 }
 
+static inline void mem_cgroup_track_foreign_dirty(struct page *page,
+                                                  struct bdi_writeback *wb)
+{
+}
+
+static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
+{
+}
+
 #endif  /* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 26e2999af608..eb626a290d93 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,10 @@ int do_swap_account __read_mostly;
 #define do_swap_account         0
 #endif
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
+#endif
+
 /* Whether legacy memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
@@ -4238,6 +4242,127 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
         }
 }
 
+/*
+ * Foreign dirty flushing
+ *
+ * There's an inherent mismatch between memcg and writeback.  The former
+ * tracks ownership per-page while the latter per-inode.  This was a
+ * deliberate design decision because honoring per-page ownership in the
+ * writeback path is complicated, may lead to higher CPU and IO overheads
+ * and deemed unnecessary given that write-sharing an inode across
+ * different cgroups isn't a common use-case.
+ *
+ * Combined with inode majority-writer ownership switching, this works well
+ * enough in most cases but there are some pathological cases.  For
+ * example, let's say there are two cgroups A and B which keep writing to
+ * different but confined parts of the same inode.  B owns the inode and
+ * A's memory is limited far below B's.  A's dirty ratio can rise enough to
+ * trigger balance_dirty_pages() sleeps but B's can be low enough to avoid
+ * triggering background writeback.  A will be slowed down without a way to
+ * make writeback of the dirty pages happen.
+ *
+ * Conditions like the above can lead to a cgroup getting repeatedly and
+ * severely throttled after making some progress after each
+ * dirty_expire_interval while the underlying IO device is almost
+ * completely idle.
+ *
+ * Solving this problem completely requires matching the ownership tracking
+ * granularities between memcg and writeback in either direction.  However,
+ * the more egregious behaviors can be avoided by simply remembering the
+ * most recent foreign dirtying events and initiating remote flushes on
+ * them when local writeback isn't enough to keep the memory clean enough.
+ *
+ * The following two functions implement such mechanism.  When a foreign
+ * page - a page whose memcg and writeback ownerships don't match - is
+ * dirtied, mem_cgroup_track_foreign_dirty() records the inode owning
+ * bdi_writeback on the page owning memcg.  When balance_dirty_pages()
+ * decides that the memcg needs to sleep due to high dirty ratio, it calls
+ * mem_cgroup_flush_foreign() which queues writeback on the recorded
+ * foreign bdi_writebacks which haven't expired.  Both the numbers of
+ * recorded bdi_writebacks and concurrent in-flight foreign writebacks are
+ * limited to MEMCG_CGWB_FRN_CNT.
+ *
+ * The mechanism only remembers IDs and doesn't hold any object references.
+ * As being wrong occasionally doesn't matter, updates and accesses to the
+ * records are lockless and racy.
+ */
+void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
+                                             struct bdi_writeback *wb)
+{
+        struct mem_cgroup *memcg = page->mem_cgroup;
+        struct memcg_cgwb_frn *frn;
+        u64 now = get_jiffies_64();
+        u64 oldest_at = now;
+        int oldest = -1;
+        int i;
+
+        /*
+         * Pick the slot to use.  If there is already a slot for @wb, keep
+         * using it.  If not replace the oldest one which isn't being
+         * written out.
+         */
+        for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+                frn = &memcg->cgwb_frn[i];
+                if (frn->bdi_id == wb->bdi->id &&
+                    frn->memcg_id == wb->memcg_css->id)
+                        break;
+                if (time_before64(frn->at, oldest_at) &&
+                    atomic_read(&frn->done.cnt) == 1) {
+                        oldest = i;
+                        oldest_at = frn->at;
+                }
+        }
+
+        if (i < MEMCG_CGWB_FRN_CNT) {
+                /*
+                 * Re-using an existing one.  Update timestamp lazily to
+                 * avoid making the cacheline hot.  We want them to be
+                 * reasonably up-to-date and significantly shorter than
+                 * dirty_expire_interval as that's what expires the record.
+                 * Use the shorter of 1s and dirty_expire_interval / 8.
+                 */
+                unsigned long update_intv =
+                        min_t(unsigned long, HZ,
+                              msecs_to_jiffies(dirty_expire_interval * 10) / 8);
+
+                if (time_before64(frn->at, now - update_intv))
+                        frn->at = now;
+        } else if (oldest >= 0) {
+                /* replace the oldest free one */
+                frn = &memcg->cgwb_frn[oldest];
+                frn->bdi_id = wb->bdi->id;
+                frn->memcg_id = wb->memcg_css->id;
+                frn->at = now;
+        }
+}
+
+/* issue foreign writeback flushes for recorded foreign dirtying events */
+void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
+{
+        struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
+        unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10);
+        u64 now = jiffies_64;
+        int i;
+
+        for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+                struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i];
+
+                /*
+                 * If the record is older than dirty_expire_interval,
+                 * writeback on it has already started.  No need to kick it
+                 * off again.  Also, don't start a new one if there's
+                 * already one in flight.
+                 */
+                if (time_after64(frn->at, now - intv) &&
+                    atomic_read(&frn->done.cnt) == 1) {
+                        frn->at = 0;
+                        cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, 0,
+                                               WB_REASON_FOREIGN_FLUSH,
+                                               &frn->done);
+                }
+        }
+}
+
 #else   /* CONFIG_CGROUP_WRITEBACK */
 
 static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
@@ -4760,6 +4885,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
         struct mem_cgroup *memcg;
         unsigned int size;
         int node;
+        int __maybe_unused i;
 
         size = sizeof(struct mem_cgroup);
         size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
@@ -4803,6 +4929,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
         INIT_LIST_HEAD(&memcg->cgwb_list);
+        for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+                memcg->cgwb_frn[i].done =
+                        __WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq);
 #endif
         idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
         return memcg;
@@ -4932,7 +5061,12 @@ static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
         struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+        int __maybe_unused i;
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+        for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+                wb_wait_for_completion(&memcg->cgwb_frn[i].done);
+#endif
         if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
                 static_branch_dec(&memcg_sockets_enabled_key);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1804f64ff43c..50055d2e4ea8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1667,6 +1667,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
                 if (unlikely(!writeback_in_progress(wb)))
                         wb_start_background_writeback(wb);
 
+                mem_cgroup_flush_foreign(wb);
+
                 /*
                  * Calculate global domain's pos_ratio and select the
                  * global dtc by default.
@@ -2427,6 +2429,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
                 task_io_account_write(PAGE_SIZE);
                 current->nr_dirtied++;
                 this_cpu_inc(bdp_ratelimits);
+
+                mem_cgroup_track_foreign_dirty(page, wb);
         }
 }
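
To see the slot-recycling policy of
mem_cgroup_track_foreign_dirty_slowpath() in isolation, here is a
self-contained userspace model (an illustrative simplification: plain
ints instead of atomics and an explicit timestamp instead of jiffies_64;
frn-model.c is not part of the patch):

frn-model.c::

  #include <stdio.h>

  #define FRN_CNT 4       /* mirrors MEMCG_CGWB_FRN_CNT */

  struct frn {
          unsigned long long bdi_id;
          int memcg_id;
          unsigned long long at; /* models the jiffies_64 stamp */
          int in_flight;         /* models done.cnt != 1 */
  };

  static void track(struct frn *frn, unsigned long long bdi_id,
                    int memcg_id, unsigned long long now)
  {
          unsigned long long oldest_at = now;
          int i, oldest = -1;

          for (i = 0; i < FRN_CNT; i++) {
                  /* already tracked: reuse the slot */
                  if (frn[i].bdi_id == bdi_id && frn[i].memcg_id == memcg_id)
                          break;
                  /* victim candidate: oldest slot with no flush in flight */
                  if (frn[i].at < oldest_at && !frn[i].in_flight) {
                          oldest = i;
                          oldest_at = frn[i].at;
                  }
          }

          if (i < FRN_CNT)
                  frn[i].at = now; /* the kernel refreshes this lazily */
          else if (oldest >= 0)
                  frn[oldest] = (struct frn){ bdi_id, memcg_id, now, 0 };
          /* else: all four slots busy flushing, drop the event */
  }

  int main(void)
  {
          struct frn frn[FRN_CNT] = { 0 };

          track(frn, 1, 10, 100);
          track(frn, 2, 20, 101);
          track(frn, 1, 10, 102); /* hits the existing slot 0 */
          printf("slot0: bdi=%llu at=%llu\n", frn[0].bdi_id, frn[0].at);
          return 0;
  }

The kernel version additionally rate-limits the timestamp refresh to the
shorter of one second and dirty_expire_interval / 8 to keep the cacheline
cool, and mem_cgroup_flush_foreign() later consumes the records, zeroing
->at so an already-kicked or in-flight record isn't flushed twice.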