From patchwork Wed Jan 8 16:03:57 2020
X-Patchwork-Submitter: Yafang Shao <laoar.shao@gmail.com>
X-Patchwork-Id: 11324007
From: Yafang Shao <laoar.shao@gmail.com>
To: dchinner@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org,
    vdavydov.dev@gmail.com, guro@fb.com, akpm@linux-foundation.org,
    viro@zeniv.linux.org.uk
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH v3 3/3] memcg, inode: protect page cache from freeing inode
Date: Wed, 8 Jan 2020 11:03:57 -0500
Message-Id: <1578499437-1664-4-git-send-email-laoar.shao@gmail.com>
In-Reply-To: <1578499437-1664-1-git-send-email-laoar.shao@gmail.com>
References: <1578499437-1664-1-git-send-email-laoar.shao@gmail.com>

On my server there are some running memcgs protected by memory.{min, low},
but I found that their usage abruptly became very small, far below the
protection limit. That confused me, and I finally found the cause to be
inode stealing: once an inode is freed, all the page cache it hosts is
dropped as well, no matter how many pages that is. So if we intend to
protect the page cache in a memcg, we must protect its host (the inode)
first; otherwise the memcg protection can easily be bypassed by freeing
the inode, especially if there are big files in this memcg.

Suppose we have a memcg with the following state,

    memory.current = 1024M
    memory.min     = 512M

and in this memcg there is an inode holding 800M of page cache.

Once this memcg is scanned by kswapd or another regular reclaimer,

    kswapd                                <<<< can be any of the regular reclaimers
      shrink_node_memcgs
        switch (mem_cgroup_protected())   <<<< not protected
        case MEMCG_PROT_NONE:             <<<< will scan this memcg
            break;
        shrink_lruvec()                   <<<< reclaims the page cache
        shrink_slab()                     <<<< may free this inode and drop all
                                               of its page cache (800M)

So we must protect the inode first if we want to protect the page cache.
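For illustration only, here is a minimal userspace sketch of the comparison
this patch performs (the numbers are just the example values above; the
in-kernel helpers mem_cgroup_size() and mem_cgroup_protection() used in the
patch below are modeled as plain variables here, so this is not kernel code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the check done in memcg_can_reclaim_inode() below,
 * using the example values above (all in MB).
 */
int main(void)
{
	unsigned long cgroup_size = 1024;  /* mem_cgroup_size(): memory.current */
	unsigned long protection  = 512;   /* mem_cgroup_protection(): memory.min */
	unsigned long inode_pages = 800;   /* page cache hosted by the inode */

	/* Freeing the inode would push the memcg below its protection. */
	bool reclaimable = !(inode_pages + protection >= cgroup_size);

	printf("reclaimable = %s\n", reclaimable ? "true" : "false");  /* false */
	return 0;
}

With these numbers the inode is rotated back onto the LRU (LRU_ROTATE)
instead of being freed, so its 800M of page cache stays charged to the memcg.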
The inherent mismatch between memcg and inode is the awkward part. One
inode can be shared by different memcgs, but that is a very rare case. If
an inode is shared, the page cache it hosts may be charged to different
memcgs. Currently there is no perfect solution for this kind of issue, but
the inode majority-writer ownership switching can help it more or less.

Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/inode.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 75 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 2b0f511..80dddbc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -54,6 +54,12 @@
  * inode_hash_lock
  */
 
+struct inode_isolate_control {
+	struct list_head *freeable;
+	struct mem_cgroup *memcg;	/* derived from shrink_control */
+	bool memcg_low_reclaim;		/* derived from scan_control */
+};
+
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
@@ -713,6 +719,61 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	return busy;
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * Once an inode is freed, all its belonging page caches will be dropped as
+ * well, even if there're lots of page caches. So if we intend to protect
+ * page caches in a memcg, we must protect their host(the inode) first.
+ * Otherwise the memcg protection can be easily bypassed with freeing inode,
+ * especially if there're big files in this memcg.
+ * Note that it may happen that the page caches are already charged to the
+ * memcg, but the inode hasn't been added to this memcg yet. In this case,
+ * this inode is not protected.
+ * The inherent mismatch between memcg and inode is a trouble. One inode
+ * can be shared by different MEMCGs, but it is a very rare case. If
+ * an inode is shared, its belonging page caches may be charged to
+ * different MEMCGs. Currently there's no perfect solution to fix this
+ * kind of issue, but the inode majority-writer ownership switching can
+ * help it more or less.
+ */
+static bool memcg_can_reclaim_inode(struct inode *inode,
+				    struct inode_isolate_control *iic)
+{
+	unsigned long cgroup_size;
+	unsigned long protection;
+	struct mem_cgroup *memcg;
+	bool reclaimable = true;
+
+	if (!inode->i_data.nrpages)
+		goto out;
+
+	/* Excludes freeing inode via drop_caches */
+	if (!current->reclaim_state)
+		goto out;
+
+	memcg = iic->memcg;
+	if (!memcg || memcg == root_mem_cgroup)
+		goto out;
+
+	protection = mem_cgroup_protection(memcg, iic->memcg_low_reclaim);
+	if (!protection)
+		goto out;
+
+	cgroup_size = mem_cgroup_size(memcg);
+	if (inode->i_data.nrpages + protection >= cgroup_size)
+		reclaimable = false;
+
+out:
+	return reclaimable;
+}
+#else /* CONFIG_MEMCG_KMEM */
+static bool memcg_can_reclaim_inode(struct inode *inode,
+				    struct inode_isolate_control *iic)
+{
+	return true;
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 /*
  * Isolate the inode from the LRU in preparation for freeing it.
  *
@@ -731,8 +792,9 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 static enum lru_status inode_lru_isolate(struct list_head *item,
 		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
 {
-	struct list_head *freeable = arg;
-	struct inode	*inode = container_of(item, struct inode, i_lru);
+	struct inode_isolate_control *iic = arg;
+	struct list_head *freeable = iic->freeable;
+	struct inode	*inode = container_of(item, struct inode, i_lru);
 
 	/*
 	 * we are inverting the lru lock/inode->i_lock here, so use a trylock.
@@ -741,6 +803,11 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+	if (!memcg_can_reclaim_inode(inode, iic)) {
+		spin_unlock(&inode->i_lock);
+		return LRU_ROTATE;
+	}
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
@@ -798,9 +865,14 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 {
 	LIST_HEAD(freeable);
 	long freed;
+	struct inode_isolate_control iic = {
+		.freeable = &freeable,
+		.memcg = sc->memcg,
+		.memcg_low_reclaim = sc->memcg_low_reclaim,
+	};
 
 	freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
-				     inode_lru_isolate, &freeable);
+				     inode_lru_isolate, &iic);
 	dispose_list(&freeable);
 	return freed;
 }