From patchwork Thu Aug 17 16:47:33 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nhat Pham <nphamcs@gmail.com>
X-Patchwork-Id: 13356820
From: Nhat Pham <nphamcs@gmail.com>
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, kernel-team@meta.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [PATCH] workingset: ensure memcg is valid for recency check
Date: Thu, 17 Aug 2023 09:47:33 -0700
Message-Id: <20230817164733.2475092-1-nphamcs@gmail.com>
X-Mailer: git-send-email 2.34.1

In the eviction recency check, we currently do not hold a local
reference to the memcg that the refaulted folio belonged to when it was
evicted. This can cause serious memcg lifetime issues, for example in
the memcg hierarchy traversal done in mem_cgroup_get_nr_swap_pages().

This has occurred in production:

[ 155757.793456] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[ 155757.807568] #PF: supervisor read access in kernel mode
[ 155757.818024] #PF: error_code(0x0000) - not-present page
[ 155757.828482] PGD 401f77067 P4D 401f77067 PUD 401f76067 PMD 0
[ 155757.839985] Oops: 0000 [#1] SMP
[ 155757.846444] CPU: 7 PID: 1380944 Comm: ThriftSrv-pri3- Kdump: loaded Tainted: G S 6.4.3-0_fbk1_rc0_594_g8d0cbcaa67ba #1
[ 155757.870808] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
[ 155757.887870] RIP: 0010:mem_cgroup_get_nr_swap_pages+0x3d/0xb0
[ 155757.899377] Code: 29 19 4a 02 48 39 f9 74 63 48 8b 97 c0 00 00 00 48 8b b7 58 02 00 00 48 2b b7 c0 01 00 00 48 39 f0 48 0f 4d c6 48 39 d1 74 42 <48> 8b b2 c0 00 00 00 48 8b ba 58 02 00 00 48 2b ba c0 01 00 00 48
[ 155757.937125] RSP: 0018:ffffc9002ecdfbc8 EFLAGS: 00010286
[ 155757.947755] RAX: 00000000003a3b1c RBX: 000007ffffffffff RCX: ffff888280183000
[ 155757.962202] RDX: 0000000000000000 RSI: 0007ffffffffffff RDI: ffff888bbc2d1000
[ 155757.976648] RBP: 0000000000000001 R08: 000000000000000b R09: ffff888ad9cedba0
[ 155757.991094] R10: ffffea0039c07900 R11: 0000000000000010 R12: ffff888b23a7b000
[ 155758.005540] R13: 0000000000000000 R14: ffff888bbc2d1000 R15: 000007ffffc71354
[ 155758.019991] FS: 00007f6234c68640(0000) GS:ffff88903f9c0000(0000) knlGS:0000000000000000
[ 155758.036356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 155758.048023] CR2: 00000000000000c0 CR3: 0000000a83eb8004 CR4: 00000000007706e0
[ 155758.062473] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 155758.076924] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 155758.091376] PKRU: 55555554
[ 155758.096957] Call Trace:
[ 155758.102016] <TASK>
[ 155758.106502] ? __die+0x78/0xc0
[ 155758.112793] ? page_fault_oops+0x286/0x380
[ 155758.121175] ? exc_page_fault+0x5d/0x110
[ 155758.129209] ? asm_exc_page_fault+0x22/0x30
[ 155758.137763] ? mem_cgroup_get_nr_swap_pages+0x3d/0xb0
[ 155758.148060] workingset_test_recent+0xda/0x1b0
[ 155758.157133] workingset_refault+0xca/0x1e0
[ 155758.165508] filemap_add_folio+0x4d/0x70
[ 155758.173538] page_cache_ra_unbounded+0xed/0x190
[ 155758.182919] page_cache_sync_ra+0xd6/0x1e0
[ 155758.191738] filemap_read+0x68d/0xdf0
[ 155758.199495] ? mlx5e_napi_poll+0x123/0x940
[ 155758.207981] ? __napi_schedule+0x55/0x90
[ 155758.216095] __x64_sys_pread64+0x1d6/0x2c0
[ 155758.224601] do_syscall_64+0x3d/0x80
[ 155758.232058] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 155758.242473] RIP: 0033:0x7f62c29153b5
[ 155758.249938] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 b4 e6 f7 ff 41 89 c0 4c 8b 55 e0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 45 f8 e8 e7 e6 f7 ff 48 8b
[ 155758.288005] RSP: 002b:00007f6234c5ffd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[ 155758.303474] RAX: ffffffffffffffda RBX: 00007f628c4e70c0 RCX: 00007f62c29153b5
[ 155758.318075] RDX: 000000000003c041 RSI: 00007f61d2986000 RDI: 0000000000000076
[ 155758.332678] RBP: 00007f6234c5fff0 R08: 0000000000000000 R09: 0000000064d5230c
[ 155758.347452] R10: 000000000027d450 R11: 0000000000000293 R12: 000000000003c041
[ 155758.362044] R13: 00007f61d2986000 R14: 00007f629e11b060 R15: 000000000027d450
[ 155758.376661] </TASK>

This patch fixes the issue by getting a local reference inside
unpack_shadow().

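For readers less familiar with the pattern, here is a minimal userspace sketch of "look up by ID, then pin with a tryget before use". It is not kernel code: struct obj, obj_tryget(), obj_put() and obj_lookup() are hypothetical stand-ins for the memcg, css_tryget(), mem_cgroup_put() and the lookup in unpack_shadow(), and C11 atomics stand in for the kernel's refcounting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object such as a memcg. */
struct obj {
	atomic_int refs;	/* 0 means the object is being torn down */
	long value;
};

/* Take a reference only if the object is still live (refs > 0). */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refs);

	while (old > 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference previously taken with obj_tryget(). */
static void obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refs, 1);

	assert(old > 0);
}

/* Return a pinned pointer, or NULL if the ID is unused or the object is dying. */
static struct obj *obj_lookup(struct obj *table[], size_t n, size_t id)
{
	struct obj *o = id < n ? table[id] : NULL;

	return (o && obj_tryget(o)) ? o : NULL;
}
```

A caller treats NULL as "the group is gone, ignore the refault" and otherwise pairs every successful obj_lookup() with exactly one obj_put() after its last use of the object.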
Fixes: f78dfc7b77d5 ("workingset: fix confusion around eviction vs refault container")
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Cc: stable@vger.kernel.org
---
 mm/workingset.c | 57 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 22 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..c20b26bb6cb1 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -206,10 +206,11 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 	return xa_mk_value(eviction);
 }
 
-static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp, bool *workingsetp)
+static void unpack_shadow(void *shadow, struct mem_cgroup **memcgp,
+			pg_data_t **pgdat, unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = xa_to_value(shadow);
+	struct mem_cgroup *memcg;
 	int memcgid, nid;
 	bool workingset;
 
@@ -220,7 +221,24 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
 	entry >>= MEM_CGROUP_ID_SHIFT;
 
-	*memcgidp = memcgid;
+	/*
+	 * Look up the memcg associated with the stored ID. It might
+	 * have been deleted since the folio's eviction.
+	 *
+	 * Note that in rare events the ID could have been recycled
+	 * for a new cgroup that refaults a shared folio. This is
+	 * impossible to tell from the available data. However, this
+	 * should be a rare and limited disturbance, and activations
+	 * are always speculative anyway. Ultimately, it's the aging
+	 * algorithm's job to shake out the minimum access frequency
+	 * for the active cache.
+	 */
+	memcg = mem_cgroup_from_id(memcgid);
+	if (memcg && css_tryget(&memcg->css))
+		*memcgp = memcg;
+	else
+		*memcgp = NULL;
+
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry;
 	*workingsetp = workingset;
@@ -262,15 +280,16 @@ static void *lru_gen_eviction(struct folio *folio)
 static bool lru_gen_test_recent(void *shadow, bool file, struct lruvec **lruvec,
 				unsigned long *token, bool *workingset)
 {
-	int memcg_id;
 	unsigned long min_seq;
 	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat;
 
-	unpack_shadow(shadow, &memcg_id, &pgdat, token, workingset);
+	unpack_shadow(shadow, &memcg, &pgdat, token, workingset);
+	if (!mem_cgroup_disabled() && !memcg)
+		return false;
 
-	memcg = mem_cgroup_from_id(memcg_id);
 	*lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	mem_cgroup_put(memcg);
 
 	min_seq = READ_ONCE((*lruvec)->lrugen.min_seq[file]);
 	return (*token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
@@ -421,36 +440,29 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	unsigned long refault_distance;
 	unsigned long workingset_size;
 	unsigned long refault;
-	int memcgid;
 	struct pglist_data *pgdat;
 	unsigned long eviction;
 
 	if (lru_gen_enabled())
 		return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
-	eviction <<= bucket_order;
-
+	unpack_shadow(shadow, &eviction_memcg, &pgdat, &eviction, workingset);
 	/*
-	 * Look up the memcg associated with the stored ID. It might
-	 * have been deleted since the folio's eviction.
+	 * When memcg is enabled, we only get !memcg here if the
+	 * eviction group has been deleted. In that case, ignore
+	 * the refault.
 	 *
-	 * Note that in rare events the ID could have been recycled
-	 * for a new cgroup that refaults a shared folio. This is
-	 * impossible to tell from the available data. However, this
-	 * should be a rare and limited disturbance, and activations
-	 * are always speculative anyway. Ultimately, it's the aging
-	 * algorithm's job to shake out the minimum access frequency
-	 * for the active cache.
+	 * When memcg is disabled, we always get NULL since there
+	 * is no root_mem_cgroup for !CONFIG_MEMCG. Continue; the
+	 * mem_cgroup_lruvec() will get us the global lruvec.
 	 *
-	 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
-	 * would be better if the root_mem_cgroup existed in all
+	 * XXX: It would be better if the root_mem_cgroup existed in all
 	 * configurations instead.
 	 */
-	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
 
+	eviction <<= bucket_order;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
@@ -493,6 +505,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		}
 	}
 
+	mem_cgroup_put(eviction_memcg);
 	return refault_distance <= workingset_size;
 }
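
As background for the unpack_shadow() hunks above, the following is a rough userspace sketch of how several fields can be packed into one shadow word and peeled off again by mask-and-shift, in the same style as the memcgid extraction shown in the diff context. The field widths (WS_BITS, NODE_BITS, MEMCG_BITS) and the names struct shadow_fields, shadow_pack() and shadow_unpack() are illustrative assumptions, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative widths only; the kernel derives its own from the config. */
#define WS_BITS    1
#define NODE_BITS  6
#define MEMCG_BITS 16

struct shadow_fields {
	uint64_t eviction;	/* eviction timestamp, must fit the remaining bits */
	unsigned int memcgid;	/* must fit in MEMCG_BITS */
	unsigned int nid;	/* must fit in NODE_BITS */
	bool workingset;
};

/* Pack fields from most significant (eviction) to least (workingset). */
static uint64_t shadow_pack(struct shadow_fields f)
{
	uint64_t e = f.eviction;

	e = (e << MEMCG_BITS) | f.memcgid;
	e = (e << NODE_BITS) | f.nid;
	e = (e << WS_BITS) | (f.workingset ? 1u : 0u);
	return e;
}

/* Peel the fields off in the reverse order of packing. */
static struct shadow_fields shadow_unpack(uint64_t e)
{
	struct shadow_fields f;

	f.workingset = e & ((1ULL << WS_BITS) - 1);
	e >>= WS_BITS;
	f.nid = e & ((1ULL << NODE_BITS) - 1);
	e >>= NODE_BITS;
	f.memcgid = e & ((1ULL << MEMCG_BITS) - 1);
	e >>= MEMCG_BITS;
	f.eviction = e;
	return f;
}
```

Only the ID survives in the shadow entry, not a pointer, which is why the object it names has to be looked up and pinned again at refault time.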