From patchwork Tue Dec  1 17:44:49 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yang Shi
X-Patchwork-Id: 11943419
From: Yang Shi
To: guro@fb.com, ktkhai@virtuozzo.com, vdavydov.dev@gmail.com,
    shakeelb@google.com, akpm@linux-foundation.org
Cc: shy828301@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    stable@vger.kernel.org
Subject: [v2 PATCH] mm: list_lru: set shrinker map bit when child nr_items is not zero
Date: Tue, 1 Dec 2020 09:44:49 -0800
Message-Id: <20201201174449.51435-1-shy828301@gmail.com>
X-Mailer: git-send-email 2.26.2

When investigating a slab cache bloat problem, a significant amount of
negative dentry cache was seen, but confusingly it neither got shrunk by
the reclaimer (the host has very tight memory) nor by dropping caches.
The vmcore shows there are over 14M negative dentry objects on the lru,
but tracing shows they were not scanned at all.

Further investigation shows the memcg's vfs shrinker_map bit is not set,
so the reclaimer and dropping caches simply skip calling the vfs shrinker.
We had to reboot the hosts to get the memory back.

I didn't manage to come up with a reproducer in the test environment, and
the problem can't be reproduced after rebooting.  But by code inspection
it seems there is a race between clearing the shrinker map bit and
reparenting.  The hypothesis is elaborated below.

The memcg hierarchy on our production environment looks like:

                root
               /    \
          system    user

The main workloads run under user slice's children, and they create and
remove memcgs frequently.  So reparenting happens very often under user
slice, but no task runs under user slice directly.

With the frequent reparenting and tight memory pressure, the below
hypothetical race condition may happen:

       CPU A                    CPU B                    CPU C
reparent
    dst->nr_items == 0
                            shrinker:
                                total_objects == 0
    add src->nr_items to dst
    set_bit
                                return SHRINK_EMPTY
                                clear_bit
child memcg offline
    replace child's kmemcg_id
    with parent's (in
    memcg_offline_kmem())
                                                     list_lru_del()
                                                         see parent's kmemcg_id
                                                         dec dst->nr_items
reparent again
    dst->nr_items may go
    negative due to concurrent
    list_lru_del() on CPU C
                            The second run of shrinker:
                                read nr_items without any
                                synchronization, so it may
                                see an intermediate negative
                                nr_items, then total_objects
                                may return 0 coincidentally

                                keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add src->nr_items to dst

After this point dst->nr_items may never go back to zero, so reparenting
will never set the shrinker_map bit anymore.  And since no task runs under
user slice directly, no new object will be added to its lru to set the
shrinker map bit either.  The bit stays cleared forever.
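For reference, the pre-patch reparenting logic that the race hits looks
roughly like the sketch below.  It is reconstructed from the removed lines
in the diff further down and trimmed for brevity, so treat it as
illustrative rather than verbatim; the comment marks where a negative
dst->nr_items defeats the bit setting:

    /* Sketch of memcg_drain_list_lru_node() before this patch (not verbatim). */
    static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
                                          int src_idx, struct mem_cgroup *dst_memcg)
    {
            struct list_lru_node *nlru = &lru->node[nid];
            int dst_idx = dst_memcg->kmemcg_id;
            struct list_lru_one *src, *dst;
            bool set;

            spin_lock_irq(&nlru->lock);

            src = list_lru_from_memcg_idx(nlru, src_idx);
            dst = list_lru_from_memcg_idx(nlru, dst_idx);

            list_splice_init(&src->list, &dst->list);

            /*
             * If a concurrent list_lru_del() already drove dst->nr_items
             * negative (see CPU C above), "!dst->nr_items" is false even
             * though objects are being moved in, so set_bit is skipped
             * and the shrinker map bit stays cleared.
             */
            set = (!dst->nr_items && src->nr_items);
            dst->nr_items += src->nr_items;
            if (set)
                    memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
            src->nr_items = 0;

            spin_unlock_irq(&nlru->lock);
    }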
How does list_lru_del() race with reparenting?  Reparenting replaces the
children's kmemcg_id with the parent's without the protection of
nlru->lock, so list_lru_del() may see the parent's kmemcg_id while actually
deleting items from the child's lru, and hence decrements the parent's
nr_items.  So the parent's nr_items may go negative, as commit
2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id on css
offline") says.

Can we move the kmemcg_id replacement after reparenting?  No, because then
the race with list_lru_del() may leave src->nr_items negative, and that
would never be fixed up.  The shrinker might then never return
SHRINK_EMPTY, the shrinker map bit would stay set forever, and the shrinker
would keep being called for nothing.

Can we synchronize list_lru_del() and reparenting?  Yes, it could be done,
but it seems we would need to introduce a new lock or reuse nlru->lock.
Moving the kmemcg_id replacement under nlru->lock sounds complicated, and
list_lru_del() may be called quite often, which would exacerbate some hot
paths, e.g. dentry kill.

Since it is impossible for dst->nr_items to go negative and src->nr_items
to go to zero at the same time, it seems we could simply set the shrinker
map bit iff src->nr_items != 0.  We could also synchronize
list_lru_count_one() and reparenting with nlru->lock, but checking
src->nr_items in reparenting is the simplest and avoids lock contention.

Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
Suggested-by: Roman Gushchin
Cc: Vladimir Davydov
Cc: Kirill Tkhai
Cc: Shakeel Butt
Cc: v4.19+
Signed-off-by: Yang Shi
Reviewed-by: Roman Gushchin
---
v2: * Incorporated Roman's suggestion
    * Incorporated Kirill's suggestion
    * Changed the subject of the patch to align with the new fix
    * Added the Fixes tag

 mm/list_lru.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5aa6e44bc2ae..fe230081690b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	struct list_lru_node *nlru = &lru->node[nid];
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
-	bool set;
 
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,11 +545,12 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	dst = list_lru_from_memcg_idx(nlru, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);
-	set = (!dst->nr_items && src->nr_items);
-	dst->nr_items += src->nr_items;
-	if (set)
+
+	if (src->nr_items) {
+		dst->nr_items += src->nr_items;
 		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
-	src->nr_items = 0;
+		src->nr_items = 0;
+	}
 
 	spin_unlock_irq(&nlru->lock);
 }
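As a postscript for readers unfamiliar with the deletion path referenced
above: list_lru_del() picks the per-memcg list by the object's memcg's
*current* kmemcg_id, so once memcg_offline_kmem() has redirected the
child's kmemcg_id to the parent, deleting an object that still sits on the
child's list decrements the parent's nr_items.  The sketch below is a
heavily simplified assumption of that lookup path; helper names such as
mem_cgroup_from_obj() and memcg_cache_id() vary across kernel versions, so
it is illustrative, not the exact upstream code.

    /* Assumed, simplified sketch of the list_lru_del() lookup path. */
    bool list_lru_del(struct list_lru *lru, struct list_head *item)
    {
            int nid = page_to_nid(virt_to_page(item));
            struct list_lru_node *nlru = &lru->node[nid];
            struct mem_cgroup *memcg;
            struct list_lru_one *l;

            spin_lock(&nlru->lock);
            if (!list_empty(item)) {
                    /*
                     * Pick the per-memcg lru by the current kmemcg_id.
                     * After memcg_offline_kmem() redirects the child's
                     * kmemcg_id to the parent, this returns the parent's
                     * lru even if the item is still linked on the
                     * child's list.
                     */
                    memcg = mem_cgroup_from_obj(item);
                    l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));

                    list_del_init(item);
                    l->nr_items--;    /* may drive the parent's count negative */
                    nlru->nr_items--;
                    spin_unlock(&nlru->lock);
                    return true;
            }
            spin_unlock(&nlru->lock);
            return false;
    }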