From patchwork Mon Mar 13 11:28:13 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13172282
From: Qi Zheng
To: akpm@linux-foundation.org, tkhai@ya.ru, vbabka@suse.cz,
 christian.koenig@amd.com, hannes@cmpxchg.org, shakeelb@google.com,
 mhocko@kernel.org, roman.gushchin@linux.dev, muchun.song@linux.dev,
 david@redhat.com, shy828301@gmail.com
Cc: sultan@kerneltoast.com, dave@stgolabs.net,
 penguin-kernel@I-love.SAKURA.ne.jp, paulmck@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [PATCH v5 2/8] mm: vmscan: make global slab shrink lockless
Date: Mon, 13 Mar 2023 19:28:13 +0800
Message-Id: <20230313112819.38938-3-zhengqi.arch@bytedance.com>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20230313112819.38938-1-zhengqi.arch@bytedance.com>
References: <20230313112819.38938-1-zhengqi.arch@bytedance.com>

The shrinker_rwsem is a global read-write lock in the shrinkers
subsystem, which protects
most operations such as slab shrink, registration and unregistration of
shrinkers, etc. This can easily cause problems in the following cases.

1) When the memory pressure is high and there are many filesystems
   mounted or unmounted at the same time, slab shrink will be affected
   (down_read_trylock() fails). For example, the real workload
   mentioned by Kirill Tkhai:

   ```
   One of the real workloads from my experience is start of an
   overcommitted node containing many starting containers after node
   crash (or many resuming containers after reboot for kernel update).
   In these cases memory pressure is huge, and the node goes round in
   long reclaim.
   ```

2) If a shrinker is blocked (such as the case mentioned in [1]) and a
   writer comes in (such as mounting a fs), then this writer will be
   blocked and cause all subsequent shrinker-related operations to be
   blocked.

Even if there is no competitor when shrinking slab, there may still be
a problem. If we have a long shrinker list and we do not reclaim enough
memory with each shrinker, then down_read_trylock() may be called with
high frequency. Because of the poor multicore scalability of atomic
operations, this can lead to a significant drop in IPC (instructions
per cycle).

Many times in the past ([2],[3],[4],[5]), people have wanted to replace
the shrinker_rwsem trylock with SRCU in the slab shrink, but all these
patches were abandoned because SRCU was not unconditionally enabled.

But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"), SRCU is
unconditionally enabled. So it's time to use SRCU to protect readers
who previously held shrinker_rwsem.

This commit uses SRCU to make global slab shrink lockless; the memcg
slab shrink is handled in the subsequent patch.

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
[2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
[3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
[4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
[5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/

Signed-off-by: Qi Zheng
Acked-by: Vlastimil Babka
Acked-by: Kirill Tkhai
---
 mm/vmscan.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9a2a6301052c..db2ed6e08f67 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -202,6 +203,7 @@ static void set_task_reclaim_state(struct task_struct *task,

 LIST_HEAD(shrinker_list);
 DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_SRCU(shrinker_srcu);

 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
@@ -700,7 +702,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 void register_shrinker_prepared(struct shrinker *shrinker)
 {
 	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
 	up_write(&shrinker_rwsem);
@@ -754,13 +756,15 @@ void unregister_shrinker(struct shrinker *shrinker)
 		return;

 	down_write(&shrinker_rwsem);
-	list_del(&shrinker->list);
+	list_del_rcu(&shrinker->list);
 	shrinker->flags &= ~SHRINKER_REGISTERED;
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
 	debugfs_entry = shrinker_debugfs_remove(shrinker);
 	up_write(&shrinker_rwsem);

+	synchronize_srcu(&shrinker_srcu);
+
 	debugfs_remove_recursive(debugfs_entry);

 	kfree(shrinker->nr_deferred);
@@ -780,6 +784,7 @@ void synchronize_shrinkers(void)
 {
 	down_write(&shrinker_rwsem);
 	up_write(&shrinker_rwsem);
+	synchronize_srcu(&shrinker_srcu);
 }
 EXPORT_SYMBOL(synchronize_shrinkers);

@@ -990,6 +995,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
+	int srcu_idx;

 	/*
 	 * The root memcg might be allocated even though memcg is disabled
@@ -1001,10 +1007,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

-	if (!down_read_trylock(&shrinker_rwsem))
-		goto out;
+	srcu_idx = srcu_read_lock(&shrinker_srcu);

-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
+				 srcu_read_lock_held(&shrinker_srcu)) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -1015,19 +1021,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
-		/*
-		 * Bail out if someone want to register a new shrinker to
-		 * prevent the registration from being stalled for long periods
-		 * by parallel ongoing shrinking.
-		 */
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
 	}

-	up_read(&shrinker_rwsem);
-out:
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	cond_resched();
 	return freed;
 }