From patchwork Mon Aug 7 11:09:33 2023
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13343698
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: akpm@linux-foundation.org, david@fromorbit.com, tkhai@ya.ru,
	vbabka@suse.cz, roman.gushchin@linux.dev, djwong@kernel.org,
	brauner@kernel.org, paulmck@kernel.org, tytso@mit.edu,
	steven.price@arm.com, cel@kernel.org, senozhatsky@chromium.org,
	yujie.liu@intel.com, gregkh@linuxfoundation.org,
	muchun.song@linux.dev, simon.horman@corigine.com, dlemoal@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	kvm@vger.kernel.org, xen-devel@lists.xenproject.org,
	linux-erofs@lists.ozlabs.org, linux-f2fs-devel@lists.sourceforge.net,
	cluster-devel@redhat.com, linux-nfs@vger.kernel.org,
	linux-mtd@lists.infradead.org, rcu@vger.kernel.org,
	netdev@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-arm-msm@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org, linux-bcache@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-btrfs@vger.kernel.org, Qi Zheng
Subject: [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
Date: Mon, 7 Aug 2023 19:09:33 +0800
Message-Id: <20230807110936.21819-46-zhengqi.arch@bytedance.com>
In-Reply-To: <20230807110936.21819-1-zhengqi.arch@bytedance.com>
References: <20230807110936.21819-1-zhengqi.arch@bytedance.com>

The shrinker_rwsem is a global read-write lock in the shrinker subsystem
that protects most operations, such as slab shrink and the registration
and unregistration of shrinkers. This can easily cause problems in the
following cases.

1) When memory pressure is high and many filesystems are mounted or
   unmounted at the same time, slab shrink is affected because
   down_read_trylock() fails. One example is the real workload described
   by Kirill Tkhai:

   ```
   One of the real workloads from my experience is start
   of an overcommitted node containing many starting
   containers after node crash (or many resuming containers
   after reboot for kernel update). In these cases memory
   pressure is huge, and the node goes round in long reclaim.
   ```

2) If a shrinker is blocked (such as the case mentioned in [1]) and a
   writer comes in (such as mounting a fs), then this writer will be
   blocked, which in turn blocks all subsequent shrinker-related
   operations.

Even if there is no contention when shrinking slab, there may still be a
problem. With frequent calls to shrink_slab(), down_read_trylock() may
become a perf hotspot. Because atomic operations scale poorly across
cores, this can lead to a significant drop in IPC (instructions per
cycle).
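
To make case 1) concrete, here is a minimal userspace sketch. It is not
kernel code and not part of this patch. The model_* names and the
integer-encoded lock are assumptions made for illustration only; a real
rwsem also tracks waiters and handoff, which this model ignores. The
sketch only shows that a trylock-based shrink path reclaims nothing while
a writer holds the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of an rwsem: -1 = held by a writer, n >= 0 = n readers. */
static atomic_int model_rwsem;

/* Reader side: take the lock only if no writer holds it, never wait. */
static bool model_down_read_trylock(void)
{
	int old = atomic_load(&model_rwsem);

	while (old >= 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&model_rwsem, &old, old + 1))
			return true;
	}
	return false;	/* a writer holds the lock: give up */
}

static void model_up_read(void)
{
	/* Catch an unbalanced unlock in this sketch. */
	assert(atomic_load(&model_rwsem) > 0);
	atomic_fetch_sub(&model_rwsem, 1);
}

/* Writer side, e.g. a mount registering a shrinker (waiting is not modelled). */
static bool model_down_write_trylock(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&model_rwsem, &expected, -1);
}

static void model_up_write(void)
{
	atomic_store(&model_rwsem, 0);
}

/* Old-style shrink path: reclaims nothing at all if the trylock fails. */
static unsigned long model_shrink_slab_old(unsigned long reclaimable)
{
	unsigned long freed;

	if (!model_down_read_trylock())
		return 0;
	freed = reclaimable;	/* pretend every shrinker freed its objects */
	model_up_read();
	return freed;
}
```

In this model, calling model_shrink_slab_old() between
model_down_write_trylock() and model_up_write() always returns 0, which
is the behaviour the lockless conversion is meant to avoid.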

We previously implemented lockless slab shrink with SRCU [2], but the
kernel test robot then reported a -88.8% regression in the
stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].

This commit uses the refcount+RCU method [5] proposed by Dave Chinner
to re-implement the lockless global slab shrink. The memcg slab shrink is
handled in the subsequent patch.

All shrinker instances have now been converted to dynamic allocation and
are freed by call_rcu(), so rcu_read_{lock,unlock}() is enough to ensure
that a shrinker instance is still valid. A shrinker instance will not be
run again after unregistration, so the structure that records the pointer
to the shrinker instance can be safely freed without waiting for the RCU
read-side critical section.

In this way, we implement the lockless slab shrink without needing to
block in unregister_shrinker().

The following are the test results:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
ramfs            735238     60.00     12.37    363.70     12253.05        1955.08
for a 60.01s run time:
   1440.27s available CPU time
     12.36s user time   (  0.86%)
    363.70s system time ( 25.25%)
    376.06s total time  ( 26.11%)
load average: 10.79 4.47 1.69
passed: 9: ramfs (9)
failed: 0
skipped: 0
successful run completed in 60.01s (1 min, 0.01 secs)

2) After applying this patchset:

setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
ramfs            746698     60.00     12.45    376.16     12444.02        1921.47
for a 60.01s run time:
   1440.28s available CPU time
     12.44s user time   (  0.86%)
    376.16s system time ( 26.12%)
    388.60s total time  ( 26.98%)
load average: 9.01 3.85 1.49
passed: 9: ramfs (9)
failed: 0
skipped: 0
successful run completed in 60.01s (1 min, 0.01 secs)

We can see that the ops/s has hardly changed.

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
[2]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
[5]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/

Signed-off-by: Qi Zheng
---
 include/linux/shrinker.h | 17 ++++++++++
 mm/shrinker.c            | 70 +++++++++++++++++++++++++++++-----------
 2 files changed, 68 insertions(+), 19 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index eb342994675a..f06225f18531 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,6 +4,8 @@
 
 #include
 #include
+#include
+#include
 
 #define SHRINKER_UNIT_BITS	BITS_PER_LONG
 
@@ -87,6 +89,10 @@ struct shrinker {
 	int seeks;	/* seeks to recreate an obj */
 	unsigned flags;
 
+	refcount_t refcount;
+	struct completion done;
+	struct rcu_head rcu;
+
 	void *private_data;
 
 	/* These are for internal use */
@@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
 void shrinker_register(struct shrinker *shrinker);
 void shrinker_free(struct shrinker *shrinker);
 
+static inline bool shrinker_try_get(struct shrinker *shrinker)
+{
+	return refcount_inc_not_zero(&shrinker->refcount);
+}
+
+static inline void shrinker_put(struct shrinker *shrinker)
+{
+	if (refcount_dec_and_test(&shrinker->refcount))
+		complete(&shrinker->done);
+}
+
 #ifdef CONFIG_SHRINKER_DEBUG
 extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
 						  const char *fmt, ...);
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 1911c06b8af5..d318f5621862 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -2,6 +2,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include "internal.h"
@@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		goto out;
-
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
 			.memcg = memcg,
 		};
 
+		if (!shrinker_try_get(shrinker))
+			continue;
+
+		/*
+		 * We can safely unlock the RCU lock here since we already
+		 * hold the refcount of the shrinker.
+		 */
+		rcu_read_unlock();
+
 		ret = do_shrink_slab(&sc, shrinker, priority);
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
+
 		/*
-		 * Bail out if someone want to register a new shrinker to
-		 * prevent the registration from being stalled for long periods
-		 * by parallel ongoing shrinking.
+		 * This shrinker may be deleted from shrinker_list and freed
+		 * after the shrinker_put() below, but this shrinker is still
+		 * used for the next traversal. So it is necessary to hold the
+		 * RCU lock first to prevent this shrinker from being freed,
+		 * which also ensures that the next shrinker that is traversed
+		 * will not be freed (even if it is deleted from shrinker_list
+		 * at the same time).
 		 */
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
+		rcu_read_lock();
+		shrinker_put(shrinker);
 	}
 
-	up_read(&shrinker_rwsem);
-out:
+	rcu_read_unlock();
 	cond_resched();
 	return freed;
 }
@@ -671,13 +681,29 @@ void shrinker_register(struct shrinker *shrinker)
 	}
 
 	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
 	up_write(&shrinker_rwsem);
+
+	init_completion(&shrinker->done);
+	/*
+	 * Now the shrinker is fully set up, take the first reference to it to
+	 * indicate that lookup operations are now allowed to use it via
+	 * shrinker_try_get().
+	 */
+	refcount_set(&shrinker->refcount, 1);
 }
 EXPORT_SYMBOL_GPL(shrinker_register);
 
+static void shrinker_free_rcu_cb(struct rcu_head *head)
+{
+	struct shrinker *shrinker = container_of(head, struct shrinker, rcu);
+
+	kfree(shrinker->nr_deferred);
+	kfree(shrinker);
+}
+
 void shrinker_free(struct shrinker *shrinker)
 {
 	struct dentry *debugfs_entry = NULL;
@@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
 	if (!shrinker)
 		return;
 
+	if (shrinker->flags & SHRINKER_REGISTERED) {
+		shrinker_put(shrinker);
+		wait_for_completion(&shrinker->done);
+	}
+
 	down_write(&shrinker_rwsem);
 	if (shrinker->flags & SHRINKER_REGISTERED) {
-		list_del(&shrinker->list);
+		/*
+		 * Lookups on the shrinker are over and will fail in the future,
+		 * so we can now remove it from the lists and free it.
+		 */
+		list_del_rcu(&shrinker->list);
 		debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
 		shrinker->flags &= ~SHRINKER_REGISTERED;
 	} else {
@@ -702,9 +737,6 @@ void shrinker_free(struct shrinker *shrinker)
 	if (debugfs_entry)
 		shrinker_debugfs_remove(debugfs_entry, debugfs_id);
 
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
-
-	kfree(shrinker);
+	call_rcu(&shrinker->rcu, shrinker_free_rcu_cb);
 }
 EXPORT_SYMBOL_GPL(shrinker_free);
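
Illustration only, not part of the patch: the reference-count handshake
between shrinker_try_get(), shrinker_put() and shrinker_free() can be
sketched in userspace C11 roughly as follows. The model_* names and
struct shrinker_model are hypothetical; a plain bool stands in for
struct completion, and the RCU-protected list walk and the call_rcu()
deferred free are deliberately left out. It is a single-threaded sketch
of the counting rules under those assumptions, not a substitute for the
kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the new fields added to struct shrinker. */
struct shrinker_model {
	atomic_uint refcount;	/* 0: not yet registered, or being torn down */
	bool done;		/* stands in for struct completion */
};

/* Freshly allocated shrinker: zeroed, so lookups must fail. */
static void model_init(struct shrinker_model *s)
{
	atomic_init(&s->refcount, 0);
	s->done = false;
}

/* Like the tail of shrinker_register(): publish the initial reference. */
static void model_register(struct shrinker_model *s)
{
	s->done = false;
	atomic_store(&s->refcount, 1);
}

/* Like refcount_inc_not_zero(): take a reference unless the count is 0. */
static bool model_try_get(struct shrinker_model *s)
{
	unsigned int old = atomic_load(&s->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Like shrinker_put(): whoever drops the last reference signals "done". */
static void model_put(struct shrinker_model *s)
{
	/* Catch an unbalanced put in this sketch. */
	assert(atomic_load(&s->refcount) > 0);
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		s->done = true;
}
```

In this model the unregistering side drops the initial reference with
model_put() and would then wait for done, mirroring the
shrinker_put() + wait_for_completion() pair in shrinker_free(). Once the
count has reached zero, model_try_get() can never succeed again, which is
why the entry can then be unlinked and handed to a deferred free.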