From: Qi Zheng
To: akpm@linux-foundation.org, hannes@cmpxchg.org, shakeelb@google.com,
    mhocko@kernel.org, roman.gushchin@linux.dev, muchun.song@linux.dev,
    david@redhat.com, shy828301@gmail.com
Cc: tkhai@ya.ru, sultan@kerneltoast.com, dave@stgolabs.net,
    penguin-kernel@I-love.SAKURA.ne.jp, paulmck@kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, Qi Zheng
Subject: [PATCH 0/5] make slab shrink lockless
Date: Mon, 20 Feb 2023 17:16:32 +0800
Message-Id: <20230220091637.64865-1-zhengqi.arch@bytedance.com>
Hi all,

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often find the following system CPU hotspots:

  44.16%  [kernel]  [k] down_read_trylock
  14.12%  [kernel]  [k] up_read
  13.43%  [kernel]  [k] shrink_slab
   5.25%  [kernel]  [k] count_shadow_nodes
   3.42%  [kernel]  [k] idr_find

We then used bpftrace to capture the call trace:

@[
    down_read_trylock+5
    shrink_slab+292
    shrink_node+640
    do_try_to_free_pages+211
    try_to_free_mem_cgroup_pages+266
    try_charge_memcg+386
    charge_memcg+51
    __mem_cgroup_charge+44
    __handle_mm_fault+1416
    handle_mm_fault+260
    do_user_addr_fault+459
    exc_page_fault+104
    asm_exc_page_fault+38
    clear_user_rep_good+18
    read_zero+100
    vfs_read+176
    ksys_read+93
    do_syscall_64+62
    entry_SYSCALL_64_after_hwframe+114
]: 1868979

This shows that the hotspot is caused by frequent failures to acquire the
read lock of shrinker_rwsem when reclaiming slab memory.

Currently, shrinker_rwsem is a global lock, and the following cases may
cause the above system CPU hotspots:

a. The write lock of shrinker_rwsem is held for too long. For example, when
   there are many memcgs in the system, some paths hold the lock while
   traversing all of them (e.g. expand_shrinker_info()).
b. The read lock of shrinker_rwsem is held for too long and a writer
   arrives in the meantime. The writer is then forced to wait, and it
   blocks all subsequent readers. For example:
   - a reader is scheduled out while holding the read lock of
     shrinker_rwsem in do_shrink_slab()
   - some shrinkers block for too long, as in the case described in the
     patchset [1].

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/

All of the down_read_trylock() hotspots caused by the above cases can be
eliminated by replacing the shrinker_rwsem trylocks with SRCU.

2. Survey
=========

Before starting on the implementation, I found that there had been many
similar submissions in the community:

a. Davidlohr Bueso submitted a patch in 2015.
   Subject: [PATCH -next v2] mm: srcu-ify shrinkers
   Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
   Result: It was eventually merged into the linux-next branch, but failed
           on arm allnoconfig (without CONFIG_SRCU).

b. Tetsuo Handa submitted a patchset in 2017.
   Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
   Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
   Result: The current simple approach (break when rwsem_is_contended())
           was chosen in the end. Christoph Hellwig suggested using SRCU,
           but SRCU was not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
   Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
   Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
   Result: At that time, SRCU was not unconditionally enabled, and there
           were some objections to enabling SRCU. Later, Kirill's focus
           moved to other things and the patchset was no longer updated.

d. Sultan Alsawaf submitted a patch in 2021.
   Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
   Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
   Result: Rejected because SRCU was not unconditionally enabled.

Almost all of these earlier submissions were abandoned because SRCU was not
unconditionally enabled. SRCU has now been unconditionally enabled by
Paul E. McKenney in 2023 [2], so it's time to replace the shrinker_rwsem
trylocks with SRCU.

[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/

3. Reproduction and testing
===========================

The down_read_trylock() hotspot can be reproduced with the following
script:

```
#!/bin/bash
DIR="/root/shrinker/memcg/mnt"

# Create a parent memcg limited to 200M, plus child memcgs 0..$1,
# each with its own mount point directory.
do_create()
{
	mkdir /sys/fs/cgroup/memory/test
	echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
	for i in `seq 0 $1`;
	do
		mkdir /sys/fs/cgroup/memory/test/$i;
		echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
		mkdir -p $DIR/$i;
	done
}

# Mount one tmpfs per directory in the range $1..$2.
do_mount()
{
	for i in `seq $1 $2`;
	do
		mount -t tmpfs $i $DIR/$i;
	done
}

# Move this shell into each child memcg in turn and write a 1M file
# in the background.
do_touch()
{
	for i in `seq $1 $2`;
	do
		echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
		dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
	done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Saving and executing the above script gives the following perf hotspots:

  46.60%  [kernel]  [k] down_read_trylock
  18.70%  [kernel]  [k] up_read
  15.44%  [kernel]  [k] shrink_slab
   4.37%  [kernel]  [k] _find_next_bit
   2.75%  [kernel]  [k] xa_load
   2.07%  [kernel]  [k] idr_find
   1.73%  [kernel]  [k] do_shrink_slab
   1.42%  [kernel]  [k] shrink_lruvec
   0.74%  [kernel]  [k] shrink_node
   0.60%  [kernel]  [k] list_lru_count_one

After applying this patchset, the hotspots become:

  19.53%  [kernel]  [k] _find_next_bit
  14.63%  [kernel]  [k] do_shrink_slab
  14.58%  [kernel]  [k] shrink_slab
  11.83%  [kernel]  [k] shrink_lruvec
   9.33%  [kernel]  [k] __blk_flush_plug
   6.67%  [kernel]  [k] mem_cgroup_iter
   3.73%  [kernel]  [k] list_lru_count_one
   2.43%  [kernel]  [k] shrink_node
   1.96%  [kernel]  [k] super_cache_count
   1.78%  [kernel]  [k] __rcu_read_unlock
   1.38%  [kernel]  [k] __srcu_read_lock
   1.30%  [kernel]  [k] xas_descend

Slab reclaim is no longer blocked by the shrinker_rwsem trylock, so slab
reclaim is now lockless.

This series is based on next-20230217.

Comments and suggestions are welcome.

Thanks,
Qi.

Qi Zheng (5):
  mm: vmscan: make global slab shrink lockless
  mm: vmscan: make memcg slab shrink lockless
  mm: shrinkers: make count and scan in shrinker debugfs lockless
  mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  mm: shrinkers: convert shrinker_rwsem to mutex

 drivers/md/dm-cache-metadata.c |   2 +-
 drivers/md/dm-thin-metadata.c  |   2 +-
 fs/super.c                     |   2 +-
 mm/shrinker_debug.c            |  38 ++++-------
 mm/vmscan.c                    | 119 ++++++++++++++++-----------------
 5 files changed, 76 insertions(+), 87 deletions(-)