From patchwork Thu Aug 24 03:43:02 2023
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13363506
From: Qi Zheng
To: akpm@linux-foundation.org, david@fromorbit.com, tkhai@ya.ru,
 vbabka@suse.cz, roman.gushchin@linux.dev, djwong@kernel.org,
 brauner@kernel.org, paulmck@kernel.org, tytso@mit.edu,
 steven.price@arm.com, cel@kernel.org, senozhatsky@chromium.org,
 yujie.liu@intel.com, gregkh@linuxfoundation.org, muchun.song@linux.dev
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, Qi Zheng
Subject: [PATCH v5 43/45] mm: shrinker: make memcg slab shrink lockless
Date: Thu, 24 Aug 2023 11:43:02 +0800
Message-Id: <20230824034304.37411-44-zhengqi.arch@bytedance.com>
In-Reply-To: <20230824034304.37411-1-zhengqi.arch@bytedance.com>
References: <20230824034304.37411-1-zhengqi.arch@bytedance.com>

Like global slab shrink, this commit also uses refcount+RCU method to make
memcg slab shrink lockless.
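
As a rough illustration of the refcount half of that method, here is a
minimal userspace sketch using C11 atomics. The names struct obj,
obj_try_get() and obj_put() are hypothetical stand-ins, not the kernel's
shrinker_try_get()/shrinker_put(), whose implementation may differ. It only
shows the rule that a reference can be taken while the count is non-zero,
and that dropping the last reference is the point where a waiter would be
woken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object; not struct shrinker. */
struct obj {
	atomic_int refcount;	/* 0 means the object is being torn down */
	bool released;		/* set when the last reference is dropped */
};

static void obj_init(struct obj *o)
{
	atomic_init(&o->refcount, 1);	/* initial reference held by the owner */
	o->released = false;
}

/* Take a reference only if the count is still non-zero. */
static bool obj_try_get(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; whoever drops the last one marks the release. */
static void obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);
	if (old == 1)
		o->released = true;	/* the patch calls complete() at this point */
}
```

Once the count has reached zero, obj_try_get() keeps failing, which is what
lets a lookup skip an object that is on its way out.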

Use the following script to do slab shrink stress test:

```

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run test and touch commands. Then we can use
the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  40.44%  [kernel]            [k] down_read_trylock
  17.59%  [kernel]            [k] up_read
  13.64%  [kernel]            [k] pv_native_safe_halt
  11.90%  [kernel]            [k] shrink_slab
   8.21%  [kernel]            [k] idr_find
   2.71%  [kernel]            [k] _find_next_bit
   1.36%  [kernel]            [k] shrink_node
   0.81%  [kernel]            [k] shrink_lruvec
   0.80%  [kernel]            [k] __radix_tree_lookup
   0.50%  [kernel]            [k] do_shrink_slab
   0.21%  [kernel]            [k] list_lru_count_one
   0.16%  [kernel]            [k] mem_cgroup_iter

2) After applying this patchset:

  60.17%  [kernel]           [k] shrink_slab
  20.42%  [kernel]           [k] pv_native_safe_halt
   3.03%  [kernel]           [k] do_shrink_slab
   2.73%  [kernel]           [k] shrink_node
   2.27%  [kernel]           [k] shrink_lruvec
   2.00%  [kernel]           [k] __rcu_read_unlock
   1.92%  [kernel]           [k] mem_cgroup_iter
   0.98%  [kernel]           [k] __rcu_read_lock
   0.91%  [kernel]           [k] osq_lock
   0.63%  [kernel]           [k] mem_cgroup_calculate_protection
   0.55%  [kernel]           [k] shrinker_put
   0.46%  [kernel]           [k] list_lru_count_one

We can see that the first perf hotspot becomes shrink_slab, which is what
we expect.

Signed-off-by: Qi Zheng
---
 mm/shrinker.c | 85 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 66 insertions(+), 19 deletions(-)

diff --git a/mm/shrinker.c b/mm/shrinker.c
index 2b8c1f1bbf2d..a66e2a30cc16 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -218,7 +218,6 @@ static int shrinker_memcg_alloc(struct shrinker *shrinker)
 		return -ENOSYS;
 
 	down_write(&shrinker_rwsem);
-	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -252,10 +251,15 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 	struct shrinker_info_unit *unit;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	unit = info->unit[shrinker_id_to_index(shrinker->id)];
-	return atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+	nr_deferred = atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
@@ -263,10 +267,16 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 	struct shrinker_info_unit *unit;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	unit = info->unit[shrinker_id_to_index(shrinker->id)];
-	return atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+	nr_deferred =
+		atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 void reparent_shrinker_deferred(struct mem_cgroup *memcg)
@@ -463,18 +473,54 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 0;
-
-	info = shrinker_info_protected(memcg, nid);
+	/*
+	 * lockless algorithm of memcg shrink.
+	 *
+	 * The shrinker_info may be freed asynchronously via RCU in the
+	 * expand_one_shrinker_info(), so the rcu_read_lock() needs to be used
+	 * to ensure the existence of the shrinker_info.
+	 *
+	 * The shrinker_info_unit is never freed unless its corresponding memcg
+	 * is destroyed. Here we already hold the refcount of memcg, so the
+	 * memcg will not be destroyed, and of course shrinker_info_unit will
+	 * not be freed.
+	 *
+	 * So in the memcg shrink:
+	 *  step 1: use rcu_read_lock() to guarantee existence of the
+	 *          shrinker_info.
+	 *  step 2: after getting shrinker_info_unit we can safely release the
+	 *          RCU lock.
+	 *  step 3: traverse the bitmap and calculate shrinker_id
+	 *  step 4: use rcu_read_lock() to guarantee existence of the shrinker.
+	 *  step 5: use shrinker_id to find the shrinker, then use
+	 *          shrinker_try_get() to guarantee existence of the shrinker,
+	 *          then we can release the RCU lock to do do_shrink_slab() that
+	 *          may sleep.
+	 *  step 6: do shrinker_put() paired with step 5 to put the refcount,
+	 *          if the refcount reaches 0, then wake up the waiter in
+	 *          shrinker_free() by calling complete().
+	 *          Note: here is different from the global shrink, we don't
+	 *                need to acquire the RCU lock to guarantee existence of
+	 *                the shrinker, because we don't need to use this
+	 *                shrinker to traverse the next shrinker in the bitmap.
+	 *  step 7: we have already exited the read-side of rcu critical section
+	 *          before calling do_shrink_slab(), the shrinker_info may be
+	 *          released in expand_one_shrinker_info(), so go back to step 1
+	 *          to reacquire the shrinker_info.
+	 */
+again:
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	if (unlikely(!info))
 		goto unlock;
 
-	for (; index < shrinker_id_to_index(info->map_nr_max); index++) {
+	if (index < shrinker_id_to_index(info->map_nr_max)) {
 		struct shrinker_info_unit *unit;
 
 		unit = info->unit[index];
 
+		rcu_read_unlock();
+
 		for_each_set_bit(offset, unit->map, SHRINKER_UNIT_BITS) {
 			struct shrink_control sc = {
 				.gfp_mask = gfp_mask,
@@ -484,12 +530,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 			struct shrinker *shrinker;
 			int shrinker_id = calc_shrinker_id(index, offset);
 
+			rcu_read_lock();
 			shrinker = idr_find(&shrinker_idr, shrinker_id);
-			if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
-				if (!shrinker)
-					clear_bit(offset, unit->map);
+			if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
+				clear_bit(offset, unit->map);
+				rcu_read_unlock();
 				continue;
 			}
+			rcu_read_unlock();
 
 			/* Call non-slab shrinkers even though kmem is disabled */
 			if (!memcg_kmem_online() &&
@@ -522,15 +570,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 					set_shrinker_bit(memcg, nid, shrinker_id);
 			}
 			freed += ret;
-
-			if (rwsem_is_contended(&shrinker_rwsem)) {
-				freed = freed ? : 1;
-				goto unlock;
-			}
+			shrinker_put(shrinker);
 		}
+
+		index++;
+		goto again;
 	}
 unlock:
-	up_read(&shrinker_rwsem);
+	rcu_read_unlock();
 	return freed;
 }
 #else /* !CONFIG_MEMCG */
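
The control flow described in steps 1 to 7 of the comment can be sketched in
userspace roughly as follows. This is a single-threaded illustration only:
there is no RCU and there are no atomics here, and struct item, struct unit,
struct info, walk(), UNIT_BITS and NR_UNITS are all hypothetical names that
do not exist in the kernel. The sketch shows the shape of the loop: reload
the info pointer on every pass, keep only the unit pointer while scanning,
pin each item before using it, and clear bits whose item is gone or can no
longer be pinned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UNIT_BITS 8	/* hypothetical; stands in for SHRINKER_UNIT_BITS */
#define NR_UNITS  2	/* hypothetical table size for this sketch */

struct item {
	int refcount;	/* 0 means "being freed": lookups must skip it */
	long work;	/* what one scan of this item reports as freed */
};

struct unit {
	unsigned char map;	/* one bit per item slot in this unit */
};

struct info {
	int nr_units;
	struct unit *unit[NR_UNITS];
};

static bool item_try_get(struct item *it)
{
	if (it->refcount == 0)
		return false;
	it->refcount++;
	return true;
}

static void item_put(struct item *it)
{
	assert(it->refcount > 0);
	it->refcount--;
}

static long walk(struct info **infop, struct item *table[])
{
	long freed = 0;

	for (int index = 0; ; index++) {
		/* Steps 1 and 7: reload info on every pass, it may have been replaced. */
		struct info *info = *infop;

		if (!info || index >= info->nr_units)
			break;

		/* Step 2: the unit outlives info, so keep only the unit pointer. */
		struct unit *unit = info->unit[index];

		/* Step 3: visit each set bit and derive the item id. */
		for (int offset = 0; offset < UNIT_BITS; offset++) {
			if (!(unit->map & (1u << offset)))
				continue;

			/* Steps 4 and 5: look the item up and try to pin it. */
			struct item *it = table[index * UNIT_BITS + offset];

			if (!it || !item_try_get(it)) {
				unit->map &= ~(1u << offset);	/* stale bit */
				continue;
			}

			freed += it->work;	/* stands in for do_shrink_slab() */

			/* Step 6: drop the reference taken in step 5. */
			item_put(it);
		}
	}
	return freed;
}
```

In the patch itself the reload happens via "goto again" under
rcu_read_lock(), and the pin is what allows do_shrink_slab() to sleep
outside the RCU read-side critical section.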