From patchwork Wed Jul 26 00:29:04 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed
X-Patchwork-Id: 13327333
Date: Wed, 26 Jul 2023 00:29:04 +0000
In-Reply-To: <20230726002904.655377-1-yosryahmed@google.com>
References: <20230726002904.655377-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.41.0.487.g6d72f3e995-goog
Message-ID: <20230726002904.655377-2-yosryahmed@google.com>
Subject: [PATCH v2] mm: memcg: use rstat for non-hierarchical stats
From: Yosry Ahmed
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Andrew Morton
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
 linux-mm@kvack.org, Yosry Ahmed

Currently, memcg uses rstat to maintain hierarchical stats.
Counters are maintained for hierarchical stats at each memcg. Rstat
tracks which cgroups have updates on which cpus to keep those counters
fresh on the read-side.

For non-hierarchical stats, we do not maintain counters. Instead, the
percpu counters for a given stat need to be summed to get the
non-hierarchical stat value. The original implementation did the same.
At some point before rstat, non-hierarchical counters were introduced by
commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting"). However, those counters were updated on the
performance critical write-side, which caused regressions, so they were
later removed by commit 815744d75152 ("mm: memcontrol: don't batch
updates of local VM stats and events"). See [1] for more detailed
history.

Kernel versions in between a983b5ebee57 & 815744d75152 (a year and a
half) enjoyed cheap reads of non-hierarchical stats, specifically on
cgroup v1. When moving to more recent kernels, a performance regression
for reading non-hierarchical stats is observed.

Now that we have rstat, we know exactly which percpu counters have
updates for each stat. We can maintain non-hierarchical counters again,
making reads much more efficient, without affecting the performance
critical write-side. Hence, add non-hierarchical (i.e. local) counters
for the stats, and extend rstat flushing to keep those up-to-date.

A caveat is that we now need a stats flush before reading
local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
or memcg_events_local(), where we previously only needed a flush to
read hierarchical stats. Most contexts reading non-hierarchical stats
are already doing a flush; add a flush to the only missing context, in
count_shadow_nodes().
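The flush-side bookkeeping described above can be sketched in userspace
C. This is a simplified model, not the kernel code: struct toy_memcg,
TOY_NR_CPUS and all toy_* functions are hypothetical names made up for
illustration, only a single stat is tracked, and locking, READ_ONCE()
and rstat's tracking of updated cgroups are left out. It only aims to
show how a per-CPU delta can feed the local counter separately from the
combined delta that is propagated to the parent.

```c
#include <assert.h>

#define TOY_NR_CPUS 2	/* hypothetical; the kernel walks real CPUs */

/* Toy stand-in for a single stat of one memcg; not the kernel structs. */
struct toy_memcg {
	struct toy_memcg *parent;
	long percpu[TOY_NR_CPUS];	/* the write side only touches these */
	long percpu_prev[TOY_NR_CPUS];	/* per-CPU value at the last flush */
	long state;			/* hierarchical: this level + subtree */
	long state_local;		/* non-hierarchical: this level only */
	long state_pending;		/* child deltas not yet folded in */
};

/* Write side: a plain per-CPU add, no shared counter is touched. */
static void toy_mod_state(struct toy_memcg *cg, int cpu, long val)
{
	cg->percpu[cpu] += val;
}

/* Flush one CPU's updates for one memcg. */
static void toy_flush_cpu(struct toy_memcg *cg, int cpu)
{
	long delta, delta_cpu = 0, v;

	/* Collect what children propagated up since the last flush. */
	delta = cg->state_pending;
	cg->state_pending = 0;

	/* Add this level's own change on this CPU. */
	v = cg->percpu[cpu];
	if (v != cg->percpu_prev[cpu]) {
		delta_cpu = v - cg->percpu_prev[cpu];
		delta += delta_cpu;
		cg->percpu_prev[cpu] = v;
	}

	/* The local counter follows delta_cpu even if delta nets to zero. */
	if (delta_cpu)
		cg->state_local += delta_cpu;

	if (delta) {
		cg->state += delta;
		if (cg->parent)
			cg->parent->state_pending += delta;
	}
}

/* Flush a two-level tree, child before parent on every CPU. */
static void toy_flush(struct toy_memcg *child, struct toy_memcg *parent)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		toy_flush_cpu(child, cpu);
		toy_flush_cpu(parent, cpu);
	}
}

/*
 * A child update of -3 and a parent update of +3 land in the same flush,
 * so the parent's combined delta is zero. Returns the parent's local
 * counter after that flush.
 */
long toy_cancelled_update_demo(void)
{
	struct toy_memcg parent = { 0 };
	struct toy_memcg child = { .parent = &parent };

	toy_mod_state(&child, 0, 5);
	toy_flush(&child, &parent);

	toy_mod_state(&child, 0, -3);
	toy_mod_state(&parent, 0, 3);
	toy_flush(&child, &parent);

	/* Hierarchical value is the own local value plus the child's. */
	assert(parent.state == parent.state_local + child.state);
	return parent.state_local;
}
```

In this model the second flush leaves the parent's hierarchical counter
unchanged, while its local counter still picks up the +3. If the local
update were skipped whenever the combined delta is zero, that +3 would
be lost from the local counter.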

With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
machine with 256 cpus on cgroup v1:
 # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
 # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
 real 0m0.125s
 user 0m0.005s
 sys 0m0.120s

After:
 real 0m0.032s
 user 0m0.005s
 sys 0m0.027s

[1]https://lore.kernel.org/lkml/20230725201811.GA1231514@cmpxchg.org/

Signed-off-by: Yosry Ahmed
Acked-by: Johannes Weiner
Acked-by: Roman Gushchin
---

v1 -> v2:
- Rewrite the changelog based on the history context provided by
  Johannes (Thanks!).
- Fix a subtle bug where updating a local counter would be missed if it
  was cancelled out by a pending update from child memcgs.

---
 include/linux/memcontrol.h |  7 ++--
 mm/memcontrol.c            | 67 +++++++++++++++++++++-----------------
 mm/workingset.c            |  1 +
 3 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5818af8eca5a..a9f2861a57a5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -112,6 +112,9 @@ struct lruvec_stats {
 	/* Aggregated (CPU and subtree) state */
 	long state[NR_VM_NODE_STAT_ITEMS];
 
+	/* Non-hierarchical (CPU aggregated) state */
+	long state_local[NR_VM_NODE_STAT_ITEMS];
+
 	/* Pending child counts during tree propagation */
 	long state_pending[NR_VM_NODE_STAT_ITEMS];
 };
@@ -1020,14 +1023,12 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 {
 	struct mem_cgroup_per_node *pn;
 	long x = 0;
-	int cpu;
 
 	if (mem_cgroup_disabled())
 		return node_page_state(lruvec_pgdat(lruvec), idx);
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	for_each_possible_cpu(cpu)
-		x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu);
+	x = READ_ONCE(pn->lruvec_stats.state_local[idx]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..50f8035e998a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -742,6 +742,10 @@ struct memcg_vmstats {
 	long state[MEMCG_NR_STAT];
 	unsigned long events[NR_MEMCG_EVENTS];
 
+	/* Non-hierarchical (CPU aggregated) page state & events */
+	long state_local[MEMCG_NR_STAT];
+	unsigned long events_local[NR_MEMCG_EVENTS];
+
 	/* Pending child counts during tree propagation */
 	long state_pending[MEMCG_NR_STAT];
 	unsigned long events_pending[NR_MEMCG_EVENTS];
@@ -775,11 +779,8 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
-	long x = 0;
-	int cpu;
+	long x = READ_ONCE(memcg->vmstats->state_local[idx]);
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -926,16 +927,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 
 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 {
-	long x = 0;
-	int cpu;
 	int index = memcg_events_index(event);
 
 	if (index < 0)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->events[index], cpu);
-	return x;
+	return READ_ONCE(memcg->vmstats->events_local[index]);
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -5526,7 +5523,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct memcg_vmstats_percpu *statc;
-	long delta, v;
+	long delta, delta_cpu, v;
 	int i, nid;
 
 	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
@@ -5542,19 +5539,23 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			memcg->vmstats->state_pending[i] = 0;
 
 		/* Add CPU changes on this level since the last flush */
+		delta_cpu = 0;
 		v = READ_ONCE(statc->state[i]);
 		if (v != statc->state_prev[i]) {
-			delta += v - statc->state_prev[i];
+			delta_cpu = v - statc->state_prev[i];
+			delta += delta_cpu;
 			statc->state_prev[i] = v;
 		}
 
-		if (!delta)
-			continue;
-
 		/* Aggregate counts on this level and propagate upwards */
-		memcg->vmstats->state[i] += delta;
-		if (parent)
-			parent->vmstats->state_pending[i] += delta;
+		if (delta_cpu)
+			memcg->vmstats->state_local[i] += delta_cpu;
+
+		if (delta) {
+			memcg->vmstats->state[i] += delta;
+			if (parent)
+				parent->vmstats->state_pending[i] += delta;
+		}
 	}
 
 	for (i = 0; i < NR_MEMCG_EVENTS; i++) {
@@ -5562,18 +5563,22 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (delta)
 			memcg->vmstats->events_pending[i] = 0;
 
+		delta_cpu = 0;
 		v = READ_ONCE(statc->events[i]);
 		if (v != statc->events_prev[i]) {
-			delta += v - statc->events_prev[i];
+			delta_cpu = v - statc->events_prev[i];
+			delta += delta_cpu;
 			statc->events_prev[i] = v;
 		}
 
-		if (!delta)
-			continue;
+		if (delta_cpu)
+			memcg->vmstats->events_local[i] += delta_cpu;
 
-		memcg->vmstats->events[i] += delta;
-		if (parent)
-			parent->vmstats->events_pending[i] += delta;
+		if (delta) {
+			memcg->vmstats->events[i] += delta;
+			if (parent)
+				parent->vmstats->events_pending[i] += delta;
+		}
 	}
 
 	for_each_node_state(nid, N_MEMORY) {
@@ -5591,18 +5596,22 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			if (delta)
 				pn->lruvec_stats.state_pending[i] = 0;
 
+			delta_cpu = 0;
 			v = READ_ONCE(lstatc->state[i]);
 			if (v != lstatc->state_prev[i]) {
-				delta += v - lstatc->state_prev[i];
+				delta_cpu = v - lstatc->state_prev[i];
+				delta += delta_cpu;
 				lstatc->state_prev[i] = v;
 			}
 
-			if (!delta)
-				continue;
+			if (delta_cpu)
+				pn->lruvec_stats.state_local[i] += delta_cpu;
 
-			pn->lruvec_stats.state[i] += delta;
-			if (ppn)
-				ppn->lruvec_stats.state_pending[i] += delta;
+			if (delta) {
+				pn->lruvec_stats.state[i] += delta;
+				if (ppn)
+					ppn->lruvec_stats.state_pending[i] += delta;
+			}
 		}
 	}
 }
diff --git a/mm/workingset.c b/mm/workingset.c
index 4686ae363000..da58a26d0d4d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -664,6 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
+		mem_cgroup_flush_stats();
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,