From patchwork Wed Jul 26 00:29:03 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed
X-Patchwork-Id: 13327332
Date: Wed, 26 Jul 2023 00:29:03 +0000
Message-ID: <20230726002904.655377-1-yosryahmed@google.com>
Subject: [PATCH] mm: memcg: use rstat for non-hierarchical stats
From: Yosry Ahmed
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Andrew Morton
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
 linux-mm@kvack.org, Yosry Ahmed
X-Mailer: git-send-email 2.41.0.487.g6d72f3e995-goog

Currently, memcg uses rstat to maintain hierarchical stats. The rstat
framework keeps track of which cgroups have updates on which cpus.

Since memcg moved to rstat, non-hierarchical stats are no longer readily
available as counters. Instead, the percpu counters for a given stat
need to be summed to get the non-hierarchical stat value. This causes a
performance regression when reading non-hierarchical stats on kernels
where memcg moved to using rstat. This is especially visible when
reading memory.stat on cgroup v1. There are also some code paths
internal to the kernel that read such non-hierarchical stats.

It is inefficient to iterate and sum counters in all cpus when the rstat
framework knows exactly when a percpu counter has an update. Instead,
maintain cpu-aggregated non-hierarchical counters for each stat. During
an rstat flush, keep those updated as well. When reading
non-hierarchical stats, we no longer need to iterate cpus; we just need
to read the maintained counters, similar to hierarchical stats.

A caveat is that we now need a stats flush before reading
local/non-hierarchical stats through {memcg/lruvec}_page_state_local()
or memcg_events_local(), where we previously only needed a flush to read
hierarchical stats. Most contexts reading non-hierarchical stats are
already doing a flush; add a flush to the only missing context, in
count_shadow_nodes().

With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
machine with 256 cpus on cgroup v1:

 # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
 # time cat /sys/fs/cgroup/memory/cg*/memory.stat > /dev/null

Before:
 real	0m0.125s
 user	0m0.005s
 sys	0m0.120s

After:
 real	0m0.032s
 user	0m0.005s
 sys	0m0.027s

Signed-off-by: Yosry Ahmed
---
 include/linux/memcontrol.h |  7 ++++---
 mm/memcontrol.c            | 32 +++++++++++++++++++-------------
 mm/workingset.c            |  1 +
 3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5818af8eca5a..a9f2861a57a5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -112,6 +112,9 @@ struct lruvec_stats {
 	/* Aggregated (CPU and subtree) state */
 	long state[NR_VM_NODE_STAT_ITEMS];
 
+	/* Non-hierarchical (CPU aggregated) state */
+	long state_local[NR_VM_NODE_STAT_ITEMS];
+
 	/* Pending child counts during tree propagation */
 	long state_pending[NR_VM_NODE_STAT_ITEMS];
 };
@@ -1020,14 +1023,12 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 {
 	struct mem_cgroup_per_node *pn;
 	long x = 0;
-	int cpu;
 
 	if (mem_cgroup_disabled())
 		return node_page_state(lruvec_pgdat(lruvec), idx);
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	for_each_possible_cpu(cpu)
-		x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu);
+	x = READ_ONCE(pn->lruvec_stats.state_local[idx]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..90a22637818e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -742,6 +742,10 @@ struct memcg_vmstats {
 	long state[MEMCG_NR_STAT];
 	unsigned long events[NR_MEMCG_EVENTS];
 
+	/* Non-hierarchical (CPU aggregated) page state & events */
+	long state_local[MEMCG_NR_STAT];
+	unsigned long events_local[NR_MEMCG_EVENTS];
+
 	/* Pending child counts during tree propagation */
 	long state_pending[MEMCG_NR_STAT];
 	unsigned long events_pending[NR_MEMCG_EVENTS];
@@ -775,11 +779,8 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
-	long x = 0;
-	int cpu;
+	long x = READ_ONCE(memcg->vmstats->state_local[idx]);
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -926,16 +927,12 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 
 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 {
-	long x = 0;
-	int cpu;
 	int index = memcg_events_index(event);
 
 	if (index < 0)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->events[index], cpu);
-	return x;
+	return READ_ONCE(memcg->vmstats->events_local[index]);
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -5526,7 +5523,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct memcg_vmstats_percpu *statc;
-	long delta, v;
+	long delta, delta_cpu, v;
 	int i, nid;
 
 	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
@@ -5542,9 +5539,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			memcg->vmstats->state_pending[i] = 0;
 
 		/* Add CPU changes on this level since the last flush */
+		delta_cpu = 0;
 		v = READ_ONCE(statc->state[i]);
 		if (v != statc->state_prev[i]) {
-			delta += v - statc->state_prev[i];
+			delta_cpu = v - statc->state_prev[i];
+			delta += delta_cpu;
 			statc->state_prev[i] = v;
 		}
 
@@ -5553,6 +5552,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 
 		/* Aggregate counts on this level and propagate upwards */
 		memcg->vmstats->state[i] += delta;
+		memcg->vmstats->state_local[i] += delta_cpu;
 		if (parent)
 			parent->vmstats->state_pending[i] += delta;
 	}
@@ -5562,9 +5562,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (delta)
 			memcg->vmstats->events_pending[i] = 0;
 
+		delta_cpu = 0;
 		v = READ_ONCE(statc->events[i]);
 		if (v != statc->events_prev[i]) {
-			delta += v - statc->events_prev[i];
+			delta_cpu = v - statc->events_prev[i];
+			delta += delta_cpu;
 			statc->events_prev[i] = v;
 		}
 
@@ -5572,6 +5574,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			continue;
 
 		memcg->vmstats->events[i] += delta;
+		memcg->vmstats->events_local[i] += delta_cpu;
 		if (parent)
 			parent->vmstats->events_pending[i] += delta;
 	}
@@ -5591,9 +5594,11 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			if (delta)
 				pn->lruvec_stats.state_pending[i] = 0;
 
+			delta_cpu = 0;
 			v = READ_ONCE(lstatc->state[i]);
 			if (v != lstatc->state_prev[i]) {
-				delta += v - lstatc->state_prev[i];
+				delta_cpu = v - lstatc->state_prev[i];
+				delta += delta_cpu;
 				lstatc->state_prev[i] = v;
 			}
 
@@ -5601,6 +5606,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 				continue;
 
 			pn->lruvec_stats.state[i] += delta;
+			pn->lruvec_stats.state_local[i] += delta_cpu;
 			if (ppn)
 				ppn->lruvec_stats.state_pending[i] += delta;
 		}
diff --git a/mm/workingset.c b/mm/workingset.c
index 4686ae363000..da58a26d0d4d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -664,6 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
+		mem_cgroup_flush_stats();
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,