From patchwork Thu Aug 31 16:56:11 2023
From: Yosry Ahmed
Date: Thu, 31 Aug 2023 16:56:11 +0000
Subject: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
 linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
 Yosry Ahmed
Message-ID: <20230831165611.2610118-5-yosryahmed@google.com>
In-Reply-To: <20230831165611.2610118-1-yosryahmed@google.com>
References: <20230831165611.2610118-1-yosryahmed@google.com>

Unified flushing allows for great concurrency for paths that attempt
to flush the stats, at the expense of
potential staleness and a single flusher paying the extra cost of
flushing the full tree.

This tradeoff makes sense for in-kernel flushers that may observe high
concurrency (e.g. reclaim, refault). For userspace readers, stale stats
may be unexpected and problematic, especially when such stats are used
for critical paths such as userspace OOM handling. Additionally, a
userspace reader will occasionally pay the cost of flushing the entire
hierarchy, which also causes problems in some cases [1].

Opt userspace reads out of unified flushing. This makes the cost of
reading the stats more predictable (proportional to the size of the
subtree), as well as the freshness of the stats. Since userspace
readers are not expected to have concurrency comparable to in-kernel
flushers, serializing them among themselves and with in-kernel
flushers should be okay. Nonetheless, for extra safety, introduce a
mutex when flushing for userspace readers, to make sure only a single
userspace reader can compete with in-kernel flushers at a time. This
takes away userspace's ability to directly influence or hurt in-kernel
lock contention.

An alternative is to remove flushing from the stats reading path
completely and rely on the periodic flusher. This should be
accompanied by making the periodic flushing period tunable and
providing an interface for userspace to force a flush, following a
similar model to /proc/vmstat. However, such a change would be hard to
reverse if the implementation needs to change, because:
- The cost of reading stats will be very cheap, and we won't be able
  to take that back easily.
- There are user-visible interfaces involved.

Hence, let's go with the change that's most reversible first and
revisit as needed.

This was tested on a machine with 256 cpus by running a synthetic test
script [2] that creates 50 top-level cgroups, each with 5 children
(250 leaf cgroups).
Each leaf cgroup has 10 processes running that allocate memory beyond
the cgroup limit, invoking reclaim (which is an in-kernel unified
flusher). Concurrently, one thread is spawned per cgroup to read the
stats every second (including root, top-level, and leaf cgroups -- so
251 threads in total). No significant regressions were observed in the
total run time, which means that userspace readers are not
significantly affecting in-kernel flushers:

Base (mm-unstable):
real	0m22.500s
user	0m9.399s
sys	73m41.381s

real	0m22.749s
user	0m15.648s
sys	73m13.113s

real	0m22.466s
user	0m10.000s
sys	73m11.933s

With this patch:
real	0m23.092s
user	0m10.110s
sys	75m42.774s

real	0m22.277s
user	0m10.443s
sys	72m7.182s

real	0m24.127s
user	0m12.617s
sys	78m52.765s

[1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/

Signed-off-by: Yosry Ahmed
Acked-by: Michal Hocko
---
 mm/memcontrol.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94d5a6751a9e..46a7abf71c73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static DEFINE_MUTEX(stats_user_flush_mutex);
 static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
@@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
+/*
+ * mem_cgroup_user_flush_stats - do a stats flush for a user read
+ * @memcg: memory cgroup to flush
+ *
+ * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
+ * the global rstat spinlock. This protects in-kernel flushers from userspace
+ * readers hogging the lock.
+ */
+static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
+{
+	mutex_lock(&stats_user_flush_mutex);
+	do_stats_flush(memcg);
+	mutex_unlock(&stats_user_flush_mutex);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -1608,7 +1624,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4050,7 +4066,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4125,7 +4141,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -6642,7 +6658,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
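For readers following along outside the kernel tree: the serialization idea in
this patch (funnel all userspace readers through one sleepable mutex so that at
most one of them at a time competes for the heavier shared lock that in-kernel
flushers also take) can be sketched as a userspace POSIX-threads analogue. This
is an illustration only, not the kernel implementation; the function and
variable names below are stand-ins, and a second mutex stands in for the
global rstat spinlock.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for the global rstat spinlock that in-kernel flushers take. */
static pthread_mutex_t rstat_lock = PTHREAD_MUTEX_INITIALIZER;
/* Gate for userspace readers, mirroring stats_user_flush_mutex. */
static pthread_mutex_t stats_user_flush_mutex = PTHREAD_MUTEX_INITIALIZER;
static long total_flushes; /* protected by rstat_lock */

/* Stand-in for do_stats_flush(): touch shared state under rstat_lock. */
static void do_stats_flush(void)
{
	pthread_mutex_lock(&rstat_lock);
	total_flushes++; /* "flush the subtree" */
	pthread_mutex_unlock(&rstat_lock);
}

/* Only one userspace reader at a time ever reaches rstat_lock. */
static void user_flush_stats(void)
{
	pthread_mutex_lock(&stats_user_flush_mutex);
	do_stats_flush();
	pthread_mutex_unlock(&stats_user_flush_mutex);
}

static void *reader(void *arg)
{
	int iters = *(int *)arg;

	for (int i = 0; i < iters; i++)
		user_flush_stats();
	return NULL;
}

/* Spawn nthreads readers doing iters flushes each; return the total. */
long run_readers(int nthreads, int iters)
{
	pthread_t tids[64];

	assert(nthreads <= 64);
	total_flushes = 0;
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, reader, &iters);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	return total_flushes;
}
```

The point of the gate is that the other readers sleep on the outer mutex
instead of piling onto the shared lock, so flushers on the inner lock see at
most one extra contender regardless of how many readers exist.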