From patchwork Wed Dec 18 11:46:33 2024
X-Patchwork-Submitter: Kairui Song
X-Patchwork-Id: 13913541
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Hugh Dickins, "Huang, Ying", Yosry Ahmed, Roman Gushchin, Shakeel Butt, Johannes Weiner, Barry Song, Michal Hocko, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v3 4/4] mm/swap_cgroup: decouple swap cgroup recording and clearing
Date: Wed, 18 Dec 2024 19:46:33 +0800
Message-ID: <20241218114633.85196-5-ryncsn@gmail.com>
In-Reply-To: <20241218114633.85196-1-ryncsn@gmail.com>
References: <20241218114633.85196-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

The
current implementation of swap cgroup tracking is a bit complex and
fragile:

On the charging path, swap_cgroup_record always records an actual memcg
id, and it relies on the caller to ensure all entries passed in belong
to one single folio. As folios are always charged or uncharged as a
whole, and always charged and uncharged in order, swap_cgroup doesn't
need an extra lock.

On the uncharging path, swap_cgroup_record always sets the record to
zero. These entries won't be charged again until uncharging is done, so
there is no extra lock needed either. Worth noting that swap cgroup
clearing may happen without a folio involved, e.g. an exiting process
will zap its page table without swapin.

xchg/cmpxchg provides the atomic operations and barriers that ensure
these swap cgroup records suffer no tearing or synchronization issues.

It works, but it is quite error-prone. Things can be much clearer and
more robust if recording and clearing are decoupled into two helpers:
recording takes the actual folio being charged as an argument, clearing
always sets the record to zero, and the debug sanity checks are refined
to better reflect their usage.

Benchmarks even showed a very slight improvement, as it saved some
extra arguments and lookups:

make -j96 with defconfig on tmpfs in 1.5G memory cgroup using 4k folios:
Before: sys 9617.23 (stdev 37.764062)
After : sys 9541.54 (stdev 42.973976)

make -j96 with defconfig on tmpfs in 2G memory cgroup using 64k folios:
Before: sys 7358.98 (stdev 54.927593)
After : sys 7337.82 (stdev 39.398956)

Suggested-by: Chris Li
Signed-off-by: Kairui Song
---
 include/linux/swap_cgroup.h | 12 ++++---
 mm/memcontrol.c             | 13 +++-----
 mm/swap_cgroup.c            | 66 +++++++++++++++++++++++--------------
 3 files changed, 55 insertions(+), 36 deletions(-)

diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index d521ad1c4164..b5ec038069da 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -6,8 +6,8 @@
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
 
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-					 unsigned int nr_ents);
+extern void swap_cgroup_record(struct folio *folio, swp_entry_t ent);
+extern unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
@@ -15,8 +15,12 @@ extern void swap_cgroup_swapoff(int type);
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+void swap_cgroup_record(struct folio *folio, swp_entry_t ent)
+{
+}
+
+static inline
+unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
 {
 	return 0;
 }

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79900a486ed1..ca1ea84b4bce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4973,7 +4973,6 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
 	unsigned int nr_entries;
-	unsigned short oldid;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
@@ -5000,11 +4999,10 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	/* Get references for the tail pages, too */
 	if (nr_entries > 1)
 		mem_cgroup_id_get_many(swap_memcg, nr_entries - 1);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg),
-				   nr_entries);
-	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
+	swap_cgroup_record(folio, entry);
+
 	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;
@@ -5035,7 +5033,6 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 	unsigned int nr_pages = folio_nr_pages(folio);
 	struct page_counter *counter;
 	struct mem_cgroup *memcg;
-	unsigned short oldid;
 
 	if (do_memsw_account())
 		return 0;
@@ -5064,10 +5061,10 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 	/* Get references for the tail pages, too */
 	if (nr_pages > 1)
 		mem_cgroup_id_get_many(memcg, nr_pages - 1);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_pages);
-	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
 
+	swap_cgroup_record(folio, entry);
+
 	return 0;
 }
@@ -5081,7 +5078,7 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	struct mem_cgroup *memcg;
 	unsigned short id;
 
-	id = swap_cgroup_record(entry, 0, nr_pages);
+	id = swap_cgroup_clear(entry, nr_pages);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {

diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index cf0445cb35ed..be39078f255b 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -21,17 +21,6 @@ struct swap_cgroup_ctrl {
 
 static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
 
-/*
- * SwapCgroup implements "lookup" and "exchange" operations.
- * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
- * against SwapCache. At swap_free(), this is accessed directly from swap.
- *
- * This means,
- * - we have no race in "exchange" when we're accessed via SwapCache because
- *   SwapCache(and its swp_entry) is under lock.
- * - When called via swap_free(), there is no user of this entry and no race.
- *   Then, we don't need lock around "exchange".
- */
 static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
					       pgoff_t offset)
 {
@@ -63,29 +52,58 @@ static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
 }
 
 /**
- * swap_cgroup_record - record mem_cgroup for a set of swap entries
+ * swap_cgroup_record - record mem_cgroup for a set of swap entries.
+ * These entries must belong to one single folio, and that folio
+ * must be being charged for swap space (swap out), and these
+ * entries must not have been charged
+ *
+ * @folio: the folio that the swap entry belongs to
+ * @ent: the first swap entry to be recorded
+ */
+void swap_cgroup_record(struct folio *folio, swp_entry_t ent)
+{
+	unsigned int nr_ents = folio_nr_pages(folio);
+	struct swap_cgroup *map;
+	pgoff_t offset, end;
+	unsigned short old;
+
+	offset = swp_offset(ent);
+	end = offset + nr_ents;
+	map = swap_cgroup_ctrl[swp_type(ent)].map;
+
+	do {
+		old = __swap_cgroup_id_xchg(map, offset,
+					    mem_cgroup_id(folio_memcg(folio)));
+		VM_BUG_ON(old);
+	} while (++offset != end);
+}
+
+/**
+ * swap_cgroup_clear - clear mem_cgroup for a set of swap entries.
+ * These entries must be being uncharged from swap. They either
+ * belongs to one single folio in the swap cache (swap in for
+ * cgroup v1), or no longer have any users (slot freeing).
+ *
 * @ent: the first swap entry to be recorded into
- * @id: mem_cgroup to be recorded
 * @nr_ents: number of swap entries to be recorded
 *
- * Returns old value at success, 0 at failure.
- * (Of course, old value can be 0.)
+ * Returns the existing old value.
 */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
 {
-	struct swap_cgroup_ctrl *ctrl;
 	pgoff_t offset = swp_offset(ent);
 	pgoff_t end = offset + nr_ents;
-	unsigned short old, iter;
 	struct swap_cgroup *map;
+	unsigned short old, iter = 0;
 
-	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
-	map = ctrl->map;
+	offset = swp_offset(ent);
+	end = offset + nr_ents;
+	map = swap_cgroup_ctrl[swp_type(ent)].map;
 
-	old = __swap_cgroup_id_lookup(map, offset);
 	do {
-		iter = __swap_cgroup_id_xchg(map, offset, id);
+		old = __swap_cgroup_id_xchg(map, offset, 0);
+		if (!iter)
+			iter = old;
 		VM_BUG_ON(iter != old);
 	} while (++offset != end);

@@ -119,7 +137,7 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
 	BUILD_BUG_ON(sizeof(unsigned short) * ID_PER_SC !=
		     sizeof(struct swap_cgroup));
-	map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
+	map = vzalloc(DIV_ROUND_UP(max_pages, ID_PER_SC) *
 		      sizeof(struct swap_cgroup));
 	if (!map)
 		goto nomem;