From patchwork Sun Oct 27 00:14:44 2024
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13852390
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar,
    Usama Arif, David Hildenbrand, Baolin Wang, Chris Li, Yosry Ahmed,
    "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner,
    Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
Subject: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for
    nearly full memcg
Date: Sun, 27 Oct 2024 13:14:44 +1300
Message-Id: <20241027001444.3233-1-21cnbao@gmail.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
From: Barry Song

Always using mTHP in a memcg, even when the memcg is at full capacity,
may not be the best option. Consider a system that uses only small
folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
pages of buffer space before it can initiate the next reclamation.
However, large folios can quickly fill this space, rapidly bringing the
memcg back to full capacity, even though some portions of the large
folios may not be immediately needed or used by the process.

Usama and Kanchana identified a regression when building the kernel in
a memcg with memory.max set to a small value while enabling large folio
swap-in support on zswap[1]. The issue arises from an edge case where
the memory cgroup remains nearly full most of the time. Consequently,
bringing in mTHP can quickly cause a memcg overflow, triggering a
swap-out. The subsequent swap-in then recreates the overflow, resulting
in a repetitive cycle.

We need a mechanism to stop the cup from overflowing continuously. One
potential solution is to slow the filling process when we identify that
the cup is nearly full.

Usama reported an improvement when we avoid mTHP swap-in as the memcg
approaches full capacity[2]:

int mem_cgroup_swapin_charge_folio(...)
{
	...
	if (folio_test_large(folio) &&
	    mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
					   folio_nr_pages(folio)))
		ret = -ENOMEM;
	else
		ret = charge_memcg(folio, memcg, gfp);
	...
}

AMD 16K+32K THP=always

metric        mm-unstable    mm-unstable +     mm-unstable +
                             large folio       large folio zswapin +
                             zswapin series    no swap thrashing fix
real          1m23.038s      1m23.050s         1m22.704s
user          53m57.210s     53m53.437s        53m52.577s
sys           7m24.592s      7m48.843s         7m22.519s
zswpin        612070         999244            815934
zswpout       2226403        2347979           2054980
pgfault       20667366       20481728          20478690
pgmajfault    385887         269117            309702

AMD 16K+32K+64K THP=always

metric        mm-unstable    mm-unstable +     mm-unstable +
                             large folio       large folio zswapin +
                             zswapin series    no swap thrashing fix
real          1m22.975s      1m23.266s         1m22.549s
user          53m51.302s     53m51.069s        53m46.471s
sys           7m40.168s      7m57.104s         7m25.012s
zswpin        676492         1258573           1225703
zswpout       2449839        2714767           2899178
pgfault       17540746       17296555          17234663
pgmajfault    429629         307495            287859

I wonder if we can extend the mitigation to do_anonymous_page() as
well. Since I don't have hardware with TLB coalescing or CONT-PTE
(such as AMD machines or ARM servers), I conducted a quick test on my
Intel i9 workstation with 10 cores and 2 threads per core. I enabled
one 12 GiB zRAM device while running kernel builds in a memcg with
memory.max set to 1 GiB.

$ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
$ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
$ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

$ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
	CROSS_COMPILE=aarch64-linux-gnu- Image -j10 1>/dev/null 2>/dev/null

             disable-mTHP-swapin   mm-unstable    with-this-patch
Real:        6m54.595s             7m4.832s       6m45.811s
User:        66m42.795s            66m59.984s     67m21.150s
Sys:         12m7.092s             15m18.153s     12m52.644s
pswpin:      4262327               11723248       5918690
pswpout:     14883774              19574347       14026942
64k-swpout:  624447                889384         480039
32k-swpout:  115473                242288         73874
16k-swpout:  158203                294672         109142
64k-swpin:   0                     495869         159061
32k-swpin:   0                     219977         56158
16k-swpin:   0                     223501         81445
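For reference, the per-size swpout/swpin numbers above were collected
from the per-order mTHP counters in sysfs. Below is a minimal sketch
for reading them; the stats directory layout and the swpout/swpin file
names match the per-size mTHP counters in recent mm-unstable kernels,
but treat the exact paths as assumptions for your kernel version.

/*
 * Read per-size mTHP swap counters, e.g. 16/32/64 KiB swpin/swpout.
 * The sysfs layout is assumed from the per-order mTHP stats in recent
 * kernels; file names may differ depending on kernel version.
 */
#include <stdio.h>

static long read_stat(int kb, const char *name)
{
	char path[128];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/transparent_hugepage/hugepages-%dkB/stats/%s",
		 kb, name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	const int sizes[] = { 16, 32, 64 };

	for (int i = 0; i < 3; i++)
		printf("%dk-swpout: %ld, %dk-swpin: %ld\n",
		       sizes[i], read_stat(sizes[i], "swpout"),
		       sizes[i], read_stat(sizes[i], "swpin"));
	return 0;
}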
I need Usama's assistance to identify a suitable patch, as I lack
access to hardware such as AMD machines and ARM servers with TLB
optimization.

[1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
[2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/

Cc: Kanchana P Sridhar
Cc: Usama Arif
Cc: David Hildenbrand
Cc: Baolin Wang
Cc: Chris Li
Cc: Yosry Ahmed
Cc: "Huang, Ying"
Cc: Kairui Song
Cc: Ryan Roberts
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Muchun Song
Signed-off-by: Barry Song
---
 include/linux/memcontrol.h |  9 ++++++++
 mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c                | 17 ++++++++++++++
 3 files changed, 71 insertions(+)
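A side note for reviewers, not part of the patch: the overflow cycle
described above can be illustrated with a toy userspace model. Each
fault demands one page, but a large-folio swap-in charges all
folio_nr_pages() pages at once, so a nearly full memcg is pushed back
over its limit far more often. The limit and reclaim batch below are
illustrative stand-ins for memory.max and SWAP_CLUSTER_MAX.

/*
 * Toy model of the swap-thrashing argument -- NOT kernel code.
 * Assumption: each fault needs one page, but a large-folio swap-in
 * charges 16 pages (64 KiB), some of which may go unused.
 */
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32         /* pages reclaimed per reclaim pass */
#define LIMIT            (1 << 18)  /* memory.max: 1 GiB in 4 KiB pages */

static long simulate(long faults, int pages_per_fault)
{
	long charged = LIMIT;       /* start with the memcg full */
	long reclaims = 0;

	while (faults--) {
		if (charged + pages_per_fault > LIMIT) {
			charged -= SWAP_CLUSTER_MAX;  /* swap out a cluster */
			reclaims++;
		}
		charged += pages_per_fault;
	}
	return reclaims;
}

int main(void)
{
	/* 4K folios reclaim once per 32 faults; 64K folios every 2nd fault. */
	printf("4K folios:  %ld reclaim passes\n", simulate(1000000, 1));
	printf("64K folios: %ld reclaim passes\n", simulate(1000000, 16));
	return 0;
}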
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 524006313b0d..8bcc8f4af39f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
 int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 		long nr_pages);
 
+int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
+		swp_entry_t *entry);
+
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 		gfp_t gfp, swp_entry_t entry);
 
@@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
+		swp_entry_t *entry)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 		struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17af08367c68..f3d92b93ea6d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 	return 0;
 }
 
+static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
+{
+	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+		if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
+			return false;
+	}
+
+	return true;
+}
+
+/**
+ * mem_cgroup_precharge_large_folio: Precharge large folios.
+ *
+ * @mm: mm context of the victim
+ * @entry: swap entry for which the folio will be allocated
+ *
+ * If we are arriving at the edge of an almost-full memcg, return an
+ * error so that swap-in and anon faults can quickly fall back to small
+ * folios to avoid swap thrashing.
+ *
+ * Returns 0 on success, an error code on failure.
+ */
+int mem_cgroup_precharge_large_folio(struct mm_struct *mm, swp_entry_t *entry)
+{
+	struct mem_cgroup *memcg = NULL;
+	unsigned short id;
+	bool has_margin;
+
+	if (mem_cgroup_disabled())
+		return 0;
+
+	rcu_read_lock();
+	if (entry) {
+		id = lookup_swap_cgroup_id(*entry);
+		memcg = mem_cgroup_from_id(id);
+	}
+	if (!memcg || !css_tryget_online(&memcg->css))
+		memcg = get_mem_cgroup_from_mm(mm);
+	has_margin = mem_cgroup_has_margin(memcg);
+	rcu_read_unlock();
+
+	css_put(&memcg->css);
+	return has_margin ? 0 : -ENOMEM;
+}
+
 /**
  * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
  * @folio: folio to charge.
diff --git a/mm/memory.c b/mm/memory.c
index 0f614523b9f4..96368ba0e8a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4168,6 +4168,16 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 
 	pte_unmap_unlock(pte, ptl);
 
+	if (!orders)
+		goto fallback;
+
+	/*
+	 * Avoid swapping in large folios when memcg is nearly full, as it
+	 * may quickly trigger additional swap-out and swap-in cycles.
+	 */
+	if (mem_cgroup_precharge_large_folio(vma->vm_mm, &entry))
+		goto fallback;
+
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {
@@ -4707,6 +4717,13 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	if (!orders)
 		goto fallback;
 
+	/*
+	 * When memcg is nearly full, large folios can rapidly fill
+	 * the margin and trigger new reclamation.
+	 */
+	if (mem_cgroup_precharge_large_folio(vma->vm_mm, NULL))
+		goto fallback;
+
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {
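A usage note beyond the patch itself: if a kernel build is inconvenient
as a workload, a small anonymous-memory stressor run under the same
systemd-run confinement can also exercise the alloc_anon_folio() and
swap-in paths. This is an illustrative sketch, not part of the patch.

/*
 * Minimal anonymous-memory stressor -- illustrative only. Run inside
 * a constrained memcg, e.g.:
 *   systemd-run --scope -p MemoryMax=1G ./stress 4
 * Touching each 4 KiB page triggers do_anonymous_page(), so with mTHP
 * enabled the kernel attempts large-folio allocation on each fault.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	/* Map more than memory.max to keep the memcg at its limit. */
	size_t gib = argc > 1 ? strtoul(argv[1], NULL, 0) : 4;
	size_t len = gib << 30;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch pages repeatedly to drive faults, reclaim and swap. */
	for (int pass = 0; pass < 4; pass++)
		for (size_t off = 0; off < len; off += 4096)
			buf[off] = (char)pass;

	return 0;
}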