From patchwork Wed Sep 25 17:52:41 2024
X-Patchwork-Submitter: Kairui Song <ryncsn@gmail.com>
X-Patchwork-Id: 13812380
From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
    Yosry Ahmed, Kalesh Singh, "Huang, Ying", linux-kernel@vger.kernel.org,
    Kairui Song
Subject: [PATCH] mm, swap: avoid over reclaim of full clusters
Date: Thu, 26 Sep 2024 01:52:41 +0800
Message-ID: <20240925175241.46679-1-ryncsn@gmail.com>
X-Mailer: git-send-email 2.46.1
From: Kairui Song

When running low on usable slots, the cluster allocator will try to
reclaim full clusters aggressively to recover HAS_CACHE slots. This
guarantees that as long as there are any usable slots, HAS_CACHE or
not, the swap device will remain usable and the workload won't go OOM.

Before the cluster allocator, the swap allocator failed easily when
the device was filled up with reclaimable HAS_CACHE slots, which can
be easily simulated with the following simple program:

#include <stdio.h>
#include <string.h>
#include <linux/mman.h>
#include <sys/mman.h>
#define SIZE (8192UL * 1024UL * 1024UL)
int main(int argc, char **argv) {
    long tmp;
    char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(p, 0, SIZE);
    madvise(p, SIZE, MADV_PAGEOUT);
    for (unsigned long i = 0; i < SIZE; ++i)
        tmp += p[i];
    getchar(); /* Pause */
    return 0;
}

Set up an 8G non-ramdisk swap: the first run of the program can swap
out 8G of RAM successfully. But if the same program is run again
without exiting the previous instance, it can't swap out all 8G of
memory, as nearly half of the swap device is occupied by HAS_CACHE
slots. The old allocator had a random scan that might reclaim part of
the HAS_CACHE slots by luck, but it was unreliable.

The new allocator added reclaim of full clusters, but it has a hidden
pitfall: when multiple CPUs see that the device is low on usable
slots, they run into a thundering herd problem, causing a performance
drop and a delayed OOM kill. This is observable on large machines
with massively parallel workloads.

Test using 128G ZRAM on a 96-CPU system: first fill the swap device
with 124G of placeholder data, then build a Linux kernel with
make -j96 in a 1G memory cgroup. The remaining 4G of swap space is
far from enough for the workload, but the OOM killer won't kick in
until the workload has hung for about 5 minutes.

Full cluster reclaim is slower on a large device, and every CPU drops
si->lock during reclaim, allowing other CPUs to re-enter the reclaim
path. As a result, all CPUs end up busy doing the full reclaim.
Besides, the current trigger condition for full reclaim is too
lenient (available slots < cluster size, which was supposed to ensure
mTHP users won't fail due to HAS_CACHE), making things riskier.

So, to ensure only one aggressive full cluster reclaim is issued when
the device is nearly full, offload the aggressive reclaim to a
kworker instead. This still ensures that, in the worst case, the
device won't become unusable because of HAS_CACHE slots.

And to avoid allocations (especially higher order ones) suffering
from HAS_CACHE filling up clusters while the kworker is not
responsive enough, do one synchronous scan every time the free list
is drained, and only scan one cluster. This keeps the full clusters
rotated and reclaimed with minimal latency, and provides a fair
reclaim strategy for most workloads.
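For readers less familiar with the deferred-work pattern this fix
relies on, below is a minimal, self-contained sketch of the workqueue
usage. The demo_* names are hypothetical stand-ins, not symbols from
this patch; the real embedding into struct swap_info_struct is shown
in the diff that follows.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/spinlock.h>

/* Hypothetical device state; the patch embeds reclaim_work in
 * struct swap_info_struct the same way. */
struct demo_device {
	spinlock_t lock;
	struct work_struct reclaim_work;
};

static struct demo_device demo_dev;

/* Worker body: container_of() recovers the enclosing structure from
 * the work_struct pointer, just like swap_reclaim_work() below. */
static void demo_reclaim_work(struct work_struct *work)
{
	struct demo_device *dev =
		container_of(work, struct demo_device, reclaim_work);

	spin_lock(&dev->lock);
	/* ... the slow, aggressive reclaim would go here ... */
	spin_unlock(&dev->lock);
}

static int __init demo_init(void)
{
	spin_lock_init(&demo_dev.lock);
	INIT_WORK(&demo_dev.reclaim_work, demo_reclaim_work);

	/* Fast path: queue the work instead of reclaiming inline.
	 * schedule_work() returns false and does nothing if the work
	 * is already pending, so concurrent CPUs hitting this path
	 * cause at most one worker run. */
	schedule_work(&demo_dev.reclaim_work);
	return 0;
}

static void __exit demo_exit(void)
{
	/* Wait for any queued run before teardown, as swapoff does
	 * with flush_work(&p->reclaim_work). */
	flush_work(&demo_dev.reclaim_work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The pending check inside schedule_work() is what collapses many
concurrent triggers into a single reclaim pass, avoiding the
thundering herd described above.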
Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim")
Signed-off-by: Kairui Song
---
 include/linux/swap.h |  1 +
 mm/swapfile.c        | 49 +++++++++++++++++++++++++++-----------------
 2 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ca533b478c21..f3e0ac20c2e8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -335,6 +335,7 @@ struct swap_info_struct {
 					 * list.
 					 */
 	struct work_struct discard_work; /* discard worker */
+	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0cded32414a1..3d9ce12fa95e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -730,15 +730,16 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 	return offset;
 }
 
-static void swap_reclaim_full_clusters(struct swap_info_struct *si)
+/* Reclaim full clusters; scan the whole device if @force is set */
+static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 {
 	long to_scan = 1;
 	unsigned long offset, end;
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
-	int nr_reclaim, total_reclaimed = 0;
+	int nr_reclaim;
 
-	if (atomic_long_read(&nr_swap_pages) <= SWAPFILE_CLUSTER)
+	if (force)
 		to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
 
 	while (!list_empty(&si->full_clusters)) {
@@ -748,28 +749,36 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si)
 		end = min(si->max, offset + SWAPFILE_CLUSTER);
 		to_scan--;
 
+		spin_unlock(&si->lock);
 		while (offset < end) {
 			if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
-				spin_unlock(&si->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY | TTRS_DIRECT);
-				spin_lock(&si->lock);
-				if (nr_reclaim > 0) {
-					offset += nr_reclaim;
-					total_reclaimed += nr_reclaim;
-					continue;
-				} else if (nr_reclaim < 0) {
-					offset += -nr_reclaim;
+				if (nr_reclaim) {
+					offset += abs(nr_reclaim);
 					continue;
 				}
 			}
 			offset++;
 		}
-		if (to_scan <= 0 || total_reclaimed)
+		spin_lock(&si->lock);
+
+		if (to_scan <= 0)
 			break;
 	}
 }
 
+static void swap_reclaim_work(struct work_struct *work)
+{
+	struct swap_info_struct *si;
+
+	si = container_of(work, struct swap_info_struct, reclaim_work);
+
+	spin_lock(&si->lock);
+	swap_reclaim_full_clusters(si, true);
+	spin_unlock(&si->lock);
+}
+
 /*
  * Try to get swap entries with specified order from current cpu's swap entry
  * pool (a cluster). This might involve allocating a new cluster for current CPU
@@ -799,6 +808,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			goto done;
 	}
 
+	/* Try reclaim from full clusters if free clusters list is drained */
+	if (vm_swap_full())
+		swap_reclaim_full_clusters(si, false);
+
 	if (order < PMD_ORDER) {
 		unsigned int frags = 0;
 
@@ -880,13 +893,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 done:
-	/* Try reclaim from full clusters if device is nearfull */
-	if (vm_swap_full() && (!found || (si->pages - si->inuse_pages) < SWAPFILE_CLUSTER)) {
-		swap_reclaim_full_clusters(si);
-		if (!found && !order && si->pages != si->inuse_pages)
-			goto new_cluster;
-	}
-
 	cluster->next[order] = offset;
 	return found;
 }
@@ -921,6 +927,9 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 		si->lowest_bit = si->max;
 		si->highest_bit = 0;
 		del_from_avail_list(si);
+
+		if (vm_swap_full())
+			schedule_work(&si->reclaim_work);
 	}
 }
 
@@ -2815,6 +2824,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	wait_for_completion(&p->comp);
 
 	flush_work(&p->discard_work);
+	flush_work(&p->reclaim_work);
 
 	destroy_swap_extents(p);
 	if (p->flags & SWP_CONTINUED)
@@ -3375,6 +3385,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		return PTR_ERR(si);
 
 	INIT_WORK(&si->discard_work, swap_discard_work);
+	INIT_WORK(&si->reclaim_work, swap_reclaim_work);
 
 	name = getname(specialfile);
 	if (IS_ERR(name)) {
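A note on the trigger used at both new call sites: vm_swap_full() is
not an "almost out of space" test but a 50%-usage heuristic. Assuming
its definition in include/linux/swap.h is unchanged by this patch, it
reads roughly:

/* Swap 50% full? Release swapcache more aggressively.. */
#define vm_swap_full() (atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages)

So the kworker may be scheduled well before the device is truly
exhausted; the synchronous path in cluster_alloc_swap_entry()
compensates by scanning only one cluster at a time, keeping allocation
latency minimal.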