From patchwork Wed Jan 8 02:16:49 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 13930000
From: Baolin Wang
To: akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, david@redhat.com, wangkefeng.wang@huawei.com,
 kasong@tencent.com, ying.huang@linux.alibaba.com, 21cnbao@gmail.com,
 ryan.roberts@arm.com, baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH v3] mm: shmem: skip swapcache for swapin of synchronous swap device
Date: Wed, 8 Jan 2025 10:16:49 +0800
Message-Id:
 <3d9f3bd3bc6ec953054baff5134f66feeaae7c1e.1736301701.git.baolin.wang@linux.alibaba.com>
X-Mailer: git-send-email 2.39.3

With fast swap devices (such as zram), swapin latency is crucial to
applications. For shmem swapin, similar to anonymous memory swapin, we
can skip the swapcache operation to improve swapin latency.
Testing 1G shmem sequential swapin without THP enabled, I observed
approximately a 6% performance improvement:
(Note: I repeated each test 5 times and took the mean.)

w/o patch    w/ patch    changes
534.8ms      501ms       +6.3%

In addition, we currently always split the large swap entry stored in the
shmem mapping during shmem large folio swapin, which is not ideal,
especially with a fast swap device. If the swap device is a synchronous
device, we should swap in the whole large folio instead of splitting the
precious large folio, to take advantage of large folios and improve the
swapin latency. This is similar to anonymous memory mTHP swapin.
Testing 1G shmem sequential swapin with 64K mTHP and 2M mTHP, I observed
an obvious performance improvement:

mTHP=64K
w/o patch    w/ patch    changes
550.4ms      169.6ms     +69%

mTHP=2M
w/o patch    w/ patch    changes
542.8ms      126.8ms     +77%

Note that skipping the swapcache requires attention to concurrent swapin
scenarios. Fortunately, swapcache_prepare() and shmem_add_to_page_cache()
can help identify concurrent swapin and large swap entry split scenarios,
and return -EEXIST for retry.

Signed-off-by: Baolin Wang
---
Changes from v2:
 - Wrap at 80 columns for function definition, per Matthew.

Changes from v1:
 - Skip calling delete_from_swap_cache() in shmem_set_folio_swapin_error()
   when skipping the swapcache.
---
 mm/shmem.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 107 insertions(+), 5 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 95b80c24f6f9..8c9e0e7408b7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1967,6 +1967,67 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
 	return ERR_PTR(error);
 }
 
+static struct folio *shmem_swap_alloc_folio(struct inode *inode,
+		struct vm_area_struct *vma, pgoff_t index,
+		swp_entry_t entry, int order, gfp_t gfp)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct folio *new;
+	void *shadow;
+	int nr_pages;
+
+	/*
+	 * We have arrived here because our zones are constrained, so don't
+	 * limit chance of success by further cpuset and node constraints.
+	 */
+	gfp &= ~GFP_CONSTRAINT_MASK;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (order > 0) {
+		gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+
+		gfp = limit_gfp_mask(huge_gfp, gfp);
+	}
+#endif
+
+	new = shmem_alloc_folio(gfp, order, info, index);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	nr_pages = folio_nr_pages(new);
+	if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
+					   gfp, entry)) {
+		folio_put(new);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Prevent parallel swapin from proceeding with the swap cache flag.
+	 *
+	 * Of course there is another possible concurrent scenario as well,
+	 * that is to say, the swap cache flag of a large folio has already
+	 * been set by swapcache_prepare(), while another thread may have
+	 * already split the large swap entry stored in the shmem mapping.
+	 * In this case, shmem_add_to_page_cache() will help identify the
+	 * concurrent swapin and return -EEXIST.
+	 */
+	if (swapcache_prepare(entry, nr_pages)) {
+		folio_put(new);
+		return ERR_PTR(-EEXIST);
+	}
+
+	__folio_set_locked(new);
+	__folio_set_swapbacked(new);
+	new->swap = entry;
+
+	mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
+	shadow = get_shadow_from_swap_cache(entry);
+	if (shadow)
+		workingset_refault(new, shadow);
+	folio_add_lru(new);
+	swap_read_folio(new, NULL);
+	return new;
+}
+
 /*
  * When a page is moved from swapcache to shmem filecache (either by the
  * usual swapin of shmem_get_folio_gfp(), or by the less common swapoff of
@@ -2070,7 +2131,8 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 }
 
 static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
-					 struct folio *folio, swp_entry_t swap)
+					 struct folio *folio, swp_entry_t swap,
+					 bool skip_swapcache)
 {
 	struct address_space *mapping = inode->i_mapping;
 	swp_entry_t swapin_error;
@@ -2086,7 +2148,8 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 
 	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
-	delete_from_swap_cache(folio);
+	if (!skip_swapcache)
+		delete_from_swap_cache(folio);
 	/*
 	 * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
 	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
@@ -2190,6 +2253,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
+	bool skip_swapcache = false;
 	swp_entry_t swap;
 	int error, nr_pages;
 
@@ -2211,6 +2275,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
+		int order = xa_get_order(&mapping->i_pages, index);
+		bool fallback_order0 = false;
 		int split_order;
 
 		/* Or update major stats only when swapin succeeds?? */
@@ -2220,6 +2286,33 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
 
+		/*
+		 * If uffd is active for the vma, we need per-page fault
+		 * fidelity to maintain the uffd semantics, then fallback
+		 * to swapin order-0 folio, as well as for zswap case.
+		 */
+		if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
+				  !zswap_never_enabled()))
+			fallback_order0 = true;
+
+		/* Skip swapcache for synchronous device. */
+		if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+			if (!IS_ERR(folio)) {
+				skip_swapcache = true;
+				goto alloced;
+			}
+
+			/*
+			 * Fallback to swapin order-0 folio unless the swap entry
+			 * already exists.
+			 */
+			error = PTR_ERR(folio);
+			folio = NULL;
+			if (error == -EEXIST)
+				goto failed;
+		}
+
 		/*
 		 * Now swap device can only swap in order 0 folio, then we
 		 * should split the large swap entry stored in the pagecache
@@ -2250,9 +2343,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		}
 	}
 
+alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
-	if (!folio_test_swapcache(folio) ||
+	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
 	    folio->swap.val != swap.val ||
 	    !shmem_confirm_swap(mapping, index, swap)) {
 		error = -EEXIST;
@@ -2288,7 +2382,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
-	delete_from_swap_cache(folio);
+	if (skip_swapcache) {
+		folio->swap.val = 0;
+		swapcache_clear(si, swap, nr_pages);
+	} else {
+		delete_from_swap_cache(folio);
+	}
 	folio_mark_dirty(folio);
 	swap_free_nr(swap, nr_pages);
 	put_swap_device(si);
@@ -2299,8 +2398,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (!shmem_confirm_swap(mapping, index, swap))
 		error = -EEXIST;
 	if (error == -EIO)
-		shmem_set_folio_swapin_error(inode, index, folio, swap);
+		shmem_set_folio_swapin_error(inode, index, folio, swap,
+					     skip_swapcache);
 unlock:
+	if (skip_swapcache)
+		swapcache_clear(si, swap, folio_nr_pages(folio));
 	if (folio) {
 		folio_unlock(folio);
 		folio_put(folio);