From patchwork Mon Aug 19 02:16:17 2024
From: Kanchana P Sridhar
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
    yosryahmed@google.com, nphamcs@gmail.com, ryan.roberts@arm.com,
    ying.huang@intel.com, 21cnbao@gmail.com, akpm@linux-foundation.org
Cc: nanhai.zou@intel.com, wajdi.k.feghali@intel.com, vinodh.gopal@intel.com,
    kanchana.p.sridhar@intel.com
Subject: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
Date: Sun, 18 Aug 2024 19:16:17 -0700
Message-Id: <20240819021621.29125-1-kanchana.p.sridhar@intel.com>

Hi All,

This patch-series enables zswap_store() to accept and store mTHP folios.
The most significant contribution in this series is from the earlier RFC
submitted by Ryan Roberts [1]. Ryan's original RFC has been migrated to
v6.11-rc3 in patch 2/4 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, there is an attempt to modularize some of the functionality
in zswap_store() to make it more amenable to supporting any-order mTHPs.
For instance, the determination of whether a folio is same-filled is based
on mapping an index into the folio to derive the page. Likewise, a
"zswap_store_entry" function is added to store a zswap_entry in the
xarray.

For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon a successful zswap_store of
an mTHP folio:

  /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
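To show how these pieces might fit together, here is a minimal,
hypothetical C sketch of the per-page store loop. The helper names and
signatures (zswap_is_folio_same_filled() taking an index,
zswap_entry_create_same_filled(), zswap_compress_page()) are illustrative
assumptions based on the description above, not the actual code in this
series; error unwinding, objcg charging and cgroup/zswap limit checks are
omitted.

/*
 * Illustrative sketch only -- not the code in this series. It assumes
 * helpers shaped like the ones described above: a same-filled check that
 * takes a page index within the folio, a zswap_store_entry() that inserts
 * a zswap_entry into the xarray, and the new MTHP_STAT_ZSWPOUT counter.
 */
static bool zswap_store_folio_sketch(struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);
	pgoff_t offset = swp_offset(folio->swap);
	long index;

	for (index = 0; index < nr_pages; index++) {
		struct zswap_entry *entry;
		unsigned long value;

		/* Same-filled detection works on one page of the folio. */
		if (zswap_is_folio_same_filled(folio, index, &value))
			entry = zswap_entry_create_same_filled(value);
		else
			entry = zswap_compress_page(folio, index);
		if (!entry)
			return false;

		/* Insert the entry into the swap-offset indexed xarray. */
		if (!zswap_store_entry(offset + index, entry))
			return false;
	}

	/* Per-order sysfs counter, bumped once per successfully stored folio. */
	count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
	return true;
}

This is only meant to convey the structure; the zswap_store() changes in
patch 2/4 are the authoritative implementation.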
This patch-series is a precursor to ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent RFC patch-series, with performance improvement data.

Thanks to Ying Huang for pre-posting review feedback and suggestions!

Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
   Thanks to Barry for suggesting aligning with Ryan Roberts' latest
   changes to count_mthp_stat() so that it's always defined, even when THP
   is disabled. Barry, I have also made one other change in page_io.c
   where count_mthp_stat() is called by count_swpout_vm_event(). I would
   appreciate it if you can review this. Thanks! Hopefully this should
   resolve the kernel robot build errors.

Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
   as suggested by Ying Huang. Ying, I would appreciate it if you can
   review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
   the kernel test robot build errors.
3) No code changes to the individual patches themselves.

Changes since RFC v1:
=====================
1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
   Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
   Ryan's initial RFC [1]:
   - Added a comment about the cgroup zswap limit checks occurring once
     per folio at the beginning of zswap_store(). Nhat, Ryan, please do
     let me know if the comments convey the summary from the RFC
     discussion. Thanks!
   - Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.

Performance Testing:
====================
Testing of this patch-series was done with the v6.11-rc3 mainline, without
and with this patch-series, on an Intel Sapphire Rapids server,
dual-socket with 56 cores per socket and 4 IAA devices per socket. The
system has 503 GiB RAM, with a 4G SSD as the backing swap device for
ZSWAP. Core frequency was fixed at 2500 MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high was
fixed. Following a similar methodology as in Ryan Roberts' "Swap-out mTHP
without splitting" series [2], 70 usemem processes were run, each
allocating and writing 1G of memory:

    usemem --init-time -w -O -n 70 1g

Since the 4G SSD constrained how much swapout activity the 70 usemem
processes could generate, I ended up using different cgroup memory.high
fixed limits for the experiments with 64K mTHP and 2M THP:

    64K mTHP experiments: cgroup memory fixed at 60G
    2M THP experiments  : cgroup memory fixed at 55G

The vm/sysfs stats included after the performance data provide details on
the swapout activity to SSD/ZSWAP.

Other kernel configuration parameters:

    ZSWAP Compressor  : LZ4, DEFLATE-IAA
    ZSWAP Allocator   : ZSMALLOC
    SWAP page-cluster : 2

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled: each IAA compression is
decompressed internally by the "iaa_crypto" driver, the CRCs returned by
the hardware are compared, and errors are reported in case of a mismatch.
Thus "deflate-iaa" helps ensure better data integrity as compared to the
software compressors.
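To make "compression verification" concrete, the following is a small
conceptual sketch; iaa_compress(), iaa_decompress() and the CRC plumbing
are placeholder names standing in for the iaa_crypto driver internals,
not its real API.

/*
 * Conceptual sketch of the verification step described above; the iaa_*()
 * helpers are placeholders, not the real iaa_crypto API. Each compression
 * is decompressed again inside the driver and the hardware-reported CRCs
 * of the source and the round-tripped data are compared before the
 * compressed buffer is accepted.
 */
static int iaa_compress_verify_sketch(const void *src, size_t src_len,
				      void *dst, size_t *dst_len,
				      void *scratch)
{
	size_t verify_len = src_len;
	u32 crc_src, crc_verify;
	int ret;

	ret = iaa_compress(src, src_len, dst, dst_len, &crc_src);
	if (ret)
		return ret;

	/* Round-trip the freshly produced output into a scratch buffer. */
	ret = iaa_decompress(dst, *dst_len, scratch, &verify_len, &crc_verify);
	if (ret)
		return ret;

	/* Reject the compression if the round-trip does not match. */
	if (crc_src != crc_verify || verify_len != src_len)
		return -EIO;

	return 0;
}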
Throughput reported by usemem and perf sys time for running the test are
as follows, averaged across 3 runs:

64KB mTHP (cgroup memory.high set to 60G):
==========================================

 ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
 |                    |                   |    KB/s    |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |    335,346 |  Baseline  |
 |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |    -19%    |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |     16%    |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |  Sys time  | Improvement|
 |                    |                   |    sec     |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |      91.37 |  Baseline  |
 |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |   -191%    |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |   -158%    |
 ------------------------------------------------------------------

 -----------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 | zswap-mTHP |  zswap-mTHP |
 |                              |   mainline |      Store |       Store |
 |                              |            |        lz4 | deflate-iaa |
 |-----------------------------------------------------------------------|
 | pswpin                       |          0 |          0 |           0 |
 | pswpout                      |    174,432 |          0 |           0 |
 | zswpin                       |        703 |        534 |         721 |
 | zswpout                      |      1,501 |  1,491,654 |   1,398,805 |
 |-----------------------------------------------------------------------|
 | thp_swpout                   |          0 |          0 |           0 |
 | thp_swpout_fallback          |          0 |          0 |           0 |
 | pgmajfault                   |      3,364 |      3,650 |       3,431 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |     63,200 |      63,244 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |     10,902 |          0 |           0 |
 -----------------------------------------------------------------------

2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
=======================================================

 ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
 |                    |                   |    KB/s    |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |    190,827 |  Baseline  |
 |zswap-mTHP-Store    | ZSWAP lz4         |     32,026 |    -83%    |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    203,772 |      7%    |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |  Sys time  | Improvement|
 |                    |                   |    sec     |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |      27.23 |  Baseline  |
 |zswap-mTHP-Store    | ZSWAP lz4         |     156.52 |   -475%    |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     171.45 |   -530%    |
 ------------------------------------------------------------------

 -------------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 | zswap-mTHP |  zswap-mTHP |
 |                                |   mainline |      Store |       Store |
 |                                |            |        lz4 | deflate-iaa |
 |-------------------------------------------------------------------------|
 | pswpin                         |          0 |          0 |           0 |
 | pswpout                        |    797,184 |          0 |           0 |
 | zswpin                         |        690 |        649 |         669 |
 | zswpout                        |      1,465 |  1,596,382 |   1,540,766 |
 |-------------------------------------------------------------------------|
 | thp_swpout                     |      1,557 |          0 |           0 |
 | thp_swpout_fallback            |          0 |      3,248 |       3,752 |
 | pgmajfault                     |      3,726 |      6,470 |       5,691 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/zswpout |            |      2,416 |       2,261 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/swpout  |      1,557 |          0 |           0 |
 -------------------------------------------------------------------------
In the "Before" scenario, where zswap does not store mTHP, only the
allocations count towards the cgroup memory limit. In the "After"
scenario, with zswap_store of mTHP introduced, both the allocations and
the zswap usage count towards the memory limit. As a result, we see higher
swapout activity in the "After" data, and a consequent sys time
degradation.

We do observe considerable throughput improvement in the "After" data when
DEFLATE-IAA is the zswap compressor. This observation holds for the 64K
mTHP and 2MB THP experiments. It can be attributed to IAA's better
compress/decompress latency and compression ratio as compared to software
compressors.

In my opinion, even though the test setup does not provide an accurate way
for a direct before/after comparison (because zswap usage is counted in
the cgroup, hence towards memory.high), it still seems reasonable for
zswap_store to support mTHP, so that further performance improvements can
be implemented. One of the ideas that has shown promise in our experiments
is to improve ZSWAP mTHP store performance using batching. With IAA
compress/decompress batching used in ZSWAP, we are able to demonstrate
significant performance improvements and memory savings with IAA in
scalability experiments, as compared to software compressors. We hope to
submit this work as subsequent RFCs.

cgroup zswap kselftest with 4G SSD as zswap's backing device:
=============================================================
mTHP 64K set to 'always'
zswap compressor set to 'lz4'
page-cluster = 3

"Before":
=========
Test run with v6.11-rc3 and no code changes:

zswap shrinker_enabled = Y:
---------------------------
not ok 1 test_zswap_usage
not ok 2 test_swapin_nozswap
not ok 3 test_zswapin
# Failed to reclaim all of the requested memory
not ok 4 test_zswap_writeback_enabled
# Failed to reclaim all of the requested memory
not ok 5 test_zswap_writeback_disabled
ok 6 # SKIP test_no_kmem_bypass
not ok 7 test_no_invasive_cgroup_shrink

"After":
========
Test run with this patch-series and v6.11-rc3:

zswap shrinker_enabled = Y:
---------------------------
ok 1 test_zswap_usage
not ok 2 test_swapin_nozswap
ok 3 test_zswapin
ok 4 test_zswap_writeback_enabled
ok 5 test_zswap_writeback_disabled
ok 6 # SKIP test_no_kmem_bypass
not ok 7 test_no_invasive_cgroup_shrink

I haven't taken an in-depth look into the cgroup zswap tests, but it looks
like the results with the patch-series are no worse than without, and in
some cases better (this needs more analysis).

I would greatly appreciate your code review comments and suggestions!

Thanks,
Kanchana

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/

Kanchana P Sridhar (4):
  mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
  mm: zswap: zswap_store() extended to handle mTHP folios.
  mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
  mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
 include/linux/huge_mm.h |   1 +
 mm/huge_memory.c        |   3 +
 mm/page_io.c            |   3 +-
 mm/zswap.c              | 238 +++++++++++++++++++++++++++++-----------
 4 files changed, 180 insertions(+), 65 deletions(-)

base-commit: 8c0b4f7b65fd1ca7af01267f491e815a40d77444