[RFC,v1,0/4] mm: ZSWAP swap-out of mTHP folios

Message ID 20240814062830.26833-1-kanchana.p.sridhar@intel.com (mailing list archive)

Kanchana P Sridhar Aug. 14, 2024, 6:28 a.m. UTC
This RFC patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to v6.10 in patch 3 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs.

For instance, the determination of whether a folio is same-filled is
based on mapping an index into the folio to derive the page. Likewise,
there is a function "zswap_store_entry" added to store a zswap_entry in
the xarray.
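
As an illustration of the index-based same-filled check described above,
here is a minimal userspace sketch. It is not the code from this series:
"struct sketch_folio", "sketch_is_page_same_filled" and SKETCH_PAGE_SIZE
are hypothetical names, the folio is modelled as a plain contiguous
buffer, and a 4 KiB base page size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB base pages */

/* Hypothetical stand-in for a folio: nr_pages contiguous base pages. */
struct sketch_folio {
	unsigned char *data;
	size_t nr_pages;
};

/*
 * Map `index` to a page inside the folio and report whether that page
 * consists of a single repeated unsigned long value. On success the
 * repeated value is stored in *value.
 */
bool sketch_is_page_same_filled(const struct sketch_folio *folio,
				size_t index, unsigned long *value)
{
	const unsigned char *page;
	unsigned long first, word;
	size_t pos, nr_words = SKETCH_PAGE_SIZE / sizeof(unsigned long);

	assert(index < folio->nr_pages);

	/* Derive the page from the index into the folio. */
	page = folio->data + index * SKETCH_PAGE_SIZE;

	/* memcpy avoids alignment and aliasing assumptions on `data`. */
	memcpy(&first, page, sizeof(first));
	for (pos = 1; pos < nr_words; pos++) {
		memcpy(&word, page + pos * sizeof(word), sizeof(word));
		if (word != first)
			return false;
	}

	*value = first;
	return true;
}
```

A caller would loop over indices 0..nr_pages-1 and treat each page
independently, so a large folio can contain a mix of same-filled and
regular pages.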

For testing purposes, per-mTHP size vmstat zswap_store event counters are
added, and incremented upon successful zswap_store of an mTHP.

This patch-series is a precursor to ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent RFC patch-series, with performance improvement data.

Performance Testing:
====================
Testing of this patch-series was done with the v6.10 mainline, without
and with this RFC, on a dual-socket Intel Sapphire Rapids server with 56
cores per socket and 4 IAA devices per socket.

The system has 503 GiB RAM and 176 GiB swap/ZSWAP, with ZRAM as the
backing swap device. Core frequency was fixed at 2500 MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. Following a methodology similar to that in Ryan
Roberts' "Swap-out mTHP without splitting" series [2], 70 usemem
processes were run, each allocating and writing 1G of memory:

    usemem --init-time -w -O -n 70 1g

Other kernel configuration parameters:

    ZSWAP Compressor  : LZ4, DEFLATE-IAA
    ZSWAP Allocator   : ZSMALLOC
    ZRAM Compressor   : LZO-RLE
    SWAP page-cluster : 2

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence each IAA compression
is decompressed internally by the "iaa_crypto" driver, the CRCs returned
by the hardware are compared, and errors are reported in case of
mismatches. Thus "deflate-iaa" helps ensure better data integrity as
compared to the software compressors.

Throughput reported by usemem and perf sys time for running the test were
measured and averaged across 3 runs:

 64KB mTHP:
 ==========
  ----------------------------------------------------------
 |                |               |            |            |
 |Kernel          | mTHP SWAP-OUT | Throughput | Improvement|
 |                |               |       KB/s |            |
 |----------------|---------------|------------|------------|
 |v6.10 mainline  | ZRAM lzo-rle  |    111,180 |   Baseline |
 |zswap-mTHP-RFC  | ZSWAP lz4     |    115,996 |         4% |
 |zswap-mTHP-RFC  | ZSWAP         |            |            |
 |                | deflate-iaa   |    166,048 |        49% |
 |----------------------------------------------------------|
 |                |               |            |            |
 |Kernel          | mTHP SWAP-OUT |   Sys time | Improvement|
 |                |               |        sec |            |
 |----------------|---------------|------------|------------|
 |v6.10 mainline  | ZRAM lzo-rle  |   1,049.69 |   Baseline |
 |zswap-mTHP-RFC  | ZSWAP lz4     |   1,178.20 |       -12% |
 |zswap-mTHP-RFC  | ZSWAP         |            |            |
 |                | deflate-iaa   |     626.12 |        40% |
  ----------------------------------------------------------
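
The "Improvement" columns above appear consistent with a simple relative
change against the v6.10 mainline baseline, rounded to the nearest
percent: relative gain for throughput (higher is better) and relative
reduction for sys time (lower is better). The sketch below is an
assumption about how those figures can be reproduced, not a statement of
how they were generated; the function names are illustrative.

```c
#include <assert.h>

/* Throughput: higher is better, so improvement is the relative gain. */
double throughput_improvement_pct(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (measured - baseline) / baseline * 100.0;
}

/* Sys time: lower is better, so improvement is the relative reduction. */
double systime_improvement_pct(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (baseline - measured) / baseline * 100.0;
}
```

For example, throughput_improvement_pct(111180, 166048) is roughly 49.3
and systime_improvement_pct(1049.69, 1178.20) is roughly -12.2, in line
with the 49% and -12% entries in the 64KB tables.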

  -------------------------------------------------------
 | VMSTATS, mTHP ZSWAP stats, |      v6.10 |  zswap-mTHP |
 | mTHP ZRAM stats:           |   mainline |         RFC |
 |-------------------------------------------------------|
 | pswpin                     |         16 |           0 |
 | pswpout                    |  7,823,984 |           0 |
 | zswpin                     |        551 |         647 |
 | zswpout                    |      1,410 |  15,175,113 |
 |-------------------------------------------------------|
 | thp_swpout                 |          0 |           0 |
 | thp_swpout_fallback        |          0 |           0 |
 | pgmajfault                 |      2,189 |       2,241 |
 |-------------------------------------------------------|
 | zswpout_4kb_folio          |            |       1,497 |
 | mthp_zswpout_64kb          |            |     948,351 |
 |-------------------------------------------------------|
 | hugepages-64kB/stats/swpout|    488,999 |           0 |
  -------------------------------------------------------


 2MB PMD-THP/2048K mTHP:
 =======================
  ----------------------------------------------------------
 |                |               |            |            |
 |Kernel          | mTHP SWAP-OUT | Throughput | Improvement|
 |                |               |       KB/s |            |
 |----------------|---------------|------------|------------|
 |v6.10 mainline  | ZRAM lzo-rle  |    136,617 |   Baseline |
 |zswap-mTHP-RFC  | ZSWAP lz4     |    137,360 |         1% |
 |zswap-mTHP-RFC  | ZSWAP         |            |            |
 |                | deflate-iaa   |    179,097 |        31% |
 |----------------------------------------------------------|
 |                |               |            |            |
 |Kernel          | mTHP SWAP-OUT |   Sys time | Improvement|
 |                |               |        sec |            |
 |----------------|---------------|------------|------------|
 |v6.10 mainline  | ZRAM lzo-rle  |   1,044.40 |   Baseline |
 |zswap-mTHP-RFC  | ZSWAP lz4     |   1,035.79 |         1% |
 |zswap-mTHP-RFC  | ZSWAP         |            |            |
 |                | deflate-iaa   |     571.31 |        45% |
  ----------------------------------------------------------

  ---------------------------------------------------------
 | VMSTATS, mTHP ZSWAP stats,   |      v6.10 |  zswap-mTHP |
 | mTHP ZRAM stats:             |   mainline |         RFC |
 |---------------------------------------------------------|
 | pswpin                       |          0 |           0 |
 | pswpout                      |  8,630,272 |           0 |
 | zswpin                       |        565 |       6,901 |
 | zswpout                      |      1,388 |  15,379,163 |
 |---------------------------------------------------------|
 | thp_swpout                   |     16,856 |           0 |
 | thp_swpout_fallback          |          0 |           0 |
 | pgmajfault                   |      2,184 |       8,532 |
 |---------------------------------------------------------|
 | zswpout_4kb_folio            |            |       5,851 |
 | mthp_zswpout_2048kb          |            |      30,026 |
 | zswpout_pmd_thp_folio        |            |      30,026 |
 |---------------------------------------------------------|
 | hugepages-2048kB/stats/swpout|     16,856 |           0 |
  ---------------------------------------------------------
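
The counters in the two vmstat tables above can be cross-checked against
each other. The numbers are consistent with zswpout and pswpout counting
base pages while the per-size counters count folios, assuming 4 KiB base
pages (16 pages per 64KB folio, 512 pages per 2MB folio). The helper
names below are illustrative only, and this reading of the counters is
inferred from the reported values.

```c
#include <assert.h>

/* Base pages in one folio of the given size, assuming 4 KiB base pages. */
unsigned long pages_per_folio(unsigned long folio_kb)
{
	assert(folio_kb >= 4 && folio_kb % 4 == 0);
	return folio_kb / 4;
}

/*
 * Expected page-level swap-out count, given the number of 4 KiB folios
 * and the number of mTHP folios of size mthp_kb that were swapped out.
 */
unsigned long expected_swpout_pages(unsigned long folios_4kb,
				    unsigned long mthp_folios,
				    unsigned long mthp_kb)
{
	return folios_4kb + mthp_folios * pages_per_folio(mthp_kb);
}
```

With the RFC numbers, expected_swpout_pages(1497, 948351, 64) gives
15,175,113 and expected_swpout_pages(5851, 30026, 2048) gives
15,379,163, matching the reported zswpout values. For mainline,
488,999 * 16 = 7,823,984 and 16,856 * 512 = 8,630,272 match the reported
pswpout values.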

As expected, the "Before" experiment (v6.10 mainline) shows relatively
fewer swapouts, because ZRAM utilization is not accounted in the cgroup.

With the introduction of zswap_store mTHP, the "After" data
(zswap-mTHP-RFC) reflects the higher swapout activity, and consequent
sys time degradation.

Our goal is to improve ZSWAP mTHP store performance using batching. With
Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
additional RFC series), we are able to demonstrate significant
performance improvements with IAA as compared to software compressors.

For instance, with IAA-Canned compression [3] used with batching of
zswap_stores and of zswap_loads, the usemem throughput, averaged over 3
runs, improves to 170,461 KB/s (64KB mTHP) and 188,325 KB/s (2MB THP).

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/

Kanchana P Sridhar (4):
  mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
  mm: vmstat: Per mTHP-size zswap_store vmstat event counters.
  mm: zswap: zswap_store() extended to handle mTHP folios.
  mm: page_io: Count successful mTHP zswap stores in vmstat.

 include/linux/vm_event_item.h |  15 +++
 mm/page_io.c                  |  44 +++++++
 mm/vmstat.c                   |  15 +++
 mm/zswap.c                    | 223 ++++++++++++++++++++++++----------
 4 files changed, 233 insertions(+), 64 deletions(-)