[v4,0/4] mm: ZSWAP swap-out of mTHP folios

Message ID 20240819021621.29125-1-kanchana.p.sridhar@intel.com (mailing list archive)

Kanchana P Sridhar Aug. 19, 2024, 2:16 a.m. UTC
Hi All,

This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the 
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to v6.11-rc3 in patch 2/4 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs.

For instance, the check for whether a folio is same-filled now maps an
index into the folio to derive the page to test. Likewise, a function
"zswap_store_entry" has been added to store a zswap_entry in the
xarray.

For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
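Once the series is applied, the new counters can be read with a small
helper along these lines (a sketch; it simply skips any hugepage size
that does not expose the stat):

```shell
#!/bin/sh
# Print the per-order zswpout counter for each mTHP size that exposes one
# (paths as added by this series).
dump_zswpout() {
    for d in /sys/kernel/mm/transparent_hugepage/hugepages-*kB; do
        [ -r "$d/stats/zswpout" ] || continue
        printf '%s zswpout: %s\n' "${d##*/}" "$(cat "$d/stats/zswpout")"
    done
}
dump_zswpout
```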

This patch-series is a precursor to ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent RFC patch-series, with performance improvement data.

Thanks to Ying Huang for pre-posting review feedback and suggestions!

Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
   Thanks to Barry for suggesting aligning with Ryan Roberts' latest
   changes to count_mthp_stat() so that it's always defined, even when THP
   is disabled. Barry, I have also made one other change in page_io.c
   where count_mthp_stat() is called by count_swpout_vm_event(). I would
   appreciate it if you can review this. Thanks!
   Hopefully this should resolve the kernel robot build errors.

Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
   as suggested by Ying Huang. Ying, I would appreciate it if you can
   review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
   the kernel test robot build errors.
3) No code changes to the individual patches themselves.

Changes since RFC v1:
=====================

1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
   Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
   Ryan's initial RFC [1]:
   - Added a comment about the cgroup zswap limit checks occurring once per
     folio at the beginning of zswap_store().
     Nhat, Ryan, please do let me know if the comments convey the summary
     from the RFC discussion. Thanks!
   - Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.

Performance Testing:
====================
Testing of this patch-series was done with the v6.11-rc3 mainline, without
and with this patch-series, on an Intel Sapphire Rapids server,
dual-socket 56 cores per socket, 4 IAA devices per socket.

The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
ZSWAP. Core frequency was fixed at 2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed. Following a similar methodology as in Ryan Roberts'
"Swap-out mTHP without splitting" series [2], 70 usemem processes were
run, each allocating and writing 1G of memory:

    usemem --init-time -w -O -n 70 1g

Since the 4G SSD constrained how much swapout activity the 70 usemem
processes could generate, I ended up using different fixed cgroup
memory.high limits for the experiments with 64K mTHP and 2M THP:

64K mTHP experiments: cgroup memory fixed at 60G
2M THP experiments  : cgroup memory fixed at 55G
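As a minimal sketch of the methodology above (cgroup v2 paths assumed;
"usemem-test" is a hypothetical cgroup name, and usemem comes from the
vm-scalability suite):

```shell
# Create a cgroup with a fixed memory.high, move the shell into it,
# then run the 70 usemem processes from within that cgroup.
mkdir -p /sys/fs/cgroup/usemem-test
echo 60G > /sys/fs/cgroup/usemem-test/memory.high   # 55G for the 2M THP runs
echo $$  > /sys/fs/cgroup/usemem-test/cgroup.procs
usemem --init-time -w -O -n 70 1g
```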

The vm/sysfs stats included after the performance data provide details
on the swapout activity to SSD/ZSWAP.

Other kernel configuration parameters:

    ZSWAP Compressor  : LZ4, DEFLATE-IAA
    ZSWAP Allocator   : ZSMALLOC
    SWAP page-cluster : 2
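These parameters correspond to the standard zswap module parameters and
the page-cluster sysctl, e.g.:

```shell
# Select the zswap compressor and allocator, and the swap readahead order.
echo lz4      > /sys/module/zswap/parameters/compressor   # or: deflate-iaa
echo zsmalloc > /sys/module/zswap/parameters/zpool
echo 2        > /proc/sys/vm/page-cluster                 # SWAP page-cluster
```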

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence each IAA compression
is decompressed internally by the "iaa_crypto" driver, the CRCs
returned by the hardware are compared, and errors are reported on
mismatch. Thus "deflate-iaa" helps ensure better data integrity than
the software compressors.
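If my reading of the iaa_crypto documentation is right, compression
verification is controlled by a driver attribute (enabled by default):

```shell
# Enable internal decompress-and-CRC-check of every IAA compression
# (path per the iaa_crypto driver documentation).
echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
```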

Throughput reported by usemem and perf sys time for running the test
are as follows, averaged across 3 runs:

 64KB mTHP (cgroup memory.high set to 60G):
 ==========================================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
  ------------------------------------------------------------------

  -----------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 |                              |   mainline |       Store |       Store |
 |                              |            |         lz4 | deflate-iaa |
 |-----------------------------------------------------------------------|
 | pswpin                       |          0 |           0 |           0 |
 | pswpout                      |    174,432 |           0 |           0 |
 | zswpin                       |        703 |         534 |         721 |
 | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
 |-----------------------------------------------------------------------|
 | thp_swpout                   |          0 |           0 |           0 |
 | thp_swpout_fallback          |          0 |           0 |           0 |
 | pgmajfault                   |      3,364 |       3,650 |       3,431 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
  -----------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
 =======================================================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |    190,827 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     32,026 |       -83% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    203,772 |         7% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |      27.23 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     156.52 |      -475% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     171.45 |      -530% |
  ------------------------------------------------------------------

  ------------------------------------------------------------------------- 
 | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 |                                |   mainline |       Store |       Store |
 |                                |            |         lz4 | deflate-iaa |
 |-------------------------------------------------------------------------|
 | pswpin                         |          0 |           0 |           0 |
 | pswpout                        |    797,184 |           0 |           0 |
 | zswpin                         |        690 |         649 |         669 |
 | zswpout                        |      1,465 |   1,596,382 |   1,540,766 |
 |-------------------------------------------------------------------------|
 | thp_swpout                     |      1,557 |           0 |           0 |
 | thp_swpout_fallback            |          0 |       3,248 |       3,752 |
 | pgmajfault                     |      3,726 |       6,470 |       5,691 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/zswpout |            |       2,416 |       2,261 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/swpout  |      1,557 |           0 |           0 |
  -------------------------------------------------------------------------

In the "Before" scenario, when zswap does not store mTHP, only the
usemem allocations count towards the cgroup memory limit. In the
"After" scenario, with the introduction of zswap_store of mTHP, both
the allocations and the zswap usage count towards the memory limit. As
a result, we see higher swapout activity in the "After" data, and a
consequent sys time degradation.

We do observe considerable throughput improvement in the "After" data
when DEFLATE-IAA is the zswap compressor. This observation holds for
64K mTHP and 2MB THP experiments. This can be attributed to IAA's better
compress/decompress latency and compression ratio as compared to
software compressors.

In my opinion, even though the test setup does not provide an accurate
way to make a direct before/after comparison (because zswap usage is
counted in the cgroup, hence towards memory.high), it still seems
reasonable for zswap_store to support mTHP, so that further performance
improvements can be implemented.

One of the ideas that has shown promise in our experiments is to improve
ZSWAP mTHP store performance using batching. With IAA compress/decompress
batching used in ZSWAP, we are able to demonstrate significant
performance improvements and memory savings with IAA in scalability
experiments, as compared to software compressors. We hope to submit
this work as subsequent RFCs.

cgroup zswap kselftest with 4G SSD as zswap's backing device:
=============================================================
mTHP 64K set to 'always'
zswap compressor set to 'lz4'
page-cluster = 3
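A sketch of how this configuration and test run can be reproduced from
a kernel source tree (the selftest invocation follows the usual
kselftest convention):

```shell
# Select 64K mTHP, lz4, and page-cluster 3, then run the cgroup
# kselftests (which include test_zswap).
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo lz4    > /sys/module/zswap/parameters/compressor
echo 3      > /proc/sys/vm/page-cluster
make -C tools/testing/selftests TARGETS=cgroup run_tests
```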

"Before":
=========
  Test run with v6.11-rc3 and no code changes:

  zswap shrinker_enabled = Y:
  ---------------------------
  not ok 1 test_zswap_usage
  not ok 2 test_swapin_nozswap
  not ok 3 test_zswapin
  # Failed to reclaim all of the requested memory
  not ok 4 test_zswap_writeback_enabled
  # Failed to reclaim all of the requested memory
  not ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  not ok 7 test_no_invasive_cgroup_shrink

"After":
========
  Test run with this patch-series and v6.11-rc3:

  zswap shrinker_enabled = Y:
  ---------------------------
  ok 1 test_zswap_usage
  not ok 2 test_swapin_nozswap
  ok 3 test_zswapin
  ok 4 test_zswap_writeback_enabled
  ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  not ok 7 test_no_invasive_cgroup_shrink

I haven't taken an in-depth look into the cgroup zswap tests, but it
looks like the results with the patch-series are no worse than without,
and in some cases better (this needs more analysis).

I would greatly appreciate your code review comments and suggestions!

Thanks,
Kanchana

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/



Kanchana P Sridhar (4):
  mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
  mm: zswap: zswap_store() extended to handle mTHP folios.
  mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
  mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.

 include/linux/huge_mm.h |   1 +
 mm/huge_memory.c        |   3 +
 mm/page_io.c            |   3 +-
 mm/zswap.c              | 238 +++++++++++++++++++++++++++++-----------
 4 files changed, 180 insertions(+), 65 deletions(-)


base-commit: 8c0b4f7b65fd1ca7af01267f491e815a40d77444

Comments

Huang, Ying Aug. 19, 2024, 3:16 a.m. UTC | #1
Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:

[snip]

>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> ZSWAP. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed. Following a similar methodology as in Ryan Roberts'
> "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> run, each allocating and writing 1G of memory:
>
>     usemem --init-time -w -O -n 70 1g
>
> Since I was constrained to get the 70 usemem processes to generate
> swapout activity with the 4G SSD, I ended up using different cgroup
> memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
>
> 64K mTHP experiments: cgroup memory fixed at 60G
> 2M THP experiments  : cgroup memory fixed at 55G
>
> The vm/sysfs stats included after the performance data provide details
> on the swapout activity to SSD/ZSWAP.
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressor  : LZ4, DEFLATE-IAA
>     ZSWAP Allocator   : ZSMALLOC
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput reported by usemem and perf sys time for running the test
> are as follows, averaged across 3 runs:
>
>  64KB mTHP (cgroup memory.high set to 60G):
>  ==========================================
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |

zswap throughput is worse than ssd swap?  This doesn't look right.

>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
>  |zswap-mTHP=Store    | ZSWAP lz4         |     265.43 |      -191% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
>   ------------------------------------------------------------------
>
>   -----------------------------------------------------------------------
>  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
>  |                              |   mainline |       Store |       Store |
>  |                              |            |         lz4 | deflate-iaa |
>  |-----------------------------------------------------------------------|
>  | pswpin                       |          0 |           0 |           0 |
>  | pswpout                      |    174,432 |           0 |           0 |
>  | zswpin                       |        703 |         534 |         721 |
>  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |

It appears that the number of swapped pages for zswap is much larger
than that of SSD swap.  Why?  I guess this is why zswap throughput is
worse.

>  |-----------------------------------------------------------------------|
>  | thp_swpout                   |          0 |           0 |           0 |
>  | thp_swpout_fallback          |          0 |           0 |           0 |
>  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
>   -----------------------------------------------------------------------
>

[snip]

--
Best Regards,
Huang, Ying
Kanchana P Sridhar Aug. 19, 2024, 5:12 a.m. UTC | #2
Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Sunday, August 18, 2024 8:17 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> 
> [snip]
> 
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> > ZSWAP. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed. Following a similar methodology as in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > run, each allocating and writing 1G of memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > Since I was constrained to get the 70 usemem processes to generate
> > swapout activity with the 4G SSD, I ended up using different cgroup
> > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
> >
> > 64K mTHP experiments: cgroup memory fixed at 60G
> > 2M THP experiments  : cgroup memory fixed at 55G
> >
> > The vm/sysfs stats included after the performance data provide details
> > on the swapout activity to SSD/ZSWAP.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressor  : LZ4, DEFLATE-IAA
> >     ZSWAP Allocator   : ZSMALLOC
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput reported by usemem and perf sys time for running the test
> > are as follows, averaged across 3 runs:
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> 
> zswap throughput is worse than ssd swap?  This doesn't look right.

I realize it might look that way; however, this is not an apples-to-apples
comparison, as explained in the latter part of my analysis (after the 2M THP
data tables). The primary reason is that the test runs under a fixed cgroup
memory limit.

In the "Before" scenario, mTHP get swapped out to SSD. However, the disk swap
usage is not accounted towards checking if the cgroup's memory limit has been
exceeded. Hence there are relatively fewer swap-outs, resulting mainly from the
1G allocations from each of the 70 usemem processes working with a 60G memory
limit on the parent cgroup.

However, the picture changes in the "After" scenario. mTHPs will now get stored in
zswap, which is accounted for in the cgroup's memory.current and counts
towards the fixed memory limit in effect for the parent cgroup. As a result, when
mTHP get stored in zswap, the mTHP compressed data in the zswap zpool now
count towards the cgroup's active memory and memory limit. This is in addition
to the 1G allocations from each of the 70 processes.

As you can see, this creates more memory pressure on the cgroup, resulting in
more swap-outs. With lz4 as the zswap compressor, this results in lower
throughput relative to "Before".

However, with IAA as the zswap compressor, the throughput with zswap mTHP is
better than "Before" because of better hardware compress latencies, which handle
the higher swap-out activity without compromising on throughput.

> 
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >  |zswap-mTHP=Store    | ZSWAP lz4         |     265.43 |      -191% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
> >   ------------------------------------------------------------------
> >
> >   -----------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-
> mTHP |
> >  |                              |   mainline |       Store |       Store |
> >  |                              |            |         lz4 | deflate-iaa |
> >  |-----------------------------------------------------------------------|
> >  | pswpin                       |          0 |           0 |           0 |
> >  | pswpout                      |    174,432 |           0 |           0 |
> >  | zswpin                       |        703 |         534 |         721 |
> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> 
> It appears that the number of swapped pages for zswap is much larger
> than that of SSD swap.  Why?  I guess this is why zswap throughput is
> worse.

Your observation is correct. I hope the above explanation helps as to the
reasoning behind this.

Thanks,
Kanchana

> 
> >  |-----------------------------------------------------------------------|
> >  | thp_swpout                   |          0 |           0 |           0 |
> >  | thp_swpout_fallback          |          0 |           0 |           0 |
> >  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> >   -----------------------------------------------------------------------
> >
> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying Aug. 19, 2024, 5:51 a.m. UTC | #3
"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

> Hi Ying,
>
>> -----Original Message-----
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Sunday, August 18, 2024 8:17 PM
>> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
>> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
>> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
>> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
>> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
>> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>> 
>> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
>> 
>> [snip]
>> 
>> >
>> > Performance Testing:
>> > ====================
>> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
>> > and with this patch-series, on an Intel Sapphire Rapids server,
>> > dual-socket 56 cores per socket, 4 IAA devices per socket.
>> >
>> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
>> > ZSWAP. Core frequency was fixed at 2500MHz.
>> >
>> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
>> > was fixed. Following a similar methodology as in Ryan Roberts'
>> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
>> > run, each allocating and writing 1G of memory:
>> >
>> >     usemem --init-time -w -O -n 70 1g
>> >
>> > Since I was constrained to get the 70 usemem processes to generate
>> > swapout activity with the 4G SSD, I ended up using different cgroup
>> > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
>> >
>> > 64K mTHP experiments: cgroup memory fixed at 60G
>> > 2M THP experiments  : cgroup memory fixed at 55G
>> >
>> > The vm/sysfs stats included after the performance data provide details
>> > on the swapout activity to SSD/ZSWAP.
>> >
>> > Other kernel configuration parameters:
>> >
>> >     ZSWAP Compressor  : LZ4, DEFLATE-IAA
>> >     ZSWAP Allocator   : ZSMALLOC
>> >     SWAP page-cluster : 2
>> >
>> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
>> > IAA "compression verification" is enabled. Hence each IAA compression
>> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
>> > returned by the hardware will be compared and errors reported in case of
>> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
>> > compared to the software compressors.
>> >
>> > Throughput reported by usemem and perf sys time for running the test
>> > are as follows, averaged across 3 runs:
>> >
>> >  64KB mTHP (cgroup memory.high set to 60G):
>> >  ==========================================
>> >   ------------------------------------------------------------------
>> >  |                    |                   |            |            |
>> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>> >  |                    |                   |       KB/s |            |
>> >  |--------------------|-------------------|------------|------------|
>> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
>> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
>> 
>> zswap throughput is worse than ssd swap?  This doesn't look right.
>
> I realize it might look that way, however, this is not an apples-to-apples comparison,
> as explained in the latter part of my analysis (after the 2M THP data tables).
> The primary reason for this is because of running the test under a fixed
> cgroup memory limit.
>
> In the "Before" scenario, mTHP get swapped out to SSD. However, the disk swap
> usage is not accounted towards checking if the cgroup's memory limit has been
> exceeded. Hence there are relatively fewer swap-outs, resulting mainly from the
> 1G allocations from each of the 70 usemem processes working with a 60G memory
> limit on the parent cgroup.
>
> However, the picture changes in the "After" scenario. mTHPs will now get stored in
> zswap, which is accounted for in the cgroup's memory.current and counts
> towards the fixed memory limit in effect for the parent cgroup. As a result, when
> mTHP get stored in zswap, the mTHP compressed data in the zswap zpool now
> count towards the cgroup's active memory and memory limit. This is in addition
> to the 1G allocations from each of the 70 processes.
>
> As you can see, this creates more memory pressure on the cgroup, resulting in
> more swap-outs. With lz4 as the zswap compressor, this results in lesser throughput
> wrt "Before".
>
> However, with IAA as the zswap compressor, the throughout with zswap mTHP is
> better than "Before" because of better hardware compress latencies, which handle
> the higher swap-out activity without compromising on throughput.
>
>> 
>> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
>> >  |------------------------------------------------------------------|
>> >  |                    |                   |            |            |
>> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
>> >  |                    |                   |        sec |            |
>> >  |--------------------|-------------------|------------|------------|
>> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
>> >  |zswap-mTHP=Store    | ZSWAP lz4         |     265.43 |      -191% |
>> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
>> >   ------------------------------------------------------------------
>> >
>> >   -----------------------------------------------------------------------
>> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-
>> mTHP |
>> >  |                              |   mainline |       Store |       Store |
>> >  |                              |            |         lz4 | deflate-iaa |
>> >  |-----------------------------------------------------------------------|
>> >  | pswpin                       |          0 |           0 |           0 |
>> >  | pswpout                      |    174,432 |           0 |           0 |
>> >  | zswpin                       |        703 |         534 |         721 |
>> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
>> 
>> It appears that the number of swapped pages for zswap is much larger
>> than that of SSD swap.  Why?  I guess this is why zswap throughput is
>> worse.
>
> Your observation is correct. I hope the above explanation helps as to the
> reasoning behind this.

Before:
(174432 + 1501) * 4 / 1024 = 687.2 MB

After:
1491654 * 4.0 / 1024 = 5826.8 MB

From your previous words, 10GB memory should be swapped out.

Even if the average compression ratio is 0, the swap-out count of zswap
should be about 100% more than that of SSD.  However, the ratio here
appears unreasonable.
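For reference, the volumes above can be double-checked with a few lines of
Python (4 KiB pages assumed):

```python
# Double-check of the swap-out volumes quoted above (4 KiB pages assumed).
page_kib = 4

before_pages = 174_432 + 1_501   # pswpout + zswpout, v6.11-rc3 mainline
after_pages = 1_491_654          # zswpout, zswap-mTHP Store lz4

print(round(before_pages * page_kib / 1024, 1))  # 687.2 MB
print(round(after_pages * page_kib / 1024, 1))   # 5826.8 MB
print(round(after_pages / before_pages, 1))      # 8.5x more pages swapped
```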

--
Best Regards,
Huang, Ying

> Thanks,
> Kanchana
>
>> 
>> >  |-----------------------------------------------------------------------|
>> >  | thp_swpout                   |          0 |           0 |           0 |
>> >  | thp_swpout_fallback          |          0 |           0 |           0 |
>> >  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
>> >  |-----------------------------------------------------------------------|
>> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
>> >  |-----------------------------------------------------------------------|
>> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
>> >   -----------------------------------------------------------------------
>> >
>> 
>> [snip]
>> 
>> --
>> Best Regards,
>> Huang, Ying
Kanchana P Sridhar Aug. 20, 2024, 3 a.m. UTC | #4
Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Sunday, August 18, 2024 10:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> 
> > Hi Ying,
> >
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@intel.com>
> >> Sent: Sunday, August 18, 2024 8:17 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> >> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> >> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> >> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >>
> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Performance Testing:
> >> > ====================
> >> > Testing of this patch-series was done with the v6.11-rc3 mainline,
> >> > without and with this patch-series, on an Intel Sapphire Rapids server,
> >> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >> >
> >> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device
> >> > for ZSWAP. Core frequency was fixed at 2500MHz.
> >> >
> >> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> >> > was fixed. Following a similar methodology as in Ryan Roberts'
> >> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> >> > run, each allocating and writing 1G of memory:
> >> >
> >> >     usemem --init-time -w -O -n 70 1g
> >> >
> >> > Since I was constrained to get the 70 usemem processes to generate
> >> > swapout activity with the 4G SSD, I ended up using different cgroup
> >> > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
> >> >
> >> > 64K mTHP experiments: cgroup memory fixed at 60G
> >> > 2M THP experiments  : cgroup memory fixed at 55G
> >> >
> >> > The vm/sysfs stats included after the performance data provide details
> >> > on the swapout activity to SSD/ZSWAP.
> >> >
> >> > Other kernel configuration parameters:
> >> >
> >> >     ZSWAP Compressor  : LZ4, DEFLATE-IAA
> >> >     ZSWAP Allocator   : ZSMALLOC
> >> >     SWAP page-cluster : 2
> >> >
> >> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> >> > IAA "compression verification" is enabled. Hence each IAA compression
> >> > will be decompressed internally by the "iaa_crypto" driver, the CRCs
> >> > returned by the hardware will be compared, and errors reported in case
> >> > of mismatches. Thus "deflate-iaa" helps ensure better data integrity
> >> > compared to the software compressors.
> >> >
> >> > Throughput reported by usemem and perf sys time for running the test
> >> > are as follows, averaged across 3 runs:
> >> >
> >> >  64KB mTHP (cgroup memory.high set to 60G):
> >> >  ==========================================
> >> >   ------------------------------------------------------------------
> >> >  |                    |                   |            |            |
> >> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >> >  |                    |                   |       KB/s |            |
> >> >  |--------------------|-------------------|------------|------------|
> >> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> >>
> >> zswap throughput is worse than ssd swap?  This doesn't look right.
> >
> > I realize it might look that way; however, this is not an apples-to-apples
> > comparison, as explained in the latter part of my analysis (after the 2M
> > THP data tables). The primary reason for this is running the test under a
> > fixed cgroup memory limit.
> >
> > In the "Before" scenario, mTHP get swapped out to SSD. However, the disk
> > swap usage is not accounted towards checking if the cgroup's memory limit
> > has been exceeded. Hence there are relatively fewer swap-outs, resulting
> > mainly from the 1G allocations from each of the 70 usemem processes
> > working with a 60G memory limit on the parent cgroup.
> >
> > However, the picture changes in the "After" scenario. mTHPs will now get
> > stored in zswap, which is accounted for in the cgroup's memory.current and
> > counts towards the fixed memory limit in effect for the parent cgroup. As
> > a result, when mTHP get stored in zswap, the mTHP compressed data in the
> > zswap zpool now counts towards the cgroup's active memory and memory
> > limit. This is in addition to the 1G allocations from each of the 70
> > processes.
> >
> > As you can see, this creates more memory pressure on the cgroup, resulting
> > in more swap-outs. With lz4 as the zswap compressor, this results in lower
> > throughput relative to "Before".
> >
> > However, with IAA as the zswap compressor, the throughput with zswap mTHP
> > is better than "Before" because of better hardware compress latencies,
> > which handle the higher swap-out activity without compromising on
> > throughput.
> >
> >>
> >> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> >> >  |------------------------------------------------------------------|
> >> >  |                    |                   |            |            |
> >> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >> >  |                    |                   |        sec |            |
> >> >  |--------------------|-------------------|------------|------------|
> >> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >> >  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> >> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
> >> >   ------------------------------------------------------------------
> >> >
> >> >   -----------------------------------------------------------------------
> >> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >> >  |                              |   mainline |       Store |       Store |
> >> >  |                              |            |         lz4 | deflate-iaa |
> >> >  |-----------------------------------------------------------------------|
> >> >  | pswpin                       |          0 |           0 |           0 |
> >> >  | pswpout                      |    174,432 |           0 |           0 |
> >> >  | zswpin                       |        703 |         534 |         721 |
> >> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> >>
> >> It appears that the number of swapped pages for zswap is much larger
> >> than that of SSD swap.  Why?  I guess this is why zswap throughput is
> >> worse.
> >
> > Your observation is correct. I hope the above explanation helps as to the
> > reasoning behind this.
> 
> Before:
> (174432 + 1501) * 4 / 1024 = 687.2 MB
> 
> After:
> 1491654 * 4.0 / 1024 = 5826.8 MB
> 
> From your previous words, 10GB memory should be swapped out.
> 
> Even if the average compression ratio is 0, the swap-out count of zswap
> should be about 100% more than that of SSD.  However, the ratio here
> appears unreasonable.

Excellent point! In order to understand this better myself, I ran usemem with
1 process that tries to allocate 58G:

cgroup memory.high = 60,000,000,000
usemem --init-time -w -O -n 1 58g

usemem -n 1 58g         Before          After
----------------------------------------------
pswpout                 586,352             0
zswpout                   1,005     1,042,963
----------------------------------------------
Total swapout           587,357     1,042,963
----------------------------------------------
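In MB terms, the 1-process run can be summarized with a quick Python check
(4 KiB pages assumed), showing zswap swapping out roughly 78% more pages:

```python
# Swap-out totals for the 1-process usemem run (4 KiB pages assumed).
page_kib = 4

before = 586_352 + 1_005   # pswpout + zswpout
after = 1_042_963          # zswpout only

print(round(before * page_kib / 1024))   # 2294 MB
print(round(after * page_kib / 1024))    # 4074 MB
print(round(after / before, 2))          # 1.78x
```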

In the case where the cgroup has only 1 process, your rationale above applies
(more or less). The tables below show stats collected every 100 microseconds
from the critical section of the workload, right before the memory limit is
reached (Before and After):

===========================================================================
BEFORE zswap_store mTHP:
===========================================================================
 cgroup_memory        cgroup_memory      zswap_pool        zram_compr
                          w/o zswap     _total_size        _data_size
---------------------------------------------------------------------------
59,999,600,640       59,999,600,640               0                74
59,999,911,936       59,999,911,936               0        14,139,441
60,000,083,968       59,997,634,560       2,449,408        53,448,205
59,999,952,896       59,997,503,488       2,449,408        93,477,490
60,000,083,968       59,997,634,560       2,449,408       133,152,754
60,000,083,968       59,997,634,560       2,449,408       172,628,328
59,999,952,896       59,997,503,488       2,449,408       212,760,840
60,000,083,968       59,997,634,560       2,449,408       251,999,675
60,000,083,968       59,997,634,560       2,449,408       291,058,130
60,000,083,968       59,997,634,560       2,449,408       329,655,206
59,999,793,152       59,997,343,744       2,449,408       368,938,904
59,999,924,224       59,997,474,816       2,449,408       408,652,723
59,999,924,224       59,997,474,816       2,449,408       447,830,071
60,000,055,296       59,997,605,888       2,449,408       487,776,082
59,999,924,224       59,997,474,816       2,449,408       526,826,360
60,000,055,296       59,997,605,888       2,449,408       566,193,520
60,000,055,296       59,997,605,888       2,449,408       604,625,879
60,000,055,296       59,997,605,888       2,449,408       642,545,706
59,999,924,224       59,997,474,816       2,449,408       681,958,173
59,999,924,224       59,997,474,816       2,449,408       721,908,162
59,999,924,224       59,997,474,816       2,449,408       761,935,307
59,999,924,224       59,997,474,816       2,449,408       802,014,594
59,999,924,224       59,997,474,816       2,449,408       842,087,656
59,999,924,224       59,997,474,816       2,449,408       883,889,588
59,999,924,224       59,997,474,816       2,449,408       804,458,184
59,999,793,152       59,997,343,744       2,449,408        94,150,548
54,938,513,408       54,936,064,000       2,449,408           172,644
29,492,523,008       29,490,073,600       2,449,408           172,644
 3,465,621,504        3,463,172,096       2,449,408           131,457
---------------------------------------------------------------------------


===========================================================================
AFTER zswap_store mTHP:
===========================================================================
 cgroup_memory        cgroup_memory          zswap_pool
                          w/o zswap         _total_size
---------------------------------------------------------------------------
55,578,234,880       55,578,234,880                   0
56,104,095,744       56,104,095,744                   0
56,644,898,816       56,644,898,816                   0
57,184,653,312       57,184,653,312                   0
57,706,057,728       57,706,057,728                   0
58,226,937,856       58,226,937,856                   0
58,747,293,696       58,747,293,696                   0
59,275,776,000       59,275,776,000                   0
59,793,772,544       59,793,772,544                   0
60,000,141,312       60,000,141,312                   0
59,999,956,992       59,999,956,992                   0
60,000,169,984       60,000,169,984                   0
59,999,907,840       59,951,226,880          48,680,960
60,000,169,984       59,900,010,496         100,159,488
60,000,169,984       59,848,007,680         152,162,304
60,000,169,984       59,795,513,344         204,656,640
59,999,907,840       59,743,477,760         256,430,080
60,000,038,912       59,692,097,536         307,941,376
60,000,169,984       59,641,208,832         358,961,152
60,000,038,912       59,589,992,448         410,046,464
60,000,169,984       59,539,005,440         461,164,544
60,000,169,984       59,487,657,984         512,512,000
60,000,038,912       59,434,868,736         565,170,176
60,000,038,912       59,383,259,136         616,779,776
60,000,169,984       59,331,518,464         668,651,520
60,000,169,984       59,279,843,328         720,326,656
60,000,169,984       59,228,626,944         771,543,040
59,999,907,840       59,176,984,576         822,923,264
60,000,038,912       59,124,326,400         875,712,512
60,000,169,984       59,072,454,656         927,715,328
60,000,169,984       59,020,156,928         980,013,056
60,000,038,912       58,966,974,464       1,033,064,448
60,000,038,912       58,913,628,160       1,086,410,752
60,000,038,912       58,858,840,064       1,141,198,848
60,000,169,984       58,804,314,112       1,195,855,872
59,999,907,840       58,748,936,192       1,250,971,648
60,000,169,984       58,695,131,136       1,305,038,848
60,000,169,984       58,642,800,640       1,357,369,344
60,000,169,984       58,589,782,016       1,410,387,968
60,000,038,912       58,535,124,992       1,464,913,920
60,000,169,984       58,482,925,568       1,517,244,416
60,000,169,984       58,429,775,872       1,570,394,112
60,000,038,912       58,376,658,944       1,623,379,968
60,000,169,984       58,323,247,104       1,676,922,880
60,000,038,912       58,271,113,216       1,728,925,696
60,000,038,912       58,216,292,352       1,783,746,560
60,000,038,912       58,164,289,536       1,835,749,376
60,000,038,912       58,112,090,112       1,887,948,800
60,000,038,912       58,058,350,592       1,941,688,320
59,999,907,840       58,004,971,520       1,994,936,320
60,000,169,984       57,953,165,312       2,047,004,672
59,999,907,840       57,900,277,760       2,099,630,080
60,000,038,912       57,847,586,816       2,152,452,096
60,000,169,984       57,793,421,312       2,206,748,672
59,999,907,840       57,741,582,336       2,258,325,504
60,012,826,624       57,734,840,320       2,277,986,304
60,098,793,472       57,820,348,416       2,278,445,056
60,176,334,848       57,897,889,792       2,278,445,056
60,269,826,048       57,991,380,992       2,278,445,056
59,687,481,344       57,851,977,728       1,835,503,616
59,049,836,544       57,888,108,544       1,161,728,000
58,406,068,224       57,929,551,872         476,516,352
43,837,923,328       43,837,919,232               4,096
18,124,546,048       18,124,541,952               4,096
     2,846,720            2,842,624               4,096
---------------------------------------------------------------------------

I have also attached plots of the memory pressure reported by PSI. Both these sets
of data should give a sense of the added memory pressure on the cgroup because of
zswap mTHP stores. The data shows that the cgroup is over the limit much more
frequently in the "After" than in "Before". However, the rationale that you suggested
seems more reasonable and apparent in the 1 process case.

However, with 70 processes trying to allocate 1G, things get more complicated.
These are the functions that should provide more clarity:

[1] mm/memcontrol.c: mem_cgroup_handle_over_high().
[2] mm/memcontrol.c: try_charge_memcg().
[3] include/linux/resume_user_mode.h: resume_user_mode_work().

At a high level, when zswap mTHP compressed pool usage starts counting towards
cgroup.memory.current, there are two inter-related effects occurring that ultimately
cause more reclaim to happen:

1) When each process reclaims a folio and zswap_store() writes out each page
   in the folio, it charges the compressed size to the memcg:
   "obj_cgroup_charge_zswap(objcg, entry->length);". This calls [2] and sets
   current->memcg_nr_pages_over_high if the limit is exceeded. The comments
   towards the end of [2] are relevant.
2) When each of the processes returns from a page-fault, it checks if the
   cgroup memory usage is over the limit in [3], and if so, it will trigger
   reclaim.

I confirmed that in the case of usemem, all calls to [1] occur from the code path in [3].
However, my takeaway is that the more reclaim that goes through zswap_store(),
e.g. from mTHP folios, the higher the likelihood of an overage being recorded
per-process in current->memcg_nr_pages_over_high. This could cause each
process to reclaim memory, even though the swapout from just a few of the 70
processes might have already brought the parent cgroup under the limit.
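To make the sequencing concrete, here is a toy Python model (purely
illustrative; the class and constants are invented for this sketch and are not
kernel APIs) of how charging the compressed size before the folio is uncharged
can record an overage even though the store ultimately reduces usage:

```python
# Toy model: zswap_store() charges the compressed bytes to the memcg
# before the original folio is freed and uncharged, so usage can briefly
# exceed the limit even when the steady state ends up under it.
PAGE = 4096

class Memcg:
    def __init__(self, high):
        self.high = high
        self.usage = 0
        self.nr_over_high_events = 0   # stand-in for memcg_nr_pages_over_high

    def charge(self, nbytes):
        self.usage += nbytes
        if self.usage > self.high:
            self.nr_over_high_events += 1   # process will reclaim on return

    def uncharge(self, nbytes):
        self.usage -= nbytes

def zswap_store_folio(memcg, nr_pages, ratio):
    # Charge the compressed size first (obj_cgroup_charge_zswap)...
    memcg.charge(int(nr_pages * PAGE * ratio))
    # ...only then is the folio itself freed and uncharged.
    memcg.uncharge(nr_pages * PAGE)

memcg = Memcg(high=100 * PAGE)
memcg.usage = 100 * PAGE                           # already at the limit
zswap_store_folio(memcg, nr_pages=16, ratio=0.3)   # 64K mTHP, 30% ratio

print(memcg.nr_over_high_events)    # 1: the charge briefly pushed us over
print(memcg.usage < memcg.high)     # True: steady state is under the limit
```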

Please do let me know if you have any other questions. Appreciate your feedback
and comments.

Thanks,
Kanchana

> 
> --
> Best Regards,
> Huang, Ying
> 
> > Thanks,
> > Kanchana
> >
> >>
> >> >  |-----------------------------------------------------------------------|
> >> >  | thp_swpout                   |          0 |           0 |           0 |
> >> >  | thp_swpout_fallback          |          0 |           0 |           0 |
> >> >  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
> >> >  |-----------------------------------------------------------------------|
> >> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> >> >  |-----------------------------------------------------------------------|
> >> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> >> >   -----------------------------------------------------------------------
> >> >
> >>
> >> [snip]
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying
Nhat Pham Aug. 20, 2024, 9:13 p.m. UTC | #5
On Mon, Aug 19, 2024 at 11:01 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Ying,
>
> I confirmed that in the case of usemem, all calls to [1] occur from the code path in [3].
> However, my takeaway is that the more reclaim that goes through zswap_store(),
> e.g. from mTHP folios, the higher the likelihood of an overage being recorded
> per-process in current->memcg_nr_pages_over_high. This could cause each
> process to reclaim memory, even though the swapout from just a few of the 70
> processes might have already brought the parent cgroup under the limit.

Yeah IIUC, the memory increase from zswap store happens
immediately/synchronously (swap_writepage() -> zswap_store() ->
obj_cgroup_charge_zswap()), before the memory saving kicks in. This is
a non-issue for swap - the memory saving doesn't happen right away,
but it also doesn't increase memory usage (well, as you pointed out,
obj_cgroup_charge_zswap() doesn't even happen).

And yes, this is compounded a) if you're in a high concurrency regime,
where all tasks in the same cgroup, under memory pressure, all go into
reclaim, and b) for larger folios, where we compress multiple pages
before the saving happens. I wonder how bad the effect is tho - could
you quantify the reclamation amount that happens per zswap store
somehow with tracing magic?

Also, I wonder if there is a "charge delta" mechanism, where we
directly uncharge by (page size - zswap object size), to avoid the
temporary double charging... Sort of like what folio migration is
doing now vs. what it used to do. Seems complicated - not even sure if
it's possible TBH.
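A rough sketch of what that might look like (hypothetical; the function names
are invented, and real memcg accounting is page-granular and more involved):

```python
# Hypothetical comparison: the current two-step accounting in zswap_store()
# vs. a single net "charge delta". Names are invented; this is an
# illustrative model, not kernel code.
PAGE = 4096

def store_two_step(usage, nr_pages, ratio):
    # Current behaviour: charge the compressed copy, then uncharge the folio.
    peak = usage + int(nr_pages * PAGE * ratio)     # transient spike
    final = peak - nr_pages * PAGE
    return final, peak

def store_delta(usage, nr_pages, ratio):
    # Hypothetical: one net uncharge of (folio size - compressed size).
    final = usage - (nr_pages * PAGE - int(nr_pages * PAGE * ratio))
    return final, max(usage, final)                 # never spikes above usage

start = 1000 * PAGE
final_a, peak_a = store_two_step(start, 512, 0.3)   # one 2M THP, 30% ratio
final_b, peak_b = store_delta(start, 512, 0.3)

print(final_a == final_b)   # True: identical steady state
print(peak_a > start)       # True: two-step transiently exceeds the start
print(peak_b == start)      # True: delta never exceeds the start
```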

>
> Please do let me know if you have any other questions. Appreciate your feedback
> and comments.
>
> Thanks,
> Kanchana
Kanchana P Sridhar Aug. 20, 2024, 10:09 p.m. UTC | #6
Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 20, 2024 2:14 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Huang, Ying <ying.huang@intel.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Mon, Aug 19, 2024 at 11:01 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Ying,
> >
> > I confirmed that in the case of usemem, all calls to [1] occur from the code
> > path in [3].
> > However, my takeaway is that the more reclaim that goes through
> > zswap_store(), e.g. from mTHP folios, the higher the likelihood of an
> > overage being recorded per-process in current->memcg_nr_pages_over_high.
> > This could cause each process to reclaim memory, even though the swapout
> > from just a few of the 70 processes might have already brought the parent
> > cgroup under the limit.
> 
> Yeah IIUC, the memory increase from zswap store happens
> immediately/synchronously (swap_writepage() -> zswap_store() ->
> obj_cgroup_charge_zswap()), before the memory saving kicks in. This is
> a non-issue for swap - the memory saving doesn't happen right away,
> but it also doesn't increase memory usage (well, as you pointed out,
> obj_cgroup_charge_zswap() doesn't even happen).
> 
> And yes, this is compounded a) if you're in a high concurrency regime,
> where all tasks in the same cgroup, under memory pressure, all go into
> reclaim. and b) for larger folios, where we compress multiple pages
> before the saving happens. I wonder how bad the effect is tho - could
> you quantify the reclamation amount that happens per zswap store
> somehow with tracing magic?

Thanks very much for the detailed comments and explanations!
Sure, I will gather data on the reclamation amount that happens per
zswap store and share.

> 
> Also, I wonder if there is a "charge delta" mechanism, where we
> directly uncharge by (page size - zswap object size), to avoid the
> temporary double charging... Sort of like what folio migration is
> doing now v.s what it used to do. Seems complicated - not even sure if
> it's possible TBH.

Yes, this is a very interesting idea. I will also look into the feasibility of
doing this in the shrink_folio_list()->swap_writepage()->zswap_store()
path.

Thanks again for the discussion, really appreciate it.

Thanks,
Kanchana

> 
> >
> > Please do let me know if you have any other questions. Appreciate your
> feedback
> > and comments.
> >
> > Thanks,
> > Kanchana
Nhat Pham Aug. 21, 2024, 2:42 p.m. UTC | #7
On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs.
>
> For instance, the determination of whether a folio is same-filled is
> based on mapping an index into the folio to derive the page. Likewise,
> there is a function "zswap_store_entry" added to store a zswap_entry in
> the xarray.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> ZSWAP. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed. Following a similar methodology as in Ryan Roberts'
> "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> run, each allocating and writing 1G of memory:
>
>     usemem --init-time -w -O -n 70 1g
>
> Since I was constrained to get the 70 usemem processes to generate
> swapout activity with the 4G SSD, I ended up using different cgroup
> memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
>
> 64K mTHP experiments: cgroup memory fixed at 60G
> 2M THP experiments  : cgroup memory fixed at 55G
>
> The vm/sysfs stats included after the performance data provide details
> on the swapout activity to SSD/ZSWAP.
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressor  : LZ4, DEFLATE-IAA
>     ZSWAP Allocator   : ZSMALLOC
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the CRCs
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput reported by usemem and perf sys time for running the test
> are as follows, averaged across 3 runs:
>
>  64KB mTHP (cgroup memory.high set to 60G):
>  ==========================================
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
>   ------------------------------------------------------------------

Yeah no, this is not good. That throughput regression is concerning...

Is this tied to lz4 only, or do you observe similar trends in other
compressors that are not deflate-iaa?


>
>   -----------------------------------------------------------------------
>  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
>  |                              |   mainline |       Store |       Store |
>  |                              |            |         lz4 | deflate-iaa |
>  |-----------------------------------------------------------------------|
>  | pswpin                       |          0 |           0 |           0 |
>  | pswpout                      |    174,432 |           0 |           0 |
>  | zswpin                       |        703 |         534 |         721 |
>  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
>  |-----------------------------------------------------------------------|
>  | thp_swpout                   |          0 |           0 |           0 |
>  | thp_swpout_fallback          |          0 |           0 |           0 |
>  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
>   -----------------------------------------------------------------------
>

Yeah this is not good. Something fishy is going on, if we see this
ginormous jump from 175000 (z)swpout pages to almost 1.5 million
pages. That's a massive jump.

Either it's:

1.Your theory - zswap store keeps banging on the limit (which suggests
incompatibility between the way zswap currently behaves and our
reclaim logic)

2. The data here is ridiculously incompressible. We're needing to
zswpout roughly 8.5 times the number of pages, so the saving is 8.5x
less => we only save 11.76% of memory for each page??? That's not
right...

3. There's an outright bug somewhere.

Very suspicious.
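For reference, the 8.5x figure can be sanity-checked directly from the
quoted 64K mTHP vmstat counters, treating pswpout + zswpout as the total
pages swapped out per run (a rough back-of-envelope, not a measurement):

```python
# Back-of-envelope check of the "8.5x" observation, using the 64K mTHP
# vmstat counters quoted above (v6.11-rc3 baseline vs zswap-mTHP lz4).
baseline_pages = 174_432 + 1_501   # pswpout + zswpout, SSD baseline
zswap_pages = 1_491_654            # zswpout, zswap-mTHP lz4 Store

ratio = zswap_pages / baseline_pages
# If relieving the same pressure takes ~8.5x as many page-outs, the
# implied net memory saving per compressed page is only 1/ratio.
implied_saving = 1 / ratio

print(f"page-out ratio: {ratio:.2f}x, implied saving: {implied_saving:.1%}")
```

which comes out to roughly an 8.5x page-out ratio and an implied
per-page saving near the 11.76% estimated above.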

>
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
>  =======================================================
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |    190,827 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |     32,026 |       -83% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    203,772 |         7% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |      27.23 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |     156.52 |      -475% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     171.45 |      -530% |
>   ------------------------------------------------------------------

I'm confused. This is a *regression*, right? A massive one at that -
sys time is *more* than 5 times the old value?

>
>   -------------------------------------------------------------------------
>  | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
>  |                                |   mainline |       Store |       Store |
>  |                                |            |         lz4 | deflate-iaa |
>  |-------------------------------------------------------------------------|
>  | pswpin                         |          0 |           0 |           0 |
>  | pswpout                        |    797,184 |           0 |           0 |
>  | zswpin                         |        690 |         649 |         669 |
>  | zswpout                        |      1,465 |   1,596,382 |   1,540,766 |
>  |-------------------------------------------------------------------------|
>  | thp_swpout                     |      1,557 |           0 |           0 |
>  | thp_swpout_fallback            |          0 |       3,248 |       3,752 |

This is also increased, but I suppose we're just doing more
(z)swapping out in general...

>  | pgmajfault                     |      3,726 |       6,470 |       5,691 |
>  |-------------------------------------------------------------------------|
>  | hugepages-2048kB/stats/zswpout |            |       2,416 |       2,261 |
>  |-------------------------------------------------------------------------|
>  | hugepages-2048kB/stats/swpout  |      1,557 |           0 |           0 |
>   -------------------------------------------------------------------------
>

I'm not trying to delay this patch - I fully believe in supporting
zswap for larger pages (both mTHP and THP - whatever the memory
reclaim subsystem throws at us).

But we need to get to the bottom of this :) These are very suspicious
and concerning data. If this is something urgent, I can live with a
gate to enable/disable this, but I'd much prefer we understand what's
going on here.
Kanchana P Sridhar Aug. 21, 2024, 7:07 p.m. UTC | #8
Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Wednesday, August 21, 2024 7:43 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs.
> >
> > For instance, the determination of whether a folio is same-filled is
> > based on mapping an index into the folio to derive the page. Likewise,
> > there is a function "zswap_store_entry" added to store a zswap_entry in
> > the xarray.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on
> swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> > ZSWAP. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed. Following a similar methodology as in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > run, each allocating and writing 1G of memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > Since I was constrained to get the 70 usemem processes to generate
> > swapout activity with the 4G SSD, I ended up using different cgroup
> > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
> >
> > 64K mTHP experiments: cgroup memory fixed at 60G
> > 2M THP experiments  : cgroup memory fixed at 55G
> >
> > The vm/sysfs stats included after the performance data provide details
> > on the swapout activity to SSD/ZSWAP.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressor  : LZ4, DEFLATE-IAA
> >     ZSWAP Allocator   : ZSMALLOC
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput reported by usemem and perf sys time for running the test
> > are as follows, averaged across 3 runs:
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
> >   ------------------------------------------------------------------
> 
> Yeah no, this is not good. That throughput regression is concerning...
> 
> Is this tied to lz4 only, or do you observe similar trends in other
> compressors that are not deflate-iaa?

Let me gather data with other software compressors.

> 
> 
> >
> >   -----------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  |                              |   mainline |       Store |       Store |
> >  |                              |            |         lz4 | deflate-iaa |
> >  |-----------------------------------------------------------------------|
> >  | pswpin                       |          0 |           0 |           0 |
> >  | pswpout                      |    174,432 |           0 |           0 |
> >  | zswpin                       |        703 |         534 |         721 |
> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> >  |-----------------------------------------------------------------------|
> >  | thp_swpout                   |          0 |           0 |           0 |
> >  | thp_swpout_fallback          |          0 |           0 |           0 |
> >  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> >   -----------------------------------------------------------------------
> >
> 
> Yeah this is not good. Something fishy is going on, if we see this
> ginormous jump from 175000 (z)swpout pages to almost 1.5 million
> pages. That's a massive jump.
> 
> Either it's:
> 
> 1. Your theory - zswap store keeps banging on the limit (which suggests
> incompatibility between the way zswap currently behaves and our
> reclaim logic)
> 
> 2. The data here is ridiculously incompressible. We're needing to
> zswpout roughly 8.5 times the number of pages, so the saving is 8.5x
> less => we only save 11.76% of memory for each page??? That's not
> right...
> 
> 3. There's an outright bug somewhere.
> 
> Very suspicious.
> 
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> >  =======================================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    190,827 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     32,026 |       -83% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    203,772 |         7% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      27.23 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     156.52 |      -475% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     171.45 |      -530% |
> >   ------------------------------------------------------------------
> 
> I'm confused. This is a *regression*, right? A massive one at that -
> sys time is *more* than 5 times the old value?
> 
> >
> >   -------------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  |                                |   mainline |       Store |       Store |
> >  |                                |            |         lz4 | deflate-iaa |
> >  |-------------------------------------------------------------------------|
> >  | pswpin                         |          0 |           0 |           0 |
> >  | pswpout                        |    797,184 |           0 |           0 |
> >  | zswpin                         |        690 |         649 |         669 |
> >  | zswpout                        |      1,465 |   1,596,382 |   1,540,766 |
> >  |-------------------------------------------------------------------------|
> >  | thp_swpout                     |      1,557 |           0 |           0 |
> >  | thp_swpout_fallback            |          0 |       3,248 |       3,752 |
> 
> This is also increased, but I suppose we're just doing more
> (z)swapping out in general...
> 
> >  | pgmajfault                     |      3,726 |       6,470 |       5,691 |
> >  |-------------------------------------------------------------------------|
> >  | hugepages-2048kB/stats/zswpout |            |       2,416 |       2,261 |
> >  |-------------------------------------------------------------------------|
> >  | hugepages-2048kB/stats/swpout  |      1,557 |           0 |           0 |
> >   -------------------------------------------------------------------------
> >
> 
> I'm not trying to delay this patch - I fully believe in supporting
> zswap for larger pages (both mTHP and THP - whatever the memory
> reclaim subsystem throws at us).
> 
> But we need to get to the bottom of this :) These are very suspicious
> and concerning data. If this is something urgent, I can live with a
> gate to enable/disable this, but I'd much prefer we understand what's
> going on here.

Thanks for this analysis. I will debug this some more, so we can better
understand these results.

Thanks,
Kanchana
Yosry Ahmed Aug. 24, 2024, 3:09 a.m. UTC | #9
[..]
>
> I'm not trying to delay this patch - I fully believe in supporting
> zswap for larger pages (both mTHP and THP - whatever the memory
> reclaim subsystem throws at us).
>
> But we need to get to the bottom of this :) These are very suspicious
> and concerning data. If this is something urgent, I can live with a
> gate to enable/disable this, but I'd much prefer we understand what's
> going on here.

Agreed. I don't think merging this support is urgent, so I think we
should better understand what is happening here. If there is a problem
with how we charge compressed memory today (temporary double charges),
we need to sort this out before the mTHP support, as it will only
make things worse.

I have to admit I didn't take a deep look at the discussion and data,
so there may be other problems that I didn't notice. It seems to me
like Kanchana is doing more debugging to understand what is happening,
so that's great!

As for the patches, we should sort out the impact on a higher level
before discussing implementation details. From a quick look though it
seems like the first patch can be dropped after Usama's patches that
remove the same-filled handling from zswap land, and the last two
patches can be squashed.
Kanchana P Sridhar Aug. 24, 2024, 6:21 a.m. UTC | #10
Hi Nhat,

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Wednesday, August 21, 2024 12:08 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> Hi Nhat,
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Wednesday, August 21, 2024 7:43 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > Hi All,
> > >
> > > This patch-series enables zswap_store() to accept and store mTHP
> > > folios. The most significant contribution in this series is from the
> > > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > > migrated to v6.11-rc3 in patch 2/4 of this series.
> > >
> > > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> > >
> > > Additionally, there is an attempt to modularize some of the functionality
> > > in zswap_store(), to make it more amenable to supporting any-order
> > > mTHPs.
> > >
> > > For instance, the determination of whether a folio is same-filled is
> > > based on mapping an index into the folio to derive the page. Likewise,
> > > there is a function "zswap_store_entry" added to store a zswap_entry in
> > > the xarray.
> > >
> > > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > > "zswpout" counters that get incremented upon successful zswap_store of
> > > an mTHP folio:
> > >
> > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> > >
> > > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > > swap-out and decompress batching of swap-ins based on
> > swapin_readahead(),
> > > using Intel IAA hardware acceleration, which we would like to submit in
> > > subsequent RFC patch-series, with performance improvement data.
> > >
> > > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> > >
> > > Changes since v3:
> > > =================
> > > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> > >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> > >    changes to count_mthp_stat() so that it's always defined, even when
> THP
> > >    is disabled. Barry, I have also made one other change in page_io.c
> > >    where count_mthp_stat() is called by count_swpout_vm_event(). I
> would
> > >    appreciate it if you can review this. Thanks!
> > >    Hopefully this should resolve the kernel robot build errors.
> > >
> > > Changes since v2:
> > > =================
> > > 1) Gathered usemem data using SSD as the backing swap device for
> zswap,
> > >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> > >    review the latest data. Thanks!
> > > 2) Generated the base commit info in the patches to attempt to address
> > >    the kernel test robot build errors.
> > > 3) No code changes to the individual patches themselves.
> > >
> > > Changes since RFC v1:
> > > =====================
> > >
> > > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> > >    Thanks Barry!
> > > 2) Addressed some of the code review comments that Nhat Pham provided in
> > >    Ryan's initial RFC [1]:
> > >    - Added a comment about the cgroup zswap limit checks occurring once per
> > >      folio at the beginning of zswap_store().
> > >      Nhat, Ryan, please do let me know if the comments convey the
> summary
> > >      from the RFC discussion. Thanks!
> > >    - Posted data on running the cgroup suite's zswap kselftest.
> > > 3) Rebased to v6.11-rc3.
> > > 4) Gathered performance data with usemem and the rebased patch-
> series.
> > >
> > > Performance Testing:
> > > ====================
> > > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > > and with this patch-series, on an Intel Sapphire Rapids server,
> > > dual-socket 56 cores per socket, 4 IAA devices per socket.
> > >
> > > The system has 503 GiB RAM, with a 4G SSD as the backing swap device
> for
> > > ZSWAP. Core frequency was fixed at 2500MHz.
> > >
> > > The vm-scalability "usemem" test was run in a cgroup whose
> memory.high
> > > was fixed. Following a similar methodology as in Ryan Roberts'
> > > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > > run, each allocating and writing 1G of memory:
> > >
> > >     usemem --init-time -w -O -n 70 1g
> > >
> > > Since I was constrained to get the 70 usemem processes to generate
> > > swapout activity with the 4G SSD, I ended up using different cgroup
> > > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
> > >
> > > 64K mTHP experiments: cgroup memory fixed at 60G
> > > 2M THP experiments  : cgroup memory fixed at 55G
> > >
> > > The vm/sysfs stats included after the performance data provide details
> > > on the swapout activity to SSD/ZSWAP.
> > >
> > > Other kernel configuration parameters:
> > >
> > >     ZSWAP Compressor  : LZ4, DEFLATE-IAA
> > >     ZSWAP Allocator   : ZSMALLOC
> > >     SWAP page-cluster : 2
> > >
> > > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > > IAA "compression verification" is enabled. Hence each IAA compression
> > > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > > returned by the hardware will be compared and errors reported in case of
> > > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > > compared to the software compressors.
> > >
> > > Throughput reported by usemem and perf sys time for running the test
> > > are as follows, averaged across 3 runs:
> > >
> > >  64KB mTHP (cgroup memory.high set to 60G):
> > >  ==========================================
> > >   ------------------------------------------------------------------
> > >  |                    |                   |            |            |
> > >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> > >  |                    |                   |       KB/s |            |
> > >  |--------------------|-------------------|------------|------------|
> > >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> > >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> > >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> > >  |------------------------------------------------------------------|
> > >  |                    |                   |            |            |
> > >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> > >  |                    |                   |        sec |            |
> > >  |--------------------|-------------------|------------|------------|
> > >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> > >  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> > >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
> > >   ------------------------------------------------------------------
> >
> > Yeah no, this is not good. That throughput regression is concerning...
> >
> > Is this tied to lz4 only, or do you observe similar trends in other
> > compressors that are not deflate-iaa?
> 
> Let me gather data with other software compressors.
> 
> >
> >
> > >
> > >   -----------------------------------------------------------------------
> > >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> > >  |                              |   mainline |       Store |       Store |
> > >  |                              |            |         lz4 | deflate-iaa |
> > >  |-----------------------------------------------------------------------|
> > >  | pswpin                       |          0 |           0 |           0 |
> > >  | pswpout                      |    174,432 |           0 |           0 |
> > >  | zswpin                       |        703 |         534 |         721 |
> > >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> > >  |-----------------------------------------------------------------------|
> > >  | thp_swpout                   |          0 |           0 |           0 |
> > >  | thp_swpout_fallback          |          0 |           0 |           0 |
> > >  | pgmajfault                   |      3,364 |       3,650 |       3,431 |
> > >  |-----------------------------------------------------------------------|
> > >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> > >  |-----------------------------------------------------------------------|
> > >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> > >   -----------------------------------------------------------------------
> > >
> >
> > Yeah this is not good. Something fishy is going on, if we see this
> > ginormous jump from 175000 (z)swpout pages to almost 1.5 million
> > pages. That's a massive jump.
> >
> > Either it's:
> >
> > 1. Your theory - zswap store keeps banging on the limit (which suggests
> > incompatibility between the way zswap currently behaves and our
> > reclaim logic)
> >
> > 2. The data here is ridiculously incompressible. We're needing to
> > zswpout roughly 8.5 times the number of pages, so the saving is 8.5x
> > less => we only save 11.76% of memory for each page??? That's not
> > right...
> >
> > 3. There's an outright bug somewhere.
> >
> > Very suspicious.
> >
> > >
> > >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> > >  =======================================================
> > >   ------------------------------------------------------------------
> > >  |                    |                   |            |            |
> > >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> > >  |                    |                   |       KB/s |            |
> > >  |--------------------|-------------------|------------|------------|
> > >  |v6.11-rc3 mainline  | SSD               |    190,827 |   Baseline |
> > >  |zswap-mTHP-Store    | ZSWAP lz4         |     32,026 |       -83% |
> > >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    203,772 |         7% |
> > >  |------------------------------------------------------------------|
> > >  |                    |                   |            |            |
> > >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> > >  |                    |                   |        sec |            |
> > >  |--------------------|-------------------|------------|------------|
> > >  |v6.11-rc3 mainline  | SSD               |      27.23 |   Baseline |
> > >  |zswap-mTHP-Store    | ZSWAP lz4         |     156.52 |      -475% |
> > >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     171.45 |      -530% |
> > >   ------------------------------------------------------------------
> >
> > I'm confused. This is a *regression*, right? A massive one at that -
> > sys time is *more* than 5 times the old value?
> >
> > >
> > >   -------------------------------------------------------------------------
> > >  | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> > >  |                                |   mainline |       Store |       Store |
> > >  |                                |            |         lz4 | deflate-iaa |
> > >  |-------------------------------------------------------------------------|
> > >  | pswpin                         |          0 |           0 |           0 |
> > >  | pswpout                        |    797,184 |           0 |           0 |
> > >  | zswpin                         |        690 |         649 |         669 |
> > >  | zswpout                        |      1,465 |   1,596,382 |   1,540,766 |
> > >  |-------------------------------------------------------------------------|
> > >  | thp_swpout                     |      1,557 |           0 |           0 |
> > >  | thp_swpout_fallback            |          0 |       3,248 |       3,752 |
> >
> > This is also increased, but I suppose we're just doing more
> > (z)swapping out in general...
> >
> > >  | pgmajfault                     |      3,726 |       6,470 |       5,691 |
> > >  |-------------------------------------------------------------------------|
> > >  | hugepages-2048kB/stats/zswpout |            |       2,416 |       2,261 |
> > >  |-------------------------------------------------------------------------|
> > >  | hugepages-2048kB/stats/swpout  |      1,557 |           0 |           0 |
> > >   -------------------------------------------------------------------------
> > >
> >
> > I'm not trying to delay this patch - I fully believe in supporting
> > zswap for larger pages (both mTHP and THP - whatever the memory
> > reclaim subsystem throws at us).
> >
> > But we need to get to the bottom of this :) These are very suspicious
> > and concerning data. If this is something urgent, I can live with a
> > gate to enable/disable this, but I'd much prefer we understand what's
> > going on here.


I started out with two main hypotheses to explain why zswap incurs more
reclaim than SSD:

1) The cgroup zswap charge, which hastens the breach of the memory.high
   limit and adds to the reclaim triggered in
   mem_cgroup_handle_over_high().

2) Does a faster reclaim path somehow cause less allocation stalls; thereby
   causing more breaches of memory.high, hence more reclaim -- and does this
   cycle repeat, potentially leading to higher swapout activity with zswap?

I focused on gathering data with lz4 for this debug, under the reasonable
assumption that results with deflate-iaa will be better. Once we figure out
an overall direction on next steps, I will publish results with zswap lz4,
deflate-iaa, etc.

All experiments except "Exp 1.A" are run with
usemem --init-time -w -O -n 70 1g.

General settings for all data presented in this patch-series:

vm.swappiness = 100
zswap shrinker_enabled = N

 Experiment 1 - impact of not doing cgroup zswap charge:
 -------------------------------------------------------

I first wanted to understand how much we improve without the cgroup zswap
charge, so I commented out both the calls to obj_cgroup_charge_zswap()
and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
We improve throughput by quite a bit with this change, and are now better
than mTHP getting swapped out to SSD. We have also slightly improved on the
sys time, though this is still a regression as compared to SSD. If you
recall, we were worse on throughput and sys time with v4.

Averages over 3 runs are summarized in each case.

 Exp 1.A: usemem -n 1 58g:
 -------------------------

 64KB mTHP (cgroup memory.high set to 60G):
 ==========================================

                SSD mTHP    zswap mTHP v4   zswap mTHP no_charge
 ----------------------------------------------------------------
 pswpout          586,352                0                      0
 zswpout            1,005        1,042,963                587,181
 ----------------------------------------------------------------
 Total swapout    587,357        1,042,963                587,181
 ----------------------------------------------------------------

Without the zswap charge to cgroup, the total swapout activity for
zswap-mTHP is on par with that of SSD-mTHP for the single process case.
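As a quick cross-check of the "on par" claim (this is not part of the test
harness, just arithmetic on the table above):

```python
# Totals from the Exp 1.A table: SSD mTHP swaps out mostly to disk
# (pswpout) with a small zswpout residue; the no-charge zswap run is
# entirely zswpout.
ssd_total = 586_352 + 1_005         # pswpout + zswpout, SSD mTHP
zswap_no_charge_total = 587_181     # zswpout, zswap mTHP no_charge

assert ssd_total == 587_357
# "On par": the two totals differ by well under 0.1%.
assert abs(ssd_total - zswap_no_charge_total) / ssd_total < 0.001
```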


 Exp 1.B: usemem -n 70 1g:
 -------------------------
 v4 results with cgroup zswap charge:
 ------------------------------------

 64KB mTHP (cgroup memory.high set to 60G):
 ==========================================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
  ------------------------------------------------------------------

  -----------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 |                              |   mainline |       Store |       Store |
 |                              |            |         lz4 | deflate-iaa |
 |-----------------------------------------------------------------------|
 | pswpout                      |    174,432 |           0 |           0 |
 | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
  -----------------------------------------------------------------------

 Debug results without cgroup zswap charge in both, "Before" and "After":
 ------------------------------------------------------------------------

 64KB mTHP (cgroup memory.high set to 60G):
 ==========================================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |    300,565 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |    420,125 |        40% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | SSD               |      90.76 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     213.09 |      -135% |
  ------------------------------------------------------------------

  ---------------------------------------------------------
 | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |
 |                              |   mainline |       Store |
 |                              |            |         lz4 |
 |---------------------------------------------------------|
 | pswpout                      |    330,640 |           0 |
 | zswpout                      |      1,527 |   1,384,725 |
 |---------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |      63,335 |
 |---------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |     18,242 |           0 |
  ---------------------------------------------------------
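The "Change" columns in the two tables above can be reproduced with simple
relative-change arithmetic; a quick sanity check on the numbers as reported
(not part of the original test harness):

```python
def pct_change(new, base):
    """Relative change vs. baseline, matching the 'Change' columns."""
    return (new - base) / base * 100

# Throughput (higher is better): v4 run and no-charge run vs. SSD baseline.
assert round(pct_change(271_558, 335_346)) == -19
assert round(pct_change(420_125, 300_565)) == 40

# Sys time (lower is better); the tables report the increase as a
# negative change. The v4 figure computes to about -190.5%, reported
# as -191% in the table.
assert round(-pct_change(213.09, 90.76)) == -135
assert abs(-pct_change(265.43, 91.37) - (-191)) <= 1
```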


Based on these results, I kept the cgroup zswap charging commented out in
subsequent debug steps, so as not to place zswap at a disadvantage while
investigating the remaining causes beyond hypothesis (1).


Experiment 2 - swap latency/reclamation with 64K mTHP:
------------------------------------------------------

Number of swap_writepage    Total swap_writepage  Average swap_writepage
    calls from all cores      Latency (millisec)      Latency (microsec)
---------------------------------------------------------------------------
SSD               21,373               165,434.9                   7,740
zswap            344,109                55,446.8                     161
---------------------------------------------------------------------------
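The per-call averages in this table follow directly from the totals; a
quick arithmetic check (not part of the measurement scripts):

```python
def avg_us(total_ms, calls):
    """Average per-call latency: total (ms) / calls, converted to us."""
    return total_ms / calls * 1000

assert round(avg_us(165_434.9, 21_373)) == 7_740   # SSD swap_writepage
assert round(avg_us(55_446.8, 344_109)) == 161     # zswap swap_writepage
```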


Reclamation analysis: 64k mTHP swapout:
---------------------------------------
"Before":
  Total SSD compressed data size   =  1,362,296,832  bytes
  Total SSD write IO latency       =        887,861  milliseconds

  Average SSD compressed data size =      1,089,837  bytes
  Average SSD write IO latency     =        710,289  microseconds

"After":
  Total ZSWAP compressed pool size =  2,610,657,430  bytes
  Total ZSWAP compress latency     =         55,984  milliseconds

  Average ZSWAP compress length    =          2,055  bytes
  Average ZSWAP compress latency   =             44  microseconds

  zswap-LZ4 mTHP compression ratio =  1.99
  All moderately compressible pages. 0 zswap_store errors.                                    
  84% of pages compress to 2056 bytes.
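The quoted 1.99 compression ratio is simply the 4 KB page size over the
average compressed length; a check of that arithmetic (4096 assumed as the
page size on this x86_64 setup):

```python
PAGE_SIZE = 4096                 # bytes; assumed x86_64 base page size
avg_compressed_len = 2_055       # bytes, from the zswap stats above

ratio = PAGE_SIZE / avg_compressed_len
assert round(ratio, 2) == 1.99
```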


 Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
 ------------------------------------------------------------

 I wanted to take a step back and understand how the mainline v6.11-rc3
 handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
 swapped out to ZSWAP. Interestingly, higher swapout activity is observed
 with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
 cgroup).

 v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:

 -------------------------------------------------------------
 SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
 -------------------------------------------------------------
 cgroup memory.events:           cgroup memory.events:
 
 low                 0           low              0          0
 high            5,068           high       321,923    375,116
 max                 0           max              0          0
 oom                 0           oom              0          0
 oom_kill            0           oom_kill         0          0
 oom_group_kill      0           oom_group_kill   0          0
 -------------------------------------------------------------

 SSD (CONFIG_ZSWAP is OFF):
 --------------------------
 pswpout            415,709
 sys time (sec)      301.02
 Throughput KB/s    155,970
 memcg_high events    5,068
 --------------------------


 ZSWAP                  lz4         lz4         lz4     lzo-rle
 --------------------------------------------------------------
 zswpout          1,598,550   1,515,151   1,449,432   1,493,917
 sys time (sec)      889.36      481.21      581.22      635.75
 Throughput KB/s     35,176      14,765      20,253      21,407
 memcg_high events  321,923     412,733     369,976     375,116
 --------------------------------------------------------------

 This shows a sys time regression of -60% to -195% with zswap as compared
 to SSD with 4K folios. The higher swapout activity with zswap is seen
 here too (i.e., this doesn't appear to be mTHP-specific).
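The -60% to -195% range appears to come from the sys time column: each
zswap run compared against the single SSD run. A quick check of that range
(numbers copied from the tables above):

```python
ssd_sys = 301.02                                  # SSD, CONFIG_ZSWAP off
zswap_sys = [889.36, 481.21, 581.22, 635.75]      # lz4 x3, lzo-rle

# Sys-time regression of each zswap run relative to the SSD baseline.
regressions = [(t - ssd_sys) / ssd_sys * 100 for t in zswap_sys]

assert round(min(regressions)) == 60
assert round(max(regressions)) == 195
```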

 I verified this to be the case even with the v6.7 kernel, which also
 showed a 2.3X throughput improvement when we don't charge zswap:

 ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
 --------------------------------------------------------------------
 zswpout              1,419,802       1,398,620
 sys time (sec)           535.4          613.41
 Throughput KB/s          8,671          20,045
 memcg_high events      574,046         451,859
 --------------------------------------------------------------------


Summary from the debug:
-----------------------
1) Excess reclaim is exacerbated by the zswap charge to the cgroup. Without
   the charge, reclaim is on par with SSD for mTHP in the single-process
   case. The excess reclaim with multiple processes seems most likely to
   result from over-reclaim done concurrently by the cores, in their
   respective calls to mem_cgroup_handle_over_high().

2) The higher swapout activity with zswap as compared to SSD does not
   appear to be specific to mTHP. Higher reclaim activity and sys time
   regression with zswap (as compared to a setup where there is only SSD
   configured as swap) exists with 4K pages as far back as v6.7.

3) The debug indicates the hypothesis (2) is worth more investigation:
   Does a faster reclaim path somehow cause less allocation stalls; thereby
   causing more breaches of memory.high, hence more reclaim -- and does this
   cycle repeat, potentially leading to higher swapout activity with zswap?
   Any advice on whether this is a possibility, and suggestions/pointers to
   verify it, would be greatly appreciated.

4) Interestingly, the # of memcg_high events reduces significantly with 64K
   mTHP as compared to the above 4K high-events data, when tested with v4
   and no zswap charge: 3,069 (SSD-mTHP) vs. 19,656 (ZSWAP-mTHP). This
   potentially indicates that allocation efficiency with mTHP counters
   some of the higher reclaim that seems to be caused by swapout
   efficiency.

5) Nhat, Yosry: would it be possible for you to run the 4K folios
   usemem -n 70 1g (with 60G memory.high) experiment in your setup, with a
   4G and some higher-capacity SSD swap configuration, on say v6.11-rc3? I
   would like to rule out the memory-constrained 4G SSD in my setup somehow
   skewing the behavior of zswap vis-a-vis
   allocation/mem_cgroup_handle_over_high()/reclaim. I realize your time is
   valuable; however, an independent confirmation of what I have been
   observing would be really helpful for us to figure out potential
   root causes and solutions.

6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high() to
   break out of the loop if we have reclaimed a total of at least
   "nr_pages":

	nr_reclaimed = reclaim_high(memcg,
				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
				    gfp_mask);

+	nr_reclaimed_total += nr_reclaimed;
+
+	if (nr_reclaimed_total >= nr_pages)
+		goto out;


   This was only for debug purposes, and did seem to mitigate the higher
   reclaim behavior for 4K folios:
   
 ZSWAP                  lz4             lz4             lz4
 ----------------------------------------------------------
 zswpout          1,305,367       1,349,195       1,529,235
 sys time (sec)      472.06          507.76          646.39
 Throughput KB/s     55,144          21,811          88,310
 memcg_high events  257,890         343,213         172,351
 ----------------------------------------------------------

On average, this change results in 17% improvement in sys time, 2.35X
improvement in throughput and 30% fewer memcg_high events.
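Assuming the baseline for these averages is the three ZSWAP-lz4 runs from
Experiment 3, the 17% / 2.35X / 30% figures can be reproduced as follows
(a sanity check, not part of the original scripts):

```python
from statistics import mean

# Baseline: the three ZSWAP lz4 runs from Experiment 3 (assumed).
base_sys  = [889.36, 481.21, 581.22]
base_tput = [35_176, 14_765, 20_253]
base_high = [321_923, 412_733, 369_976]

# With the early-break change in mem_cgroup_handle_over_high().
fix_sys  = [472.06, 507.76, 646.39]
fix_tput = [55_144, 21_811, 88_310]
fix_high = [257_890, 343_213, 172_351]

sys_gain    = (mean(base_sys) - mean(fix_sys)) / mean(base_sys) * 100
tput_factor = mean(fix_tput) / mean(base_tput)
high_drop   = (mean(base_high) - mean(fix_high)) / mean(base_high) * 100

assert round(sys_gain) == 17          # 17% sys time improvement
assert round(tput_factor, 2) == 2.35  # 2.35X throughput
assert round(high_drop) == 30         # 30% fewer memcg_high events
```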

I look forward to further inputs on next steps.

Thanks,
Kanchana


> 
> Thanks for this analysis. I will debug this some more, so we can better
> understand these results.
> 
> Thanks,
> Kanchana
Kanchana P Sridhar Aug. 24, 2024, 6:24 a.m. UTC | #11
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, August 23, 2024 8:10 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> [..]
> >
> > I'm not trying to delay this patch - I fully believe in supporting
> > zswap for larger pages (both mTHP and THP - whatever the memory
> > reclaim subsystem throws at us).
> >
> > But we need to get to the bottom of this :) These are very suspicious
> > and concerning data. If this is something urgent, I can live with a
> > gate to enable/disable this, but I'd much prefer we understand what's
> > going on here.
> 
> Agreed. I don't think merging this support is urgent, so I think we
> should better understand what is happening here. If there is a problem
> with how we charge compressed memory today (temporary double charges),
> we need to sort this out before the mTHP support, as it will only
> make things worse.
> 
> I have to admit I didn't take a deep look at the discussion and data,
> so there may be other problems that I didn't notice. It seems to me
> like Kanchana is doing more debugging to understand what is happening,
> so that's great!

This sounds good. I just shared the data and my learnings from some
debugging experiments. I would appreciate it if you can review this and
suggest next steps.

> 
> As for the patches, we should sort out the impact on a higher level
> before discussing implementation details. From a quick look though it
> seems like the first patch can be dropped after Usama's patches that
> remove the same-filled handling from zswap land, and the last two
> patches can be squashed.

Sure, this sounds good.

Thanks,
Kanchana
Nhat Pham Aug. 26, 2024, 2:12 p.m. UTC | #12
On Fri, Aug 23, 2024 at 11:21 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Nhat,
>
>
> I started out with 2 main hypotheses to explain why zswap incurs more
> reclaim wrt SSD:
>
> 1) The cgroup zswap charge, that hastens the memory.high limit to be
>    breached, and adds to the reclaim being triggered in
>    mem_cgroup_handle_over_high().
>
> 2) Does a faster reclaim path somehow cause less allocation stalls; thereby
>    causing more breaches of memory.high, hence more reclaim -- and does this
>    cycle repeat, potentially leading to higher swapout activity with zswap?

By faster reclaim path, do you mean zswap has a lower reclaim latency?

>
> I focused on gathering data with lz4 for this debug, under the reasonable
> assumption that results with deflate-iaa will be better. Once we figure out
> an overall direction on next steps, I will publish results with zswap lz4,
> deflate-iaa, etc.
>
> All experiments except "Exp 1.A" are run with
> usemem --init-time -w -O -n 70 1g.
>
> General settings for all data presented in this patch-series:
>
> vm.swappiness = 100
> zswap shrinker_enabled = N
>
>  Experiment 1 - impact of not doing cgroup zswap charge:
>  -------------------------------------------------------
>
> I wanted to first understand by how much we improve without the cgroup
> zswap charge. I commented out both, the calls to obj_cgroup_charge_zswap()
> and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
> We improve throughput by quite a bit with this change, and are now better
> than mTHP getting swapped out to SSD. We have also slightly improved on the
> sys time, though this is still a regression as compared to SSD. If you
> recall, we were worse on throughput and sys time with v4.

I'm not 100% sure about the validity of this pair of experiments.

The thing is, you cannot ignore zswap's memory footprint altogether.
That's the whole point of the trade-off. It's probably gigabytes worth
of unaccounted memory usage - I see that your SSD size is 4G, and
since the compression ratio is less than 2, that's potentially 2G worth
of memory, give or take, that you are not charging to the cgroup, which
can altogether alter the memory pressure and reclaim dynamics.

The zswap charging itself is not the problem - that's fair and
healthy. It might be the overreaction by the memory reclaim subsystem
that seems anomalous?

>
> Averages over 3 runs are summarized in each case.
>
>  Exp 1.A: usemem -n 1 58g:
>  -------------------------
>
>  64KB mTHP (cgroup memory.high set to 60G):
>  ==========================================
>
>                 SSD mTHP    zswap mTHP v4   zswap mTHP no_charge
>  ----------------------------------------------------------------
>  pswpout          586,352                0                      0
>  zswpout            1,005        1,042,963                587,181
>  ----------------------------------------------------------------
>  Total swapout    587,357        1,042,963                587,181
>  ----------------------------------------------------------------
>
> Without the zswap charge to cgroup, the total swapout activity for
> zswap-mTHP is on par with that of SSD-mTHP for the single process case.
>
>
>  Exp 1.B: usemem -n 70 1g:
>  -------------------------
>  v4 results with cgroup zswap charge:
>  ------------------------------------
>
>  64KB mTHP (cgroup memory.high set to 60G):
>  ==========================================
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
>  |zswap-mTHP=Store    | ZSWAP lz4         |     265.43 |      -191% |
>   ------------------------------------------------------------------
>
>   -----------------------------------------------------------------------
>  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
>  |                              |   mainline |       Store |       Store |
>  |                              |            |         lz4 | deflate-iaa |
>  |-----------------------------------------------------------------------|
>  | pswpout                      |    174,432 |           0 |           0 |
>  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
>   -----------------------------------------------------------------------
>
>  Debug results without cgroup zswap charge in both, "Before" and "After":
>  ------------------------------------------------------------------------
>
>  64KB mTHP (cgroup memory.high set to 60G):
>  ==========================================
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |    300,565 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |    420,125 |        40% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | SSD               |      90.76 |   Baseline |
>  |zswap-mTHP=Store    | ZSWAP lz4         |     213.09 |      -135% |
>   ------------------------------------------------------------------
>
>   ---------------------------------------------------------
>  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |
>  |                              |   mainline |       Store |
>  |                              |            |         lz4 |
>  |----------------------------------------------------------
>  | pswpout                      |    330,640 |           0 |
>  | zswpout                      |      1,527 |   1,384,725 |
>  |----------------------------------------------------------
>  | hugepages-64kB/stats/zswpout |            |      63,335 |
>  |----------------------------------------------------------
>  | hugepages-64kB/stats/swpout  |     18,242 |           0 |
>   ---------------------------------------------------------
>

Hmm, in the 70 processes case, it looks like we're still seeing
latency regression, and that same pattern of overreclaiming, even
without zswap cgroup charging?

That seems like a hint - concurrency exacerbates the problem?

>
> Based on these results, I kept the cgroup zswap charging commented out in
> subsequent debug steps, so as to not place zswap at a disadvantage when
> trying to determine further causes for hypothesis (1).
>
>
> Experiment 2 - swap latency/reclamation with 64K mTHP:
> ------------------------------------------------------
>
> Number of swap_writepage    Total swap_writepage  Average swap_writepage
>     calls from all cores      Latency (millisec)      Latency (microsec)
> ---------------------------------------------------------------------------
> SSD               21,373               165,434.9                   7,740
> zswap            344,109                55,446.8                     161
> ---------------------------------------------------------------------------
>
>
> Reclamation analysis: 64k mTHP swapout:
> ---------------------------------------
> "Before":
>   Total SSD compressed data size   =  1,362,296,832  bytes
>   Total SSD write IO latency       =        887,861  milliseconds
>
>   Average SSD compressed data size =      1,089,837  bytes
>   Average SSD write IO latency     =        710,289  microseconds
>
> "After":
>   Total ZSWAP compressed pool size =  2,610,657,430  bytes
>   Total ZSWAP compress latency     =         55,984  milliseconds
>
>   Average ZSWAP compress length    =          2,055  bytes
>   Average ZSWAP compress latency   =             44  microseconds
>
>   zswap-LZ4 mTHP compression ratio =  1.99
>   All moderately compressible pages. 0 zswap_store errors.
>   84% of pages compress to 2056 bytes.

Hmm, this ratio isn't very good indeed - it is less than a 2-to-1 memory
saving...

Internally, we often see a 3-to-1 or 4-to-1 saving ratio (or even more).

Probably does not explain everything, but worth double checking -
could you check with zstd to see if the ratio improves?

>
>
>  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
>  ------------------------------------------------------------
>
>  I wanted to take a step back and understand how the mainline v6.11-rc3
>  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
>  swapped out to ZSWAP. Interestingly, higher swapout activity is observed
>  with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
>  cgroup).
>
>  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
>
>  -------------------------------------------------------------
>  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
>  -------------------------------------------------------------
>  cgroup memory.events:           cgroup memory.events:
>
>  low                 0           low              0          0
>  high            5,068           high       321,923    375,116
>  max                 0           max              0          0
>  oom                 0           oom              0          0
>  oom_kill            0           oom_kill         0          0
>  oom_group_kill      0           oom_group_kill   0          0
>  -------------------------------------------------------------
>
>  SSD (CONFIG_ZSWAP is OFF):
>  --------------------------
>  pswpout            415,709
>  sys time (sec)      301.02
>  Throughput KB/s    155,970
>  memcg_high events    5,068
>  --------------------------
>
>
>  ZSWAP                  lz4         lz4         lz4     lzo-rle
>  --------------------------------------------------------------
>  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
>  sys time (sec)      889.36      481.21      581.22      635.75
>  Throughput KB/s     35,176      14,765      20,253      21,407
>  memcg_high events  321,923     412,733     369,976     375,116
>  --------------------------------------------------------------
>
>  This shows that there is a performance regression of -60% to -195% with
>  zswap as compared to SSD with 4K folios. The higher swapout activity with
>  zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
>
>  I verified this to be the case even with the v6.7 kernel, which also
>  showed a 2.3X throughput improvement when we don't charge zswap:
>
>  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
>  --------------------------------------------------------------------
>  zswpout              1,419,802       1,398,620
>  sys time (sec)           535.4          613.41

sys time increases without zswap cgroup charging? That's strange...

>  Throughput KB/s          8,671          20,045
>  memcg_high events      574,046         451,859

So, on 4k folio setup, even without cgroup charge, we are still seeing:

1. More zswpout (than observed in SSD)
2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
3. 100 times the number of memcg_high events? This is perhaps the
*strangest* part to me. You're already removing the zswap cgroup charge,
so where does this come from? How can we have memory.high violations when
zswap does *not* contribute to memory usage?

Is this due to swap limit charging? Do you have a cgroup swap limit?

mem_high = page_counter_read(&memcg->memory) >
           READ_ONCE(memcg->memory.high);
swap_high = page_counter_read(&memcg->swap) >
           READ_ONCE(memcg->swap.high);
[...]

if (mem_high || swap_high) {
    /*
    * The allocating tasks in this cgroup will need to do
    * reclaim or be throttled to prevent further growth
    * of the memory or swap footprints.
    *
    * Target some best-effort fairness between the tasks,
    * and distribute reclaim work and delay penalties
    * based on how much each task is actually allocating.
    */
    current->memcg_nr_pages_over_high += batch;
    set_notify_resume(current);
    break;
}


>  --------------------------------------------------------------------
>
>
> Summary from the debug:
> -----------------------
> 1) Excess reclaim is exacerbated by zswap charge to cgroup. Without the
>    charge, reclaim is on par with SSD for mTHP in the single process
>    case. The multiple process excess reclaim seems to be most likely
>    resulting from over-reclaim done by the cores, in their respective calls
>    to mem_cgroup_handle_over_high().

Exacerbate, yes. But I'm not 100% sure it's the sole or even the main cause.

You still see a degree of overreclaiming without zswap cgroup charging in:

1. 70 processes, with mTHP
2. 70 processes, with 4K folios.

>
> 2) The higher swapout activity with zswap as compared to SSD does not
>    appear to be specific to mTHP. Higher reclaim activity and sys time
>    regression with zswap (as compared to a setup where there is only SSD
>    configured as swap) exists with 4K pages as far back as v6.7.

Yeah I can believe that without mthp, the same-ish workload would
cause the same regression.

>
> 3) The debug indicates the hypothesis (2) is worth more investigation:
>    Does a faster reclaim path somehow cause less allocation stalls; thereby
>    causing more breaches of memory.high, hence more reclaim -- and does this
>    cycle repeat, potentially leading to higher swapout activity with zswap?
>    Any advise on this being a possibility, and suggestions/pointers to
>    verify this, would be greatly appreciated.

Add stalls along the zswap path? :)

>
> 4) Interestingly, the # of memcg_high events reduces significantly with 64K
>    mTHP as compared to the above 4K high events data, when tested with v4
>    and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This
>    potentially indicates something to do with allocation efficiency
>    countering the higher reclaim that seems to be caused by swapout
>    efficiency.
>
> 5) Nhat, Yosry: would it be possible for you to run the 4K folios
>    usemem -n 70 1g (with 60G memory.high) expmnt with 4G and some higher
>    value SSD configuration in your setup and say, v6.11-rc3. I would like
>    to rule out the memory constrained 4G SSD in my setup somehow skewing
>    the behavior of zswap vis-a-vis
>    allocation/memcg_handle_over_high/reclaim. I realize your time is
>    valuable, however I think an independent confirmation of what I have
>    been observing, would be really helpful for us to figure out potential
>    root-causes and solutions.

It might take a while for me to set up your benchmark, but yeah a 4G
swapfile seems small on a 64G host - of course it depends on the
workload, but this one has a lot of memory usage. In fact the total
memory usage (70G?) is slightly above memory.high + the 4G swapfile -
note that this is exacerbated by, once again, zswap's less-than-100%
memory saving ratio.

>
> 6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high() to
>    break out of the loop if we have reclaimed a total of at least
>    "nr_pages":
>
>         nr_reclaimed = reclaim_high(memcg,
>                                     in_retry ? SWAP_CLUSTER_MAX : nr_pages,
>                                     gfp_mask);
>
> +       nr_reclaimed_total += nr_reclaimed;
> +
> +       if (nr_reclaimed_total >= nr_pages)
> +               goto out;
>
>
>    This was only for debug purposes, and did seem to mitigate the higher
>    reclaim behavior for 4K folios:
>
>  ZSWAP                  lz4             lz4             lz4
>  ----------------------------------------------------------
>  zswpout          1,305,367       1,349,195       1,529,235
>  sys time (sec)      472.06          507.76          646.39
>  Throughput KB/s     55,144          21,811          88,310
>  memcg_high events  257,890         343,213         172,351
>  ----------------------------------------------------------
>
> On average, this change results in 17% improvement in sys time, 2.35X
> improvement in throughput and 30% fewer memcg_high events.
>
> I look forward to further inputs on next steps.
>
> Thanks,
> Kanchana
>
>
> >
> > Thanks for this analysis. I will debug this some more, so we can better
> > understand these results.
> >
> > Thanks,
> > Kanchana
Kanchana P Sridhar Aug. 27, 2024, 6:08 a.m. UTC | #13
Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Monday, August 26, 2024 7:12 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Fri, Aug 23, 2024 at 11:21 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Nhat,
> >
> >
> > I started out with 2 main hypotheses to explain why zswap incurs more
> > reclaim wrt SSD:
> >
> > 1) The cgroup zswap charge, that hastens the memory.high limit to be
> >    breached, and adds to the reclaim being triggered in
> >    mem_cgroup_handle_over_high().
> >
> > 2) Does a faster reclaim path somehow cause less allocation stalls; thereby
> >    causing more breaches of memory.high, hence more reclaim -- and does this
> >    cycle repeat, potentially leading to higher swapout activity with zswap?
> 
> By faster reclaim path, do you mean zswap has a lower reclaim latency?

Thanks for your follow-up comments/suggestions. Yes, I was characterizing
lower zswap reclaim latency as faster reclaim path.

> 
> >
> > I focused on gathering data with lz4 for this debug, under the reasonable
> > assumption that results with deflate-iaa will be better. Once we figure out
> > an overall direction on next steps, I will publish results with zswap lz4,
> > deflate-iaa, etc.
> >
> > All experiments except "Exp 1.A" are run with
> > usemem --init-time -w -O -n 70 1g.
> >
> > General settings for all data presented in this patch-series:
> >
> > vm.swappiness = 100
> > zswap shrinker_enabled = N
> >
> >  Experiment 1 - impact of not doing cgroup zswap charge:
> >  -------------------------------------------------------
> >
> > I wanted to first understand by how much we improve without the cgroup
> > zswap charge. I commented out both, the calls to
> obj_cgroup_charge_zswap()
> > and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
> > We improve throughput by quite a bit with this change, and are now better
> > than mTHP getting swapped out to SSD. We have also slightly improved on the
> > sys time, though this is still a regression as compared to SSD. If you
> > recall, we were worse on throughput and sys time with v4.
> 
> I'm not 100% sure about the validity of this pair of experiments.
> 
> The thing is, you cannot ignore zswap's memory footprint altogether.
> That's the whole point of the trade-off. It's probably gigabytes worth
> of unaccounted memory usage - I see that your SSD size is 4G, and
> since compression ratio is less than 2, that's potentially 2G worth of
> memory give or take you are not charging to the cgroup, which can
> altogether alter the memory pressure and reclaim dynamics.

I agree, the zswap memory utilization charging to the cgroup is the right
thing to do (assuming we solve the temporary double-charging, as Yosry
and you have pointed out). I have summarized the zswap memory footprint
with different compressors in the results towards the end of this email.

> 
> The zswap charging itself is not the problem - that's fair and
> healthy. It might be the overreaction by the memory reclaim subsystem
> that seems anomalous?

I think so too, about the anomalous behavior.

> 
> >
> > Averages over 3 runs are summarized in each case.
> >
> >  Exp 1.A: usemem -n 1 58g:
> >  -------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >
> >                 SSD mTHP    zswap mTHP v4   zswap mTHP no_charge
> >  ----------------------------------------------------------------
> >  pswpout          586,352                0                      0
> >  zswpout            1,005        1,042,963                587,181
> >  ----------------------------------------------------------------
> >  Total swapout    587,357        1,042,963                587,181
> >  ----------------------------------------------------------------
> >
> > Without the zswap charge to cgroup, the total swapout activity for
> > zswap-mTHP is on par with that of SSD-mTHP for the single process case.
> >
> >
> >  Exp 1.B: usemem -n 70 1g:
> >  -------------------------
> >  v4 results with cgroup zswap charge:
> >  ------------------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> >   ------------------------------------------------------------------
> >
> >   -----------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  |                              |   mainline |       Store |       Store |
> >  |                              |            |         lz4 | deflate-iaa |
> >  |-----------------------------------------------------------------------|
> >  | pswpout                      |    174,432 |           0 |           0 |
> >  | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
> >   -----------------------------------------------------------------------
> >
> >  Debug results without cgroup zswap charge in both, "Before" and "After":
> >  ------------------------------------------------------------------------
> >
> >  64KB mTHP (cgroup memory.high set to 60G):
> >  ==========================================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |    300,565 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |    420,125 |        40% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | SSD               |      90.76 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     213.09 |      -135% |
> >   ------------------------------------------------------------------
> >
> >   ---------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |
> >  |                              |   mainline |       Store |
> >  |                              |            |         lz4 |
> >  |----------------------------------------------------------
> >  | pswpout                      |    330,640 |           0 |
> >  | zswpout                      |      1,527 |   1,384,725 |
> >  |----------------------------------------------------------
> >  | hugepages-64kB/stats/zswpout |            |      63,335 |
> >  |----------------------------------------------------------
> >  | hugepages-64kB/stats/swpout  |     18,242 |           0 |
> >   ---------------------------------------------------------
> >
> 
> Hmm, in the 70 processes case, it looks like we're still seeing
> latency regression, and that same pattern of overreclaiming, even
> without zswap cgroup charging?
> 
> That seems like a hint - concurrency exacerbates the problem?

Agreed, that was my conclusion as well.

> 
> >
> > Based on these results, I kept the cgroup zswap charging commented out in
> > subsequent debug steps, so as to not place zswap at a disadvantage when
> > trying to determine further causes for hypothesis (1).
> >
> >
> > Experiment 2 - swap latency/reclamation with 64K mTHP:
> > ------------------------------------------------------
> >
> > Number of swap_writepage    Total swap_writepage    Average swap_writepage
> >     calls from all cores      Latency (millisec)      Latency (microsec)
> > ---------------------------------------------------------------------------
> > SSD               21,373               165,434.9                   7,740
> > zswap            344,109                55,446.8                     161
> > ---------------------------------------------------------------------------
> >
> >
> > Reclamation analysis: 64k mTHP swapout:
> > ---------------------------------------
> > "Before":
> >   Total SSD compressed data size   =  1,362,296,832  bytes
> >   Total SSD write IO latency       =        887,861  milliseconds
> >
> >   Average SSD compressed data size =      1,089,837  bytes
> >   Average SSD write IO latency     =        710,289  microseconds
> >
> > "After":
> >   Total ZSWAP compressed pool size =  2,610,657,430  bytes
> >   Total ZSWAP compress latency     =         55,984  milliseconds
> >
> >   Average ZSWAP compress length    =          2,055  bytes
> >   Average ZSWAP compress latency   =             44  microseconds
> >
> >   zswap-LZ4 mTHP compression ratio =  1.99
> >   All moderately compressible pages. 0 zswap_store errors.
> >   84% of pages compress to 2056 bytes.
> 
> Hmm this ratio isn't very good indeed - it is less than 2-to-1 memory saving...
> 
> Internally, we often see 1-3 or 1-4 saving ratio (or even more).

Agree with this as well. In our experiments with other workloads, we
typically see much higher ratios.

> 
> Probably does not explain everything, but worth double checking -
> could you check with zstd to see if the ratio improves.

Sure. I gathered ratio and compressed memory footprint data today with
64K mTHP, the 4G SSD swapfile and different zswap compressors.

 This patch-series and no zswap charging, 64K mTHP:
---------------------------------------------------------------------------
                       Total         Total     Average      Average   Comp
                  compressed   compression  compressed  compression  ratio
                      length       latency      length      latency
                       bytes  milliseconds       bytes  nanoseconds
---------------------------------------------------------------------------
SSD (no zswap) 1,362,296,832       887,861  
lz4            2,610,657,430        55,984       2,055      44,065    1.99
zstd             729,129,528        50,986         565      39,510    7.25
deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
---------------------------------------------------------------------------

zstd does very well on ratio, as expected.

> 
> >
> >
> >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> >  ------------------------------------------------------------
> >
> >  I wanted to take a step back and understand how the mainline v6.11-rc3
> >  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and
> when
> >  swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> >  with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> >  cgroup).
> >
> >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> >
> >  -------------------------------------------------------------
> >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> >  -------------------------------------------------------------
> >  cgroup memory.events:           cgroup memory.events:
> >
> >  low                 0           low              0          0
> >  high            5,068           high       321,923    375,116
> >  max                 0           max              0          0
> >  oom                 0           oom              0          0
> >  oom_kill            0           oom_kill         0          0
> >  oom_group_kill      0           oom_group_kill   0          0
> >  -------------------------------------------------------------
> >
> >  SSD (CONFIG_ZSWAP is OFF):
> >  --------------------------
> >  pswpout            415,709
> >  sys time (sec)      301.02
> >  Throughput KB/s    155,970
> >  memcg_high events    5,068
> >  --------------------------
> >
> >
> >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> >  --------------------------------------------------------------
> >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> >  sys time (sec)      889.36      481.21      581.22      635.75
> >  Throughput KB/s     35,176      14,765      20,253      21,407
> >  memcg_high events  321,923     412,733     369,976     375,116
> >  --------------------------------------------------------------
> >
> >  This shows that there is a performance regression of -60% to -195% with
> >  zswap as compared to SSD with 4K folios. The higher swapout activity with
> >  zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> >
> >  I verified this to be the case even with the v6.7 kernel, which also
> >  showed a 2.3X throughput improvement when we don't charge zswap:
> >
> >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> >  --------------------------------------------------------------------
> >  zswpout              1,419,802       1,398,620
> >  sys time (sec)           535.4          613.41
> 
> systime increases without zswap cgroup charging? That's strange...

Additional data gathered with v6.11-rc3 (listed below) based on your suggestion
to investigate potential swap.high breaches should hopefully provide some
explanation.

> 
> >  Throughput KB/s          8,671          20,045
> >  memcg_high events      574,046         451,859
> 
> So, on 4k folio setup, even without cgroup charge, we are still seeing:
> 
> 1. More zswpout (than observed in SSD)
> 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
> 3. 100 times the amount of memcg_high events? This is perhaps the
> *strangest* to me. You're already removing zswap cgroup charging, then
> where does this comes from? How can we have memory.high violation when
> zswap does *not* contribute to memory usage?
> 
> Is this due to swap limit charging? Do you have a cgroup swap limit?
> 
> mem_high = page_counter_read(&memcg->memory) >
>            READ_ONCE(memcg->memory.high);
> swap_high = page_counter_read(&memcg->swap) >
>            READ_ONCE(memcg->swap.high);
> [...]
> 
> if (mem_high || swap_high) {
>     /*
>     * The allocating tasks in this cgroup will need to do
>     * reclaim or be throttled to prevent further growth
>     * of the memory or swap footprints.
>     *
>     * Target some best-effort fairness between the tasks,
>     * and distribute reclaim work and delay penalties
>     * based on how much each task is actually allocating.
>     */
>     current->memcg_nr_pages_over_high += batch;
>     set_notify_resume(current);
>     break;
> }
> 

I don't have a swap.high limit set on the cgroup; it is set to "max".

I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
zswap compressors to verify if swap.high is breached with the 4G SSD swapfile.

 SSD (CONFIG_ZSWAP is OFF):

                                SSD          SSD          SSD
 ------------------------------------------------------------ 
 pswpout                    415,709    1,032,170      636,582
 sys time (sec)              301.02       328.15       306.98
 Throughput KB/s            155,970       89,621      122,219
 memcg_high events            5,068       15,072        8,344
 memcg_swap_high events           0            0            0
 memcg_swap_fail events           0            0            0
 ------------------------------------------------------------
                                    
 ZSWAP                               zstd         zstd       zstd
 ----------------------------------------------------------------
 zswpout                        1,391,524    1,382,965  1,417,307
 sys time (sec)                    474.68       568.24     489.80
 Throughput KB/s                   26,099       23,404    111,115
 memcg_high events                335,112      340,335    162,260
 memcg_swap_high events                 0            0          0
 memcg_swap_fail events         1,226,899    5,742,153
  (mem_cgroup_try_charge_swap)
 memcg_memory_stat_pgactivate   1,259,547
  (shrink_folio_list)
 ----------------------------------------------------------------

 ZSWAP                      lzo-rle      lzo-rle     lzo-rle
 -----------------------------------------------------------
 zswpout                  1,493,917    1,363,040   1,428,133
 sys time (sec)              635.75       498.63      484.65
 Throughput KB/s             21,407       23,827      20,237
 memcg_high events          375,116      352,814     373,667
 memcg_swap_high events           0            0           0
 memcg_swap_fail events     715,211      
 -----------------------------------------------------------
                                    
 ZSWAP                         lz4         lz4        lz4          lz4
 ---------------------------------------------------------------------
 zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
 sys time (sec)             495.45      889.36      481.21      581.22
 Throughput KB/s            26,248      35,176      14,765      20,253
 memcg_high events         347,209     321,923     412,733     369,976
 memcg_swap_high events          0           0           0           0
 memcg_swap_fail events    580,103           0 
 ---------------------------------------------------------------------

 ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
 ----------------------------------------------------------------
 zswpout                    380,471     1,440,902      1,397,965
 sys time (sec)              329.06        570.77         467.41
 Throughput KB/s            283,867        28,403        190,600
 memcg_high events            5,551       422,831         28,154
 memcg_swap_high events           0             0              0
 memcg_swap_fail events           0     2,686,758        438,562
 ----------------------------------------------------------------

There are no swap.high memcg events recorded in any of the SSD/zswap
experiments. However, I do see a significant number of memcg_swap_fail
events in some of the zswap runs, for all 3 compressors. This is not
consistent, because there are some runs with 0 memcg_swap_fail for all
compressors.

There is a possible correlation between memcg_swap_fail events
(/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
events. The root cause appears to be that there are no available swap
slots: memcg_swap_fail is incremented, add_to_swap() fails in
shrink_folio_list(), followed by "activate_locked:" for the folio.
The folio re-activation is recorded in the cgroup memory.stat pgactivate
events. The failure to swap out folios due to lack of swap slots could
contribute towards memory.high breaches.

swp_entry_t folio_alloc_swap(struct folio *folio)
{
...
	/* On failure to allocate a slot, entry.val remains 0. */
	get_swap_pages(1, &entry, 0);
out:
	if (mem_cgroup_try_charge_swap(folio, entry)) {
		put_swap_folio(folio, entry);
		entry.val = 0;
	}
	return entry;
}

int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
{
...
	/* Debug instrumentation: count slot-allocation failures
	 * (entry.val == 0) as MEMCG_SWAP_FAIL, and warn once. */
	if (!entry.val) {
		WARN_ONCE(1, "__mem_cgroup_try_charge_swap: MEMCG_SWAP_FAIL entry.val is 0");
		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
		return 0;
	}

...
}

This is the call stack (v6.11-rc3 mainline) as reference for the above
analysis:

[  109.130504] __mem_cgroup_try_charge_swap: MEMCG_SWAP_FAIL entry.val is 0
[  109.130515] WARNING: CPU: 143 PID: 5200 at mm/memcontrol.c:5011 __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130652] RIP: 0010:__mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130682] Call Trace:
[  109.130686]  <TASK>
[  109.130689] ? __warn (kernel/panic.c:735) 
[  109.130695] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130698] ? report_bug (lib/bug.c:201 lib/bug.c:219) 
[  109.130705] ? prb_read_valid (kernel/printk/printk_ringbuffer.c:2183) 
[  109.130710] ? handle_bug (arch/x86/kernel/traps.c:239) 
[  109.130715] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1)) 
[  109.130718] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) 
[  109.130722] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130725] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3)) 
[  109.130728] folio_alloc_swap (mm/swap_slots.c:348) 
[  109.130734] add_to_swap (mm/swap_state.c:189) 
[  109.130737] shrink_folio_list (mm/vmscan.c:1235) 
[  109.130744] ? __mod_zone_page_state (mm/vmstat.c:367) 
[  109.130748] ? isolate_lru_folios (mm/vmscan.c:1598 mm/vmscan.c:1736) 
[  109.130753] shrink_inactive_list (./include/linux/spinlock.h:376 mm/vmscan.c:1961) 
[  109.130758] shrink_lruvec (mm/vmscan.c:2194 mm/vmscan.c:5706) 
[  109.130763] shrink_node (mm/vmscan.c:5910 mm/vmscan.c:5948) 
[  109.130768] do_try_to_free_pages (mm/vmscan.c:6134 mm/vmscan.c:6254) 
[  109.130772] try_to_free_mem_cgroup_pages (./include/linux/sched/mm.h:355 ./include/linux/sched/mm.h:456 mm/vmscan.c:6588) 
[  109.130778] reclaim_high (mm/memcontrol.c:1906) 
[  109.130783] mem_cgroup_handle_over_high (./include/linux/memcontrol.h:556 mm/memcontrol.c:2001 mm/memcontrol.c:2108) 
[  109.130787] irqentry_exit_to_user_mode (./include/linux/resume_user_mode.h:60 kernel/entry/common.c:114 ./include/linux/entry-common.h:328 kernel/entry/common.c:231) 
[  109.130792] asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) 


However, this is probably not the only cause for either the high # of
memory.high breaches or the over-reclaim with zswap, as seen in the lz4
data where the # of memcg_high events is significant even in cases where
there are no memcg_swap_fails.

Some observations/questions based on the above 4K folios swapout data:

1) There are more memcg_high events as the swapout latency reduces
   (i.e. faster swap-write path). This is even without charging zswap
   utilization to the cgroup.

2) There appears to be a direct correlation between higher # of
   memcg_swap_fail events, and an increase in memcg_high breaches and
   reduction in usemem throughput. This combined with the observation in
   (1) suggests that with a faster compressor, we need more swap slots,
   that increases the probability of running out of swap slots with the 4G
   SSD backing device.

3) Could the data shared earlier on reduction in memcg_high breaches with
   64K mTHP swapout provide some more clues, if we agree with (1) and (2):

   "Interestingly, the # of memcg_high events reduces significantly with 64K
   mTHP as compared to the above 4K memcg_high events data, when tested
   with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."

4) In the case of each zswap compressor, there are some runs that go
   through with 0 memcg_swap_fail events. These runs generally have
   fewer memcg_high breaches and better sys time/throughput.

5) For a given swap setup, there is some amount of variance in
   sys time for this workload.

6) All this suggests that the primary root cause is the concurrency setup,
   where there could be randomness between runs as to the number of
   processes that observe the memory.high breach, influenced by other
   factors such as the availability of swap slots for allocation.

To summarize, I believe the root cause is the 4G SSD swapfile running out
of swap slots, combined with anomalous over-reclaim behavior when 70
concurrent processes work within the 60G memory limit while trying to
allocate 1G each; with randomness in how processes react to the breach.

The cgroup zswap charging exacerbates this situation, but is not a problem
in and of itself.

Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
doesn't seem to indicate any specific problems to be solved, other than the
temporary cgroup zswap double-charging.

Would it be fair to evaluate this patch-series based on a more realistic
swapfile configuration based on 176G ZRAM, for which I had shared the data
in v2? There weren't any problems with swap slots availability or any
anomalies that I can think of with this setup, other than the fact that the
"Before" and "After" sys times could not be directly compared for 2 key
reasons:

 - ZRAM compressed data is not charged to the cgroup, similar to SSD.
 - ZSWAP compressed data is charged to the cgroup.

This disparity causes fewer swapouts, better sys time/throughput in the
"Before" experiments.

In the "After" experiments, this disparity causes more swapouts only with
zswap-lz4 due to the poorer compression ratio combined with the cgroup
charge; and hence a regression in sys time/throughput.

However, the better compression ratio with deflate-iaa results in
comparable # of swapouts as "Before", with better sys time/throughput.

My main rationale for suggesting the v2 ZRAM swapfile data is that the
disparities are the same as with the 4G SSD swapfile, but there are no anomalies,
with reasonable explanations for the data.

I would appreciate everyone's thoughts on this. If this sounds OK, then I
can submit a v5 with the changes suggested by Yosry.

I am listing here the v2 data with 176G ZRAM swapfile again, just for reference.

v2 data with cgroup zswap charging:
-----------------------------------

 64KB mTHP:
 ==========
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    176,210 |        48% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |   1,032.20 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,854.51 |       -80% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     582.71 |        44% |
  ------------------------------------------------------------------

  -----------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP stats,   |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:             |   mainline |       Store |       Store |
 |                              |            |         lz4 | deflate-iaa |
 |-----------------------------------------------------------------------|
 | pswpin                       |         16 |           0 |           0 |
 | pswpout                      |  7,770,720 |           0 |           0 |
 | zswpin                       |        547 |         695 |         579 |
 | zswpout                      |      1,394 |  15,462,778 |   7,284,554 |
 |-----------------------------------------------------------------------|
 | thp_swpout                   |          0 |           0 |           0 |
 | thp_swpout_fallback          |          0 |           0 |           0 |
 | pgmajfault                   |      3,786 |       3,541 |       3,367 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |     966,328 |     455,196 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |    485,670 |           0 |           0 |
  -----------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP:
 =======================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    177,340 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     84,030 |       -53% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    185,691 |         5% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |     876.29 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,740.55 |       -99% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     650.33 |        26% |
  ------------------------------------------------------------------
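The Change column above is computed relative to the ZRAM lzo-rle baseline, with positive values denoting improvement (higher throughput, or lower sys time). A quick sketch of the arithmetic, using the numbers from the table:

```python
# Recompute the "Change" column of the 2MB PMD-THP table above.
# Positive percentages denote improvement relative to the baseline:
# higher throughput, or lower sys time.

def throughput_change(baseline_kbps, kbps):
    """Improvement is an increase in throughput."""
    return round((kbps - baseline_kbps) / baseline_kbps * 100)

def sys_time_change(baseline_sec, sec):
    """Improvement is a decrease in sys time."""
    return round((baseline_sec - sec) / baseline_sec * 100)

# ZRAM lzo-rle baseline vs. the two ZSWAP configurations:
print(throughput_change(177_340, 84_030))    # zswap lz4:         -53
print(throughput_change(177_340, 185_691))   # zswap deflate-iaa:   5
print(sys_time_change(876.29, 1_740.55))     # zswap lz4:         -99
print(sys_time_change(876.29, 650.33))       # zswap deflate-iaa:  26
```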

  ------------------------------------------------------------------------- 
 | VMSTATS, mTHP ZSWAP stats,     |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:               |   mainline |       Store |       Store |
 |                                |            |         lz4 | deflate-iaa |
 |-------------------------------------------------------------------------|
 | pswpin                         |          0 |           0 |           0 |
 | pswpout                        |  8,628,224 |           0 |           0 |
 | zswpin                         |        678 |      22,733 |       1,641 |
 | zswpout                        |      1,481 |  14,828,597 |   9,404,937 |
 |-------------------------------------------------------------------------|
 | thp_swpout                     |     16,852 |           0 |           0 |
 | thp_swpout_fallback            |          0 |           0 |           0 |
 | pgmajfault                     |      3,467 |      25,550 |       4,800 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/zswpout |            |      28,924 |      18,366 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/swpout  |     16,852 |           0 |           0 |
  -------------------------------------------------------------------------

> 
> >  --------------------------------------------------------------------
> >
> >
> > Summary from the debug:
> > -----------------------
> > 1) Excess reclaim is exacerbated by zswap charge to cgroup. Without the
> >    charge, reclaim is on par with SSD for mTHP in the single process
> >    case. The multiple process excess reclaim seems to be most likely
> >    resulting from over-reclaim done by the cores, in their respective calls
> >    to mem_cgroup_handle_over_high().
> 
> Exacerbate, yes. I'm not 100% sure it's the sole or even the main cause.
> 
> You still see a degree of overreclaiming without zswap cgroup charging in:
> 
> 1. 70 processes, with mTHP
> 2. 70 processes, with 4K folios.

That's correct, although the over-reclaiming is not as bad with mTHP.

> 
> >
> > 2) The higher swapout activity with zswap as compared to SSD does not
> >    appear to be specific to mTHP. Higher reclaim activity and sys time
> >    regression with zswap (as compared to a setup where there is only SSD
> >    configured as swap) exists with 4K pages as far back as v6.7.
> 
> Yeah I can believe that without mthp, the same-ish workload would
> cause the same regression.

This makes sense.

> 
> >
> > 3) The debug indicates the hypothesis (2) is worth more investigation:
> >    Does a faster reclaim path somehow cause fewer allocation stalls, thereby
> >    causing more breaches of memory.high, hence more reclaim -- and does this
> >    cycle repeat, potentially leading to higher swapout activity with zswap?
> >    Any advice on this being a possibility, and suggestions/pointers to
> >    verify this, would be greatly appreciated.
> 
> Add stalls along the zswap path? :)

Yes, possibly! Hopefully, what we learned today about swap slot
availability makes things a little clearer.

> 
> >
> > 4) Interestingly, the # of memcg_high events reduces significantly with 64K
> >    mTHP as compared to the above 4K high events data, when tested with v4
> >    and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This
> >    potentially indicates something to do with allocation efficiency
> >    countering the higher reclaim that seems to be caused by swapout
> >    efficiency.
> >
> > 5) Nhat, Yosry: would it be possible for you to run the 4K folios
> >    usemem -n 70 1g (with 60G memory.high) experiment with 4G and some
> >    higher value SSD configuration in your setup and say, v6.11-rc3. I would like
> >    to rule out the memory constrained 4G SSD in my setup somehow skewing
> >    the behavior of zswap vis-a-vis
> >    allocation/memcg_handle_over_high/reclaim. I realize your time is
> >    valuable, however I think an independent confirmation of what I have
> >    been observing, would be really helpful for us to figure out potential
> >    root-causes and solutions.
> 
> It might take a while for me to set up your benchmark, but yeah a 4G
> swapfile seems small on a 64G host - of course it depends on the
> workload, but this one has a lot of memory usage. In fact the total memory
> usage (70G?) is slightly above memory.high + the 4G swapfile - note that
> this is exacerbated by, once again, zswap's less-than-100% memory
> saving ratio.

I agree, this is somewhat of an unrealistic setup. Hopefully the data and
learnings I shared from today's experiments provide some insight into
possible root-causes for the anomalous over-reclaim behavior.

Thanks,
Kanchana

> 
> >
> > 6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high()
> to
> >    break out of the loop if we have reclaimed a total of at least
> >    "nr_pages":
> >
> >         nr_reclaimed = reclaim_high(memcg,
> >                                     in_retry ? SWAP_CLUSTER_MAX : nr_pages,
> >                                     gfp_mask);
> >
> > +       nr_reclaimed_total += nr_reclaimed;
> > +
> > +       if (nr_reclaimed_total >= nr_pages)
> > +               goto out;
> >
> >
> >    This was only for debug purposes, and did seem to mitigate the higher
> >    reclaim behavior for 4K folios:
> >
> >  ZSWAP                  lz4             lz4             lz4
> >  ----------------------------------------------------------
> >  zswpout          1,305,367       1,349,195       1,529,235
> >  sys time (sec)      472.06          507.76          646.39
> >  Throughput KB/s     55,144          21,811          88,310
> >  memcg_high events  257,890         343,213         172,351
> >  ----------------------------------------------------------
> >
> > On average, this change results in 17% improvement in sys time, 2.35X
> > improvement in throughput and 30% fewer memcg_high events.
> >
> > I look forward to further inputs on next steps.
> >
> > Thanks,
> > Kanchana
> >
> >
> > >
> > > Thanks for this analysis. I will debug this some more, so we can better
> > > understand these results.
> > >
> > > Thanks,
> > > Kanchana
Nhat Pham Aug. 27, 2024, 2:55 p.m. UTC | #14
On Sun, Aug 18, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> base-commit: 8c0b4f7b65fd1ca7af01267f491e815a40d77444
> --
> 2.27.0
>

BTW, where does this commit come from? I assume this is post-mTHP
swapout - does it have mTHP swapin? Chris Li's patch series to improve
swap slot allocation?

Can't seem to find it when I fetch mm-unstable for some reason hmmmmm.
Nhat Pham Aug. 27, 2024, 3:23 p.m. UTC | #15
On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
>
> Agree with this as well. In our experiments with other workloads, we
> typically see much higher ratios.
>
> >
> > Probably does not explain everything, but worth double checking -
> > could you check with zstd to see if the ratio improves.
>
> Sure. I gathered ratio and compressed memory footprint data today with
> 64K mTHP, the 4G SSD swapfile and different zswap compressors.
>
>  This patch-series and no zswap charging, 64K mTHP:
> ---------------------------------------------------------------------------
>                        Total         Total     Average      Average   Comp
>                   compressed   compression  compressed  compression  ratio
>                       length       latency      length      latency
>                        bytes  milliseconds       bytes  nanoseconds
> ---------------------------------------------------------------------------
> SSD (no zswap) 1,362,296,832       887,861
> lz4            2,610,657,430        55,984       2,055      44,065    1.99
> zstd             729,129,528        50,986         565      39,510    7.25
> deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
> ---------------------------------------------------------------------------
>
> zstd does very well on ratio, as expected.

Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
average latency?

> Why are we running benchmarks on lz4 again? Sure, there is no free lunch
> and no compressor that works well on all kinds of data, but lz4's
> performance here is so bad that it's borderline justifiable to
> disable/bypass zswap with this kind of compression ratio...

Can I ask you to run benchmarking on zstd from now on?
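(For reference, the Comp ratio column in the quoted table is just the 4 KiB page size divided by the average compressed length; a quick check:)

```python
# The "Comp ratio" column above follows from the average compressed
# length per 4K page: ratio = page_size / avg_compressed_bytes.

PAGE_SIZE = 4096  # bytes per 4K page

def comp_ratio(avg_compressed_bytes):
    return round(PAGE_SIZE / avg_compressed_bytes, 2)

print(comp_ratio(2055))  # lz4         -> 1.99
print(comp_ratio(565))   # zstd        -> 7.25
print(comp_ratio(1415))  # deflate-iaa -> 2.89
```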

>
> >
> > >
> > >
> > >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > >  ------------------------------------------------------------
> > >
> > >  I wanted to take a step back and understand how the mainline v6.11-rc3
> > >  handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
> > >  swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> > >  with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> > >  cgroup).
> > >
> > >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > >
> > >  -------------------------------------------------------------
> > >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> > >  -------------------------------------------------------------
> > >  cgroup memory.events:           cgroup memory.events:
> > >
> > >  low                 0           low              0          0
> > >  high            5,068           high       321,923    375,116
> > >  max                 0           max              0          0
> > >  oom                 0           oom              0          0
> > >  oom_kill            0           oom_kill         0          0
> > >  oom_group_kill      0           oom_group_kill   0          0
> > >  -------------------------------------------------------------
> > >
> > >  SSD (CONFIG_ZSWAP is OFF):
> > >  --------------------------
> > >  pswpout            415,709
> > >  sys time (sec)      301.02
> > >  Throughput KB/s    155,970
> > >  memcg_high events    5,068
> > >  --------------------------
> > >
> > >
> > >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> > >  --------------------------------------------------------------
> > >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> > >  sys time (sec)      889.36      481.21      581.22      635.75
> > >  Throughput KB/s     35,176      14,765      20,253      21,407
> > >  memcg_high events  321,923     412,733     369,976     375,116
> > >  --------------------------------------------------------------
> > >
> > >  This shows that there is a performance regression of -60% to -195% with
> > >  zswap as compared to SSD with 4K folios. The higher swapout activity with
> > >  zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> > >
> > >  I verified this to be the case even with the v6.7 kernel, which also
> > >  showed a 2.3X throughput improvement when we don't charge zswap:
> > >
> > >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> > >  --------------------------------------------------------------------
> > >  zswpout              1,419,802       1,398,620
> > >  sys time (sec)           535.4          613.41
> >
> > systime increases without zswap cgroup charging? That's strange...
>
> Additional data gathered with v6.11-rc3 (listed below) based on your suggestion
> to investigate potential swap.high breaches should hopefully provide some
> explanation.
>
> >
> > >  Throughput KB/s          8,671          20,045
> > >  memcg_high events      574,046         451,859
> >
> > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> >
> > 1. More zswpout (than observed in SSD)
> > 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
> > 3. 100 times the amount of memcg_high events? This is perhaps the
> > *strangest* to me. You're already removing zswap cgroup charging, then
> > where does this come from? How can we have memory.high violations when
> > zswap does *not* contribute to memory usage?
> >
> > Is this due to swap limit charging? Do you have a cgroup swap limit?
> >
> > mem_high = page_counter_read(&memcg->memory) >
> >            READ_ONCE(memcg->memory.high);
> > swap_high = page_counter_read(&memcg->swap) >
> >            READ_ONCE(memcg->swap.high);
> > [...]
> >
> > if (mem_high || swap_high) {
> >     /*
> >     * The allocating tasks in this cgroup will need to do
> >     * reclaim or be throttled to prevent further growth
> >     * of the memory or swap footprints.
> >     *
> >     * Target some best-effort fairness between the tasks,
> >     * and distribute reclaim work and delay penalties
> >     * based on how much each task is actually allocating.
> >     */
> >     current->memcg_nr_pages_over_high += batch;
> >     set_notify_resume(current);
> >     break;
> > }
> >
>
> I don't have a swap.high limit set on the cgroup; it is set to "max".
>
> I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
> zswap compressors to verify if swap.high is breached with the 4G SSD swapfile.
>
>  SSD (CONFIG_ZSWAP is OFF):
>
>                                 SSD          SSD          SSD
>  ------------------------------------------------------------
>  pswpout                    415,709    1,032,170      636,582
>  sys time (sec)              301.02       328.15       306.98
>  Throughput KB/s            155,970       89,621      122,219
>  memcg_high events            5,068       15,072        8,344
>  memcg_swap_high events           0            0            0
>  memcg_swap_fail events           0            0            0
>  ------------------------------------------------------------
>
>  ZSWAP                               zstd         zstd       zstd
>  ----------------------------------------------------------------
>  zswpout                        1,391,524    1,382,965  1,417,307
>  sys time (sec)                    474.68       568.24     489.80
>  Throughput KB/s                   26,099       23,404    111,115
>  memcg_high events                335,112      340,335    162,260
>  memcg_swap_high events                 0            0          0
>  memcg_swap_fail events         1,226,899    5,742,153
>   (mem_cgroup_try_charge_swap)
>  memcg_memory_stat_pgactivate   1,259,547
>   (shrink_folio_list)
>  ----------------------------------------------------------------
>
>  ZSWAP                      lzo-rle      lzo-rle     lzo-rle
>  -----------------------------------------------------------
>  zswpout                  1,493,917    1,363,040   1,428,133
>  sys time (sec)              635.75       498.63      484.65
>  Throughput KB/s             21,407       23,827      20,237
>  memcg_high events          375,116      352,814     373,667
>  memcg_swap_high events           0            0           0
>  memcg_swap_fail events     715,211
>  -----------------------------------------------------------
>
>  ZSWAP                         lz4         lz4        lz4          lz4
>  ---------------------------------------------------------------------
>  zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
>  sys time (sec)             495.45      889.36      481.21      581.22
>  Throughput KB/s            26,248      35,176      14,765      20,253
>  memcg_high events         347,209     321,923     412,733     369,976
>  memcg_swap_high events          0           0           0           0
>  memcg_swap_fail events    580,103           0
>  ---------------------------------------------------------------------
>
>  ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
>  ----------------------------------------------------------------
>  zswpout                    380,471     1,440,902      1,397,965
>  sys time (sec)              329.06        570.77         467.41
>  Throughput KB/s            283,867        28,403        190,600
>  memcg_high events            5,551       422,831         28,154
>  memcg_swap_high events           0             0              0
>  memcg_swap_fail events           0     2,686,758        438,562
>  ----------------------------------------------------------------

Why are there 3 columns for each of the compressors? Is this different
runs of the same workload?

And why do some columns have missing cells?

>
> There are no swap.high memcg events recorded in any of the SSD/zswap
>  experiments. However, I do see a significant number of memcg_swap_fail
>  events in some of the zswap runs, for all 3 compressors. This is not
>  consistent, because there are some runs with 0 memcg_swap_fail for all
>  compressors.
>
>  There is a possible correlation between memcg_swap_fail events
>  (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
>  events. The root-cause appears to be that there are no available swap
>  slots, memcg_swap_fail is incremented, add_to_swap() fails in
>  shrink_folio_list(), followed by "activate_locked:" for the folio.
>  The folio re-activation is recorded in cgroup memory.stat pgactivate
>  events. The failure to swap out folios due to lack of swap slots could
>  contribute towards memory.high breaches.

Yeah FWIW, that was gonna be my first suggestion. This swapfile size
is wayyyy too small...

But that said, the link is not clear to me at all. The only thing I
can think of is lz4's performance sucks so bad that it's not saving
enough memory, leading to regression. And since it's still taking up
> swap slots, we cannot use swap either?
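One way to quantify the swap-slot pressure: a 4 GiB swapfile provides only about a million 4K slots, fewer than the zswpout counts in the runs above (a rough sketch; it ignores slot reuse as entries are freed, which is why some runs complete without failures):

```python
# How many 4K swap slots does a 4 GiB swapfile provide, and how does
# that compare to the observed zswpout counts? (Slots are reused as
# swap entries are freed, so counts above the slot total are possible,
# but allocation failures become much more likely near exhaustion.)

SWAPFILE_BYTES = 4 * (1 << 30)
PAGE_SIZE = 4096

slots = SWAPFILE_BYTES // PAGE_SIZE
print(slots)  # 1048576

observed_zswpout = 1_391_524  # zstd run from the quoted data
print(observed_zswpout > slots)  # True
```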

>
>  However, this is probably not the only cause for either the high # of
> >  memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> >  data where the number of memory.high breaches is significant even in
> >  cases where there are no memcg_swap_fails.
>
> Some observations/questions based on the above 4K folios swapout data:
>
> 1) There are more memcg_high events as the swapout latency reduces
>    (i.e. faster swap-write path). This is even without charging zswap
>    utilization to the cgroup.

This is still inexplicable to me. If we are not charging zswap usage,
we shouldn't even be triggering the reclaim_high() path, no?

> I'm curious - can you use bpftrace to track where/when reclaim_high
is being called?

>
> 2) There appears to be a direct co-relation between higher # of
>    memcg_swap_fail events, and an increase in memcg_high breaches and
>    reduction in usemem throughput. This combined with the observation in
>    (1) suggests that with a faster compressor, we need more swap slots,
>    that increases the probability of running out of swap slots with the 4G
>    SSD backing device.
>
> 3) Could the data shared earlier on reduction in memcg_high breaches with
>    64K mTHP swapout provide some more clues, if we agree with (1) and (2):
>
>    "Interestingly, the # of memcg_high events reduces significantly with 64K
>    mTHP as compared to the above 4K memcg_high events data, when tested
>    with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."
>
> 4) In the case of each zswap compressor, there are some runs that go
> >    through with 0 memcg_swap_fail events. These runs generally have
> >    fewer memcg_high breaches and better sys time/throughput.
>
> 5) For a given swap setup, there is some amount of variance in
>    sys time for this workload.
>
> 6) All this suggests that the primary root cause is the concurrency setup,
>    where there could be randomness between runs as to the # of processes
>    that observe the memory.high breach due to other factors such as
>    availability of swap slots for alloc.
>
> To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> running out of swap slots, and anomalous behavior with over-reclaim when 70
> concurrent processes are working with the 60G memory limit while trying to
> allocate 1G each; with randomness in processes reacting to the breach.
>
> The cgroup zswap charging exacerbates this situation, but is not a problem
> in and of itself.
>
> Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> doesn't seem to indicate any specific problems to be solved, other than the
> temporary cgroup zswap double-charging.
>
> Would it be fair to evaluate this patch-series based on a more realistic
> swapfile configuration based on 176G ZRAM, for which I had shared the data
> in v2? There weren't any problems with swap slots availability or any
> anomalies that I can think of with this setup, other than the fact that the
> "Before" and "After" sys times could not be directly compared for 2 key
> reasons:
>
>  - ZRAM compressed data is not charged to the cgroup, similar to SSD.
>  - ZSWAP compressed data is charged to the cgroup.

Yeah that's a bit unfair still. Wild idea, but what if we compare
SSD without zswap (or SSD with zswap, but without this patch series so
that mTHP are not zswapped) vs. zswap-on-zram (i.e. with a backing
swapfile on a zram block device)?

It is stupid, I know. But let's take advantage of the fact that zram
is not charged to the cgroup, pretending that its memory footprint is
empty?

I don't know how zram works though, so my apologies if it's a stupid
suggestion :)
Nhat Pham Aug. 27, 2024, 3:30 p.m. UTC | #16
On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> Yeah that's a bit unfair still. Wild idea, but what if we compare
> SSD without zswap (or SSD with zswap, but without this patch series so
> that mTHP are not zswapped) vs. zswap-on-zram (i.e. with a backing
> swapfile on a zram block device)?
>
> It is stupid, I know. But let's take advantage of the fact that zram
> is not charged to the cgroup, pretending that its memory footprint is
> empty?
>
> I don't know how zram works though, so my apologies if it's a stupid
> suggestion :)

Oh nvm, looks like that's what you're already doing.

That said, the lz4 column is soooo bad still, whereas the deflate-iaa
clearly shows improvement! This means it could be
compressor-dependent.

Can you try it with zstd?
Kanchana P Sridhar Aug. 27, 2024, 6:09 p.m. UTC | #17
Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 27, 2024 7:55 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Sun, Aug 18, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > base-commit: 8c0b4f7b65fd1ca7af01267f491e815a40d77444
> > --
> > 2.27.0
> >
> 
> BTW, where does this commit come from? I assume this is post-mTHP
> swapout - does it have mTHP swapin? Chris Li's patch series to improve
> swap slot allocation?
> 
> Can't seem to find it when I fetch mm-unstable for some reason hmmmmm.

This was the latest mm-unstable as of 8/18/2024:

commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444
Author: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Date:   Thu May 11 13:22:30 2023 +0800

    mm: optimization on page allocation when CMA enabled

Let me rebase to the latest mm-unstable and send out an updated patchset.

mm-unstable as of 8/27/2024:

- Has some of Chris Li's patches to improve swap slot allocation:

https://patchwork.kernel.org/project/linux-mm/patch/20240730-swap-allocator-v5-3-cb9c148b9297@kernel.org/
https://patchwork.kernel.org/project/linux-mm/patch/20240730-swap-allocator-v5-2-cb9c148b9297@kernel.org/
https://patchwork.kernel.org/project/linux-mm/patch/20240730-swap-allocator-v5-1-cb9c148b9297@kernel.org/

- Does not yet have mTHP swapin as far as I can tell.

Thanks,
Kanchana
Kanchana P Sridhar Aug. 27, 2024, 6:42 p.m. UTC | #18
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 27, 2024 8:24 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
> >
> > Agree with this as well. In our experiments with other workloads, we
> > typically see much higher ratios.
> >
> > >
> > > Probably does not explain everything, but worth double checking -
> > > could you check with zstd to see if the ratio improves.
> >
> > Sure. I gathered ratio and compressed memory footprint data today with
> > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> >
> >  This patch-series and no zswap charging, 64K mTHP:
> > ---------------------------------------------------------------------------
> >                        Total         Total     Average      Average   Comp
> >                   compressed   compression  compressed  compression  ratio
> >                       length       latency      length      latency
> >                        bytes  milliseconds       bytes  nanoseconds
> > ---------------------------------------------------------------------------
> > SSD (no zswap) 1,362,296,832       887,861
> > lz4            2,610,657,430        55,984       2,055      44,065    1.99
> > zstd             729,129,528        50,986         565      39,510    7.25
> > deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
> > ---------------------------------------------------------------------------
> >
> > zstd does very well on ratio, as expected.
> 
> Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
> average latency?
> 
> Why are we running benchmark on lz4 again? Sure there is no free lunch
> and no compressor that works well on all kind of data, but lz4's
> performance here is so bad that it's borderline justifiable to
> disable/bypass zswap with this kind of compression ratio...
> 
> Can I ask you to run benchmarking on zstd from now on?

Sure, will do.

> 
> >
> > [...]
> >
> >  ZSWAP                      lzo-rle      lzo-rle     lzo-rle
> >  -----------------------------------------------------------
> >  zswpout                  1,493,917    1,363,040   1,428,133
> >  sys time (sec)              635.75       498.63      484.65
> >  Throughput KB/s             21,407       23,827      20,237
> >  memcg_high events          375,116      352,814     373,667
> >  memcg_swap_high events           0            0           0
> >  memcg_swap_fail events     715,211
> >  -----------------------------------------------------------
> >
> >  ZSWAP                         lz4         lz4        lz4          lz4
> >  ---------------------------------------------------------------------
> >  zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
> >  sys time (sec)             495.45      889.36      481.21      581.22
> >  Throughput KB/s            26,248      35,176      14,765      20,253
> >  memcg_high events         347,209     321,923     412,733     369,976
> >  memcg_swap_high events          0           0           0           0
> >  memcg_swap_fail events    580,103           0
> >  ---------------------------------------------------------------------
> >
> >  ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
> >  ----------------------------------------------------------------
> >  zswpout                    380,471     1,440,902      1,397,965
> >  sys time (sec)              329.06        570.77         467.41
> >  Throughput KB/s            283,867        28,403        190,600
> >  memcg_high events            5,551       422,831         28,154
> >  memcg_swap_high events           0             0              0
> >  memcg_swap_fail events           0     2,686,758        438,562
> >  ----------------------------------------------------------------
> 
> Why are there 3 columns for each of the compressors? Is this different
> runs of the same workload?
> 
> And why do some columns have missing cells?

Yes, these are different runs of the same workload. Since there is some
amount of variance seen in the data, I figured it is best to publish the
metrics from the individual runs rather than averaging.

Some of these runs were gathered earlier with the same code base;
however, I wasn't monitoring/logging the memcg_swap_high/memcg_swap_fail
events at that time. For those runs, just these two counters have missing
column entries; the rest of the data is still valid.

> 
> >
> > There are no swap.high memcg events recorded in any of the SSD/zswap
> >  experiments. However, I do see a significant number of memcg_swap_fail
> >  events in some of the zswap runs, for all 3 compressors. This is not
> >  consistent, because there are some runs with 0 memcg_swap_fail for all
> >  compressors.
> >
> >  There is a possible correlation between memcg_swap_fail events
> >  (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> >  events. The root-cause appears to be that there are no available swap
> >  slots, memcg_swap_fail is incremented, add_to_swap() fails in
> >  shrink_folio_list(), followed by "activate_locked:" for the folio.
> >  The folio re-activation is recorded in cgroup memory.stat pgactivate
> >  events. The failure to swap out folios due to lack of swap slots could
> >  contribute towards memory.high breaches.
> 
> Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> is wayyyy too small...
> 
> But that said, the link is not clear to me at all. The only thing I
> can think of is lz4's performance sucks so bad that it's not saving
> enough memory, leading to regression. And since it's still taking up
> swap slot, we cannot use swap either?

The occurrence of memcg_swap_fail events establishes that swap slots
are not available with 4G of swap space. This causes those 4K folios to
remain in memory, which can worsen an existing problem with memory.high
breaches.

However, it is worth noting that this is not the only contributor to
memcg_high events that still occur without zswap charging. The data shows
321,923 occurrences of memcg_high in Col 2 of the lz4 table, which also has
0 occurrences of memcg_swap_fail reported in the cgroup stats.

> 
> >
> >  However, this is probably not the only cause for either the high # of
> >  memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> >  data where the memory.high count is significant even in cases where
> >  there are no memcg_swap_fails.
> >
> > Some observations/questions based on the above 4K folios swapout data:
> >
> > 1) There are more memcg_high events as the swapout latency reduces
> >    (i.e. faster swap-write path). This is even without charging zswap
> >    utilization to the cgroup.
> 
> This is still inexplicable to me. If we are not charging zswap usage,
> we shouldn't even be triggering the reclaim_high() path, no?
> 
> I'm curious - can you use bpftrace to track where/when reclaim_high
> is being called?

I had confirmed earlier with counters that all calls to reclaim_high()
were from include/linux/resume_user_mode.h::resume_user_mode_work().
I will confirm this with zstd and bpftrace and share.

Thanks,
Kanchana

> 
> >
> > 2) There appears to be a direct correlation between a higher # of
> >    memcg_swap_fail events, and an increase in memcg_high breaches and a
> >    reduction in usemem throughput. This, combined with the observation in
> >    (1), suggests that with a faster compressor, we need more swap slots,
> >    which increases the probability of running out of swap slots with the
> >    4G SSD backing device.
> >
> > 3) Could the data shared earlier on reduction in memcg_high breaches with
> >    64K mTHP swapout provide some more clues, if we agree with (1) and (2):
> >
> >    "Interestingly, the # of memcg_high events reduces significantly with 64K
> >    mTHP as compared to the above 4K memcg_high events data, when tested
> >    with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."
> >
> > 4) In the case of each zswap compressor, there are some runs that go
> >    through with 0 memcg_swap_fail events. These runs generally have
> >    fewer memcg_high breaches and better sys time/throughput.
> >
> > 5) For a given swap setup, there is some amount of variance in
> >    sys time for this workload.
> >
> > 6) All this suggests that the primary root cause is the concurrency setup,
> >    where there could be randomness between runs as to the # of processes
> >    that observe the memory.high breach due to other factors such as
> >    availability of swap slots for alloc.
> >
> > To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> > running out of swap slots, and anomalous behavior with over-reclaim when 70
> > concurrent processes are working with the 60G memory limit while trying to
> > allocate 1G each; with randomness in processes reacting to the breach.
> >
> > The cgroup zswap charging exacerbates this situation, but is not a problem
> > in and of itself.
> >
> > Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> > doesn't seem to indicate any specific problems to be solved, other than the
> > temporary cgroup zswap double-charging.
> >
> > Would it be fair to evaluate this patch-series based on a more realistic
> > swapfile configuration based on 176G ZRAM, for which I had shared the
> > data in v2? There weren't any problems with swap slots availability or any
> > anomalies that I can think of with this setup, other than the fact that the
> > "Before" and "After" sys times could not be directly compared for 2 key
> > reasons:
> >
> >  - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> >  - ZSWAP compressed data is charged to the cgroup.
> 
> Yeah that's a bit unfair still. Wild idea, but what about we compare
> SSD without zswap (or SSD with zswap, but without this patch series so
> that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> swapfile on zram block device).
> 
> It is stupid, I know. But let's take advantage of the fact that zram
> is not charged to cgroup, pretending that its memory foot print is
> empty?
> 
> I don't know how zram works though, so my apologies if it's a stupid
> suggestion :)
Kanchana P Sridhar Aug. 27, 2024, 6:43 p.m. UTC | #19
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 27, 2024 8:30 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> > swapfile on zram block device).
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to cgroup, pretending that its memory foot print is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)
> 
> Oh nvm, looks like that's what you're already doing.
> 
> That said, the lz4 column is soooo bad still, whereas the deflate-iaa
> clearly shows improvement! This means it could be
> compressor-dependent.
> 
> Can you try it with zstd?

Sure, I will gather data with zstd.

Thanks,
Kanchana
Kanchana P Sridhar Aug. 28, 2024, 7:24 a.m. UTC | #20
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, August 27, 2024 11:42 AM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Tuesday, August 27, 2024 8:24 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > > Internally, we often see 1:3 or 1:4 saving ratios (or even more).
> > >
> > > Agree with this as well. In our experiments with other workloads, we
> > > typically see much higher ratios.
> > >
> > > >
> > > > Probably does not explain everything, but worth double checking -
> > > > could you check with zstd to see if the ratio improves.
> > >
> > > Sure. I gathered ratio and compressed memory footprint data today with
> > > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> > >
> > >  This patch-series and no zswap charging, 64K mTHP:
> > > ---------------------------------------------------------------------------
> > >                        Total         Total     Average      Average   Comp
> > >                   compressed   compression  compressed  compression  ratio
> > >                       length       latency      length      latency
> > >                        bytes  milliseconds       bytes  nanoseconds
> > > ---------------------------------------------------------------------------
> > > SSD (no zswap) 1,362,296,832       887,861
> > > lz4            2,610,657,430        55,984       2,055      44,065    1.99
> > > zstd             729,129,528        50,986         565      39,510    7.25
> > > deflate-iaa    1,286,533,438        44,785       1,415      49,252    2.89
> > > ---------------------------------------------------------------------------
> > >
> > > zstd does very well on ratio, as expected.
> >
> > Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
> > average latency?
> >
> > Why are we running benchmark on lz4 again? Sure there is no free lunch
> > and no compressor that works well on all kind of data, but lz4's
> > performance here is so bad that it's borderline justifiable to
> > disable/bypass zswap with this kind of compression ratio...
> >
> > Can I ask you to run benchmarking on zstd from now on?
> 
> Sure, will do.
> 
> >
> > >
> > > >
> > > > >
> > > > >
> > > > >  Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > > >  ------------------------------------------------------------
> > > > >
> > > > >  I wanted to take a step back and understand how the mainline
> > > > >  v6.11-rc3 handles 4K folios when swapped out to SSD (CONFIG_ZSWAP
> > > > >  is off) and when swapped out to ZSWAP. Interestingly, higher
> > > > >  swapout activity is observed with 4K folios and v6.11-rc3 (with
> > > > >  the debug change to not charge zswap to cgroup).
> > > > >
> > > > >  v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > > > >
> > > > >  -------------------------------------------------------------
> > > > >  SSD (CONFIG_ZSWAP is OFF)       ZSWAP          lz4    lzo-rle
> > > > >  -------------------------------------------------------------
> > > > >  cgroup memory.events:           cgroup memory.events:
> > > > >
> > > > >  low                 0           low              0          0
> > > > >  high            5,068           high       321,923    375,116
> > > > >  max                 0           max              0          0
> > > > >  oom                 0           oom              0          0
> > > > >  oom_kill            0           oom_kill         0          0
> > > > >  oom_group_kill      0           oom_group_kill   0          0
> > > > >  -------------------------------------------------------------
> > > > >
> > > > >  SSD (CONFIG_ZSWAP is OFF):
> > > > >  --------------------------
> > > > >  pswpout            415,709
> > > > >  sys time (sec)      301.02
> > > > >  Throughput KB/s    155,970
> > > > >  memcg_high events    5,068
> > > > >  --------------------------
> > > > >
> > > > >
> > > > >  ZSWAP                  lz4         lz4         lz4     lzo-rle
> > > > >  --------------------------------------------------------------
> > > > >  zswpout          1,598,550   1,515,151   1,449,432   1,493,917
> > > > >  sys time (sec)      889.36      481.21      581.22      635.75
> > > > >  Throughput KB/s     35,176      14,765      20,253      21,407
> > > > >  memcg_high events  321,923     412,733     369,976     375,116
> > > > >  --------------------------------------------------------------
> > > > >
> > > > >  This shows that there is a performance regression of -60% to -195%
> > > > >  with zswap as compared to SSD with 4K folios. The higher swapout
> > > > >  activity with zswap is seen here too (i.e., this doesn't appear to
> > > > >  be mTHP-specific).
> > > > >
> > > > >  I verified this to be the case even with the v6.7 kernel, which also
> > > > >  showed a 2.3X throughput improvement when we don't charge zswap:
> > > > >
> > > > >  ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> > > > >  --------------------------------------------------------------------
> > > > >  zswpout              1,419,802       1,398,620
> > > > >  sys time (sec)           535.4          613.41
> > > >
> > > > systime increases without zswap cgroup charging? That's strange...
> > >
> > > Additional data gathered with v6.11-rc3 (listed below) based on your
> > > suggestion to investigate potential swap.high breaches should hopefully
> > > provide some explanation.
> > >
> > > >
> > > > >  Throughput KB/s          8,671          20,045
> > > > >  memcg_high events      574,046         451,859
> > > >
> > > > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> > > >
> > > > 1. More zswpout (than observed in SSD)
> > > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup
> > > > charging.
> > > > 3. 100 times the amount of memcg_high events? This is perhaps the
> > > > *strangest* to me. You're already removing zswap cgroup charging, then
> > > > where does this come from? How can we have a memory.high violation when
> > > > zswap does *not* contribute to memory usage?
> > > >
> > > > Is this due to swap limit charging? Do you have a cgroup swap limit?
> > > >
> > > > mem_high = page_counter_read(&memcg->memory) >
> > > >            READ_ONCE(memcg->memory.high);
> > > > swap_high = page_counter_read(&memcg->swap) >
> > > >            READ_ONCE(memcg->swap.high);
> > > > [...]
> > > >
> > > > if (mem_high || swap_high) {
> > > >     /*
> > > >     * The allocating tasks in this cgroup will need to do
> > > >     * reclaim or be throttled to prevent further growth
> > > >     * of the memory or swap footprints.
> > > >     *
> > > >     * Target some best-effort fairness between the tasks,
> > > >     * and distribute reclaim work and delay penalties
> > > >     * based on how much each task is actually allocating.
> > > >     */
> > > >     current->memcg_nr_pages_over_high += batch;
> > > >     set_notify_resume(current);
> > > >     break;
> > > > }
> > > >
> > >
> > > I don't have a swap.high limit set on the cgroup; it is set to "max".
> > >
> > > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and
> > > different zswap compressors to verify if swap.high is breached with the
> > > 4G SSD swapfile.
> > >
> > >  SSD (CONFIG_ZSWAP is OFF):
> > >
> > >                                 SSD          SSD          SSD
> > >  ------------------------------------------------------------
> > >  pswpout                    415,709    1,032,170      636,582
> > >  sys time (sec)              301.02       328.15       306.98
> > >  Throughput KB/s            155,970       89,621      122,219
> > >  memcg_high events            5,068       15,072        8,344
> > >  memcg_swap_high events           0            0            0
> > >  memcg_swap_fail events           0            0            0
> > >  ------------------------------------------------------------
> > >
> > >  ZSWAP                               zstd         zstd       zstd
> > >  ----------------------------------------------------------------
> > >  zswpout                        1,391,524    1,382,965  1,417,307
> > >  sys time (sec)                    474.68       568.24     489.80
> > >  Throughput KB/s                   26,099       23,404    111,115
> > >  memcg_high events                335,112      340,335    162,260
> > >  memcg_swap_high events                 0            0          0
> > >  memcg_swap_fail events         1,226,899    5,742,153
> > >   (mem_cgroup_try_charge_swap)
> > >  memcg_memory_stat_pgactivate   1,259,547
> > >   (shrink_folio_list)
> > >  ----------------------------------------------------------------
> > >
> > >  ZSWAP                      lzo-rle      lzo-rle     lzo-rle
> > >  -----------------------------------------------------------
> > >  zswpout                  1,493,917    1,363,040   1,428,133
> > >  sys time (sec)              635.75       498.63      484.65
> > >  Throughput KB/s             21,407       23,827      20,237
> > >  memcg_high events          375,116      352,814     373,667
> > >  memcg_swap_high events           0            0           0
> > >  memcg_swap_fail events     715,211
> > >  -----------------------------------------------------------
> > >
> > >  ZSWAP                         lz4         lz4        lz4          lz4
> > >  ---------------------------------------------------------------------
> > >  zswpout                 1,378,781   1,598,550   1,515,151   1,449,432
> > >  sys time (sec)             495.45      889.36      481.21      581.22
> > >  Throughput KB/s            26,248      35,176      14,765      20,253
> > >  memcg_high events         347,209     321,923     412,733     369,976
> > >  memcg_swap_high events          0           0           0           0
> > >  memcg_swap_fail events    580,103           0
> > >  ---------------------------------------------------------------------
> > >
> > >  ZSWAP                  deflate-iaa   deflate-iaa    deflate-iaa
> > >  ----------------------------------------------------------------
> > >  zswpout                    380,471     1,440,902      1,397,965
> > >  sys time (sec)              329.06        570.77         467.41
> > >  Throughput KB/s            283,867        28,403        190,600
> > >  memcg_high events            5,551       422,831         28,154
> > >  memcg_swap_high events           0             0              0
> > >  memcg_swap_fail events           0     2,686,758        438,562
> > >  ----------------------------------------------------------------
> >
> > Why are there 3 columns for each of the compressors? Is this different
> > runs of the same workload?
> >
> > And why do some columns have missing cells?
> 
> Yes, these are different runs of the same workload. Since there is some
> amount of variance seen in the data, I figured it is best to publish the
> metrics from the individual runs rather than averaging.
> 
> Some of these runs were gathered earlier with the same code base;
> however, I wasn't monitoring/logging the memcg_swap_high/memcg_swap_fail
> events at that time. For those runs, just these two counters have missing
> column entries; the rest of the data is still valid.
> 
> >
> > >
> > > There are no swap.high memcg events recorded in any of the SSD/zswap
> > >  experiments. However, I do see a significant number of memcg_swap_fail
> > >  events in some of the zswap runs, for all 3 compressors. This is not
> > >  consistent, because there are some runs with 0 memcg_swap_fail for all
> > >  compressors.
> > >
> > >  There is a possible correlation between memcg_swap_fail events
> > >  (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> > >  events. The root-cause appears to be that there are no available swap
> > >  slots, memcg_swap_fail is incremented, add_to_swap() fails in
> > >  shrink_folio_list(), followed by "activate_locked:" for the folio.
> > >  The folio re-activation is recorded in cgroup memory.stat pgactivate
> > >  events. The failure to swap out folios due to lack of swap slots could
> > >  contribute towards memory.high breaches.
> >
> > Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> > is wayyyy too small...
> >
> > But that said, the link is not clear to me at all. The only thing I
> > can think of is lz4's performance sucks so bad that it's not saving
> > enough memory, leading to regression. And since it's still taking up
> > swap slot, we cannot use swap either?
> 
> The occurrence of memcg_swap_fail events establishes that swap slots
> are not available with 4G of swap space. This causes those 4K folios to
> remain in memory, which can worsen an existing problem with memory.high
> breaches.
> 
> However, it is worth noting that this is not the only contributor to
> memcg_high events that still occur without zswap charging. The data shows
> 321,923 occurrences of memcg_high in Col 2 of the lz4 table, that also has
> 0 occurrences of memcg_swap_fail reported in the cgroup stats.
> 
> >
> > >
> > >  However, this is probably not the only cause for either the high # of
> > >  memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> > >  data where the memory.high count is significant even in cases where
> > >  there are no memcg_swap_fails.
> > >
> > > Some observations/questions based on the above 4K folios swapout data:
> > >
> > > 1) There are more memcg_high events as the swapout latency reduces
> > >    (i.e. faster swap-write path). This is even without charging zswap
> > >    utilization to the cgroup.
> >
> > This is still inexplicable to me. If we are not charging zswap usage,
> > we shouldn't even be triggering the reclaim_high() path, no?
> >
> > I'm curious - can you use bpftrace to track where/when reclaim_high
> > is being called?

Hi Nhat,

Since reclaim_high() is called only in a handful of places, I figured I
would just use debugfs u64 counters to record where it gets called from.

These are the places where I increment the debugfs counters:

include/linux/resume_user_mode.h:
---------------------------------
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..382f5469e9a2 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -24,6 +24,7 @@ static inline void set_notify_resume(struct task_struct *task)
 		kick_process(task);
 }
 
+extern u64 hoh_userland;
 
 /**
  * resume_user_mode_work - Perform work before returning to user mode
@@ -56,6 +57,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
 	}
 #endif
 
+	++hoh_userland;
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 

mm/memcontrol.c:
----------------
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f29157288b7d..6738bb670a78 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1910,9 +1910,12 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 	return nr_reclaimed;
 }
 
+extern u64 rec_high_hwf;
+
 static void high_work_func(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
+	++rec_high_hwf;
 
 	memcg = container_of(work, struct mem_cgroup, high_work);
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
@@ -2055,6 +2058,8 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
+extern u64 rec_high_hoh;
+
 /*
  * Reclaims memory over the high limit. Called directly from
  * try_charge() (context permitting), as well as from the userland
@@ -2097,6 +2102,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * memory.high is currently batched, whereas memory.max and the page
 	 * allocator run every time an allocation is made.
 	 */
+	++rec_high_hoh;
 	nr_reclaimed = reclaim_high(memcg,
 				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
 				    gfp_mask);
@@ -2153,6 +2159,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	css_put(&memcg->css);
 }
 
+extern u64 hoh_trycharge;
+
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages)
 {
@@ -2344,8 +2352,10 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
 	    !(current->flags & PF_MEMALLOC) &&
-	    gfpflags_allow_blocking(gfp_mask))
+	    gfpflags_allow_blocking(gfp_mask)) {
+		++hoh_trycharge;
 		mem_cgroup_handle_over_high(gfp_mask);
+	}
 	return 0;
 }
 

I reverted my debug changes for "zswap to not charge cgroup" when I ran
this next set of experiments that record the # of times and locations
where reclaim_high() is called.

zstd is the compressor I have configured for both ZSWAP and ZRAM.

 6.11-rc3 mainline, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
 ----------------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 112,910

 hoh_userland 128,835
 hoh_trycharge 0
 rec_high_hoh 113,079
 rec_high_hwf 0

 6.11-rc3 mainline, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
 ------------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 4,693
 
 hoh_userland 14,069
 hoh_trycharge 0
 rec_high_hoh 4,694
 rec_high_hwf 0


 ZSWAP-mTHP, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
 ---------------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 139,495
 
 hoh_userland 156,628
 hoh_trycharge 0
 rec_high_hoh 140,039
 rec_high_hwf 0

 ZSWAP-mTHP, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
 -----------------------------------------------------
 /sys/fs/cgroup/iax/memory.events:
 high 20,427
 
 /sys/fs/cgroup/iax/memory.swap.events:
 fail 20,856
 
 hoh_userland 31,346
 hoh_trycharge 0
 rec_high_hoh 20,513
 rec_high_hwf 0

This shows that in all cases, reclaim_high() is called only from the return
path to user mode after handling a page-fault.

Thanks,
Kanchana

> 
> I had confirmed earlier with counters that all calls to reclaim_high()
> were from include/linux/resume_user_mode.h::resume_user_mode_work().
> I will confirm this with zstd and bpftrace and share.
> 
> Thanks,
> Kanchana
> 
> >
> > >
> > > 2) There appears to be a direct correlation between a higher # of
> > >    memcg_swap_fail events, and an increase in memcg_high breaches and a
> > >    reduction in usemem throughput. This, combined with the observation
> > >    in (1), suggests that with a faster compressor, we need more swap
> > >    slots, which increases the probability of running out of swap slots
> > >    with the 4G SSD backing device.
> > >
> > > 3) Could the data shared earlier on reduction in memcg_high breaches
> > >    with 64K mTHP swapout provide some more clues, if we agree with (1)
> > >    and (2):
> > >
> > >    "Interestingly, the # of memcg_high events reduces significantly
> > >    with 64K mTHP as compared to the above 4K memcg_high events data,
> > >    when tested with v4 and no zswap charge: 3,069 (SSD-mTHP) and
> > >    19,656 (ZSWAP-mTHP)."
> > >
> > > 4) In the case of each zswap compressor, there are some runs that go
> > >    through with 0 memcg_swap_fail events. These runs generally have
> > >    fewer memcg_high breaches and better sys time/throughput.
> > >
> > > 5) For a given swap setup, there is some amount of variance in
> > >    sys time for this workload.
> > >
> > > 6) All this suggests that the primary root cause is the concurrency setup,
> > >    where there could be randomness between runs as to the # of processes
> > >    that observe the memory.high breach due to other factors such as
> > >    availability of swap slots for alloc.
> > >
> > > To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> > > running out of swap slots, and anomalous behavior with over-reclaim when 70
> > > concurrent processes are working with the 60G memory limit while trying to
> > > allocate 1G each; with randomness in processes reacting to the breach.
> > >
> > > The cgroup zswap charging exacerbates this situation, but is not a problem
> > > in and of itself.
> > >
> > > Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> > > doesn't seem to indicate any specific problems to be solved, other than the
> > > temporary cgroup zswap double-charging.
> > >
> > > Would it be fair to evaluate this patch-series based on a more realistic
> > > swapfile configuration using 176G ZRAM, for which I had shared the data
> > > in v2? There weren't any problems with swap slots availability or any
> > > anomalies that I can think of with this setup, other than the fact that the
> > > "Before" and "After" sys times could not be directly compared for 2 key
> > > reasons:
> > >
> > >  - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> > >  - ZSWAP compressed data is charged to the cgroup.
> >
> > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> > swapfile on zram block device).
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to cgroup, pretending that its memory foot print is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)
Kanchana P Sridhar Aug. 28, 2024, 7:27 a.m. UTC | #21
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, August 27, 2024 11:43 AM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Tuesday, August 27, 2024 8:30 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com>
> wrote:
> > >
> > > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > > SSD without zswap (or SSD with zswap, but without this patch series so
> > > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> > > swapfile on zram block device).
> > >
> > > It is stupid, I know. But let's take advantage of the fact that zram
> > > is not charged to cgroup, pretending that its memory foot print is
> > > empty?
> > >
> > > I don't know how zram works though, so my apologies if it's a stupid
> > > suggestion :)
> >
> > Oh nvm, looks like that's what you're already doing.
> >
> > That said, the lz4 column is soooo bad still, whereas the deflate-iaa
> > clearly shows improvement! This means it could be
> > compressor-dependent.
> >
> > Can you try it with zstd?
> 
> Sure, I will gather data with zstd.

I will be sending out a v5 shortly with data gathered with zstd.

Thanks,
Kanchana

> 
> Thanks,
> Kanchana
Yosry Ahmed Aug. 28, 2024, 7:43 a.m. UTC | #22
[..]
>
> This shows that in all cases, reclaim_high() is called only from the return
> path to user mode after handling a page-fault.

I am sorry I haven't been keeping up with this thread, I don't have a
lot of capacity right now.

If my understanding is correct, the summary of the problem we are
observing here is that with high concurrency (70 processes), we
observe worse system time, worse throughput, and higher memory_high
events with zswap than SSD swap. This is true (with varying degrees)
for 4K or mTHP, and with or without charging zswap compressed memory.

Did I get that right?

I saw you also mentioned that reclaim latency is directly correlated
to higher memory_high events.

Is it possible that with SSD swap, because we wait for IO during
reclaim, this gives a chance for other processes to allocate and free
the memory they need. While with zswap because everything is
synchronous, all processes are trying to allocate their memory at the
same time resulting in higher reclaim rates?

IOW, maybe with zswap all the processes try to allocate their memory
at the same time, so the total amount of memory needed at any given
instance is much higher than memory.high, so we keep producing
memory_high events and reclaiming. If 70 processes all require 1G at
the same time, then we need 70G of memory at once, we will keep
thrashing pages in/out of zswap.

While with SSD swap, due to the waits imposed by IO, the allocations
are more spread out and more serialized, and the amount of memory
needed at any given instance is lower; resulting in less reclaim
activity and ultimately faster overall execution?

Could you please describe what the processes are doing? Are they
allocating memory and holding on to it, or immediately freeing it?

Do you have visibility into when each process allocates and frees memory?
Kanchana P Sridhar Aug. 28, 2024, 6:50 p.m. UTC | #23
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 12:44 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> [..]
> >
> > This shows that in all cases, reclaim_high() is called only from the return
> > path to user mode after handling a page-fault.
> 
> I am sorry I haven't been keeping up with this thread, I don't have a
> lot of capacity right now.
> 
> If my understanding is correct, the summary of the problem we are
> observing here is that with high concurrency (70 processes), we
> observe worse system time, worse throughput, and higher memory_high
> events with zswap than SSD swap. This is true (with varying degrees)
> for 4K or mTHP, and with or without charging zswap compressed memory.
> 
> Did I get that right?

Thanks for your review and comments! Yes, this is correct.

> 
> I saw you also mentioned that reclaim latency is directly correlated
> to higher memory_high events.

That was my observation based on the swap-constrained experiments with the 4G SSD.
With a faster compressor, we allow allocations to proceed quickly, and if the pages
are not being faulted in, we need more swap slots. This increases the probability of
running out of swap slots with the 4G SSD backing device, which, as the data in v4
shows, causes memcg_swap_fail events that drive folios to remain resident in memory
(triggering memcg_high breaches as allocations proceed even without zswap cgroup
charging).

Things change when the experiments are run with abundant swap space and with the
default behavior of zswap compressed data being charged to the cgroup, as in the
data with 176GiB ZRAM as ZSWAP's backing swapfile posted in v5. Now, the critical
path to workload performance shifts to concurrent reclaims in response to memcg_high
events driven by allocation and zswap usage. We see a smaller increase in swapout
activity (as compared to the swap-constrained experiments in v4), and compress
latency seems to become the bottleneck. Each individual process's throughput/sys
time degrades mainly as a function of compress latency. Anyway, these were some of
my learnings from these experiments. Please do let me know if there are other
insights/analysis I could be missing.

> 
> Is it possible that with SSD swap, because we wait for IO during
> reclaim, this gives a chance for other processes to allocate and free
> the memory they need. While with zswap because everything is
> synchronous, all processes are trying to allocate their memory at the
> same time resulting in higher reclaim rates?
> 
> IOW, maybe with zswap all the processes try to allocate their memory
> at the same time, so the total amount of memory needed at any given
> instance is much higher than memory.high, so we keep producing
> memory_high events and reclaiming. If 70 processes all require 1G at
> the same time, then we need 70G of memory at once, we will keep
> thrashing pages in/out of zswap.
> 
> While with SSD swap, due to the waits imposed by IO, the allocations
> are more spread out and more serialized, and the amount of memory
> needed at any given instance is lower; resulting in less reclaim
> activity and ultimately faster overall execution?

This is a very interesting hypothesis, along the lines of the "slower compressor"
observation I gathered from the 4G SSD experiments: a slower compressor essentially
causes allocation stalls (and buffers us from the swap-slots unavailability effect).
I think this is a possibility.

> 
> Could you please describe what the processes are doing? Are they
> allocating memory and holding on to it, or immediately freeing it?

I have been using the vm-scalability usemem workload for these experiments.
Thanks Ying for suggesting I use this workload!

I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
This forks 70 processes, each of which does the following:

1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write permissions.
2) Steps through and accesses each 8 bytes chunk of memory in the mmap-ed region, and:
    2.a) Writes the index of that chunk to the (unsigned long *) memory at that index.
3) Generates statistics on throughput.

There is an "munmap()" after step (2.a) that I have commented out because I wanted to
see how much cold memory resides in the zswap zpool after the workload exits. Interestingly,
this was 0 for 64K mTHP, but of the order of several hundreds of MB for 2M THP.

> 
> Do you have visibility into when each process allocates and frees memory?

Yes. Hopefully the above offers some clarifications.

Thanks,
Kanchana
Yosry Ahmed Aug. 28, 2024, 10:34 p.m. UTC | #24
On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> [..]
> I have been using the vm-scalability usemem workload for these experiments.
> Thanks Ying for suggesting I use this workload!
>
> I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
> This forks 70 processes, each of which does the following:
>
> 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write permissions.
> 2) Steps through and accesses each 8 bytes chunk of memory in the mmap-ed region, and:
>     2.a) Writes the index of that chunk to the (unsigned long *) memory at that index.
> 3) Generates statistics on throughput.
>
> There is an "munmap()" after step (2.a) that I have commented out because I wanted to
> see how much cold memory resides in the zswap zpool after the workload exits. Interestingly,
> this was 0 for 64K mTHP, but of the order of several hundreds of MB for 2M THP.

Does the process exit immediately after step (3)? The memory will be
unmapped and freed once the process exits anyway, so removing an unmap
that immediately precedes the process exiting should have no effect.

I wonder how this changes if the processes sleep and keep the memory
mapped for a while, to force the situation where all the memory is
needed at the same time on SSD as well as zswap. This could make the
playing field more even and force the same thrashing to happen on SSD
for a more fair comparison.

It's not a fix, but if very fast reclaim with zswap ends up causing more
problems, perhaps we need to tweak the throttling of memory.high or
something.
Kanchana P Sridhar Aug. 29, 2024, 12:14 a.m. UTC | #25
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 3:34 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> > [..]
> 
> Does the process exit immediately after step (3)? The memory will be
> unmapped and freed once the process exits anyway, so removing an unmap
> that immediately precedes the process exiting should have no effect.

Yes, you're right.

> 
> I wonder how this changes if the processes sleep and keep the memory
> mapped for a while, to force the situation where all the memory is
> needed at the same time on SSD as well as zswap. This could make the
> playing field more even and force the same thrashing to happen on SSD
> for a more fair comparison.

Good point. I believe I saw an option in usemem that could facilitate this.
I will investigate.

> 
> It's not a fix, if very fast reclaim with zswap ends up causing more
> problems perhaps we need to tweak the throttling of memory.high or
> something.

Sure, that is a possibility. Although proactive reclaim might mitigate this,
in which case very fast reclaim with zswap might help.

Thanks,
Kanchana