[v7,0/8] mm: ZSWAP swap-out of mTHP folios

Message ID: 20240924011709.7037-1-kanchana.p.sridhar@intel.com

Kanchana P Sridhar Sept. 24, 2024, 1:17 a.m. UTC
Hi All,

This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
ported to mm-unstable as of 9-23-2024 in patches 5 and 6 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
delete all offsets corresponding to a higher order folio stored in zswap.

For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
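
As an illustration of how these counters could be consumed from userspace,
here is a minimal Python sketch that reads every per-order "zswpout" file.
Only the sysfs path pattern comes from this series; the helper name
read_zswpout_counters() and its "base" parameter are hypothetical and exist
only for this example.

```python
import glob
import os
import re

# Location of the per-order mTHP stats named in this cover letter.
THP_SYSFS = "/sys/kernel/mm/transparent_hugepage"


def read_zswpout_counters(base=THP_SYSFS):
    """Return {size_in_kB: zswpout_count} for every hugepages-*kB directory.

    Hypothetical helper for illustration. Directories without a readable
    zswpout file (e.g. a kernel without this series) are skipped.
    """
    counters = {}
    pattern = os.path.join(base, "hugepages-*kB", "stats", "zswpout")
    for path in glob.glob(pattern):
        # .../hugepages-64kB/stats/zswpout -> 64
        match = re.search(r"hugepages-(\d+)kB", path)
        if match is None:
            continue
        try:
            with open(path) as f:
                counters[int(match.group(1))] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return counters


if __name__ == "__main__":
    for size_kb, count in sorted(read_zswpout_counters().items()):
        print(f"{size_kb}kB zswpout: {count}")
```

Sampling the counters before and after a workload and subtracting the two
snapshots would give the number of successful mTHP stores per order for
that workload.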

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP. When disabled, zswap will
fall back to rejecting the mTHP folio, which is then processed by the
backing swap device.

This patch-series is a pre-requisite for ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent patch-series, with performance improvement data.

Thanks to Ying Huang for pre-posting review feedback and suggestions!

Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
helpful feedback, data reviews and suggestions!

Co-development signoff request:
===============================
I would like to request Ryan Roberts' co-developer signoff on patches
5 and 6 in this series. Thanks Ryan!

Changes since v6:
=================
1) Rebased to mm-unstable as of 9-23-2024,
   commit acfabf7e197f7a5bedf4749dac1f39551417b049.
2) Refactored into smaller commits, as suggested by Yosry and
   Chengming. Thanks both!
3) Reworded the commit log for patches 5 and 6 as per Yosry's
   suggestion. Thanks Yosry!
4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
   partition. Also, all experiments are run with usemem --sleep 10, so that
   the memory allocated by the 70 processes remains in memory
   longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
   their help with refining the performance characterization methodology.
5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
   Nhat. Thanks Nhat!

Changes since v5:
=================
1) Rebased to mm-unstable as of 8/29/2024,
   commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
   enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
   suggestion to add a knob by which users can enable/disable this
   change. Nhat, I hope this is along the lines of what you were
   thinking.
3) Added vm-scalability usemem data with 4K folios with
   CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
   there is no regression with this change.
4) Added data with usemem with 64K and 2M THP for an alternate view of
   before/after, as suggested by Yosry, so we can understand the impact
   of when mTHPs are split into 4K folios in shrink_folio_list()
   (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
   in zswap. Thanks Yosry for this suggestion.

Changes since v4:
=================
1) Published before/after data with zstd, as suggested by Nhat (Thanks
   Nhat for the data reviews!).
2) Rebased to mm-unstable from 8/27/2024,
   commit b659edec079c90012cf8d05624e312d1062b8b87.
3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
   CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
   robot; as per Nhat's and Michal's suggestion to not require a separate
   patch to fix the build errors (thanks both!).
4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
   suggested by Yosry (Thanks Yosry!).
5) Squashed the commits that define new mthp zswpout stat counters, and
   invoke count_mthp_stat() after successful zswap_store()s; into a single
   commit. Thanks Yosry for this suggestion!

Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
   Thanks to Barry for suggesting aligning with Ryan Roberts' latest
   changes to count_mthp_stat() so that it's always defined, even when THP
   is disabled. Barry, I have also made one other change in page_io.c
   where count_mthp_stat() is called by count_swpout_vm_event(). I would
   appreciate it if you can review this. Thanks!
   Hopefully this should resolve the kernel robot build errors.

Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
   as suggested by Ying Huang. Ying, I would appreciate it if you can
   review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
   the kernel test robot build errors.
3) No code changes to the individual patches themselves.

Changes since RFC v1:
=====================

1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
   Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
   Ryan's initial RFC [1]:
   - Added a comment about the cgroup zswap limit checks occurring once per
     folio at the beginning of zswap_store().
     Nhat, Ryan, please do let me know if the comments convey the summary
     from the RFC discussion. Thanks!
   - Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.


Regression Testing:
===================
I ran vm-scalability usemem with 70 processes without mTHP, i.e., with only
4K folios, on mm-unstable and on this patch-series. The main goal was to
make sure that there is no functional or performance regression wrt the
earlier zswap behavior for 4K folios when
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
pages goes through the newly added code path [zswap_store(),
zswap_store_page()].

The data indicates there is no regression.

 ------------------------------------------------------------------------------
                     mm-unstable 8-28-2024                        zswap-mTHP v6
                                              CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
                                                                     is not set
 ------------------------------------------------------------------------------
 ZSWAP compressor        zstd     deflate-                     zstd    deflate-
                                       iaa                                  iaa
 ------------------------------------------------------------------------------
 Throughput (KB/s)    110,775      113,010               111,550        121,937
 sys time (sec)      1,141.72       954.87              1,131.95         828.47
 memcg_high           140,500      153,737               139,772        134,129
 memcg_swap_high            0            0                     0              0
 memcg_swap_fail            0            0                     0              0
 pswpin                     0            0                     0              0
 pswpout                    0            0                     0              0
 zswpin                   675          690                   682            684
 zswpout            9,552,298   10,603,271             9,566,392      9,267,213
 thp_swpout                 0            0                     0              0
 thp_swpout_                0            0                     0              0
  fallback                                                                     
 pgmajfault             3,453        3,468                 3,841          3,487
 ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
 SWPOUT-64kB-mTHP           0            0                     0              0
 ------------------------------------------------------------------------------
                                                 

Performance Testing:
====================
Testing of this patch-series was done with mm-unstable as of 9-23-2024,
commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
without/with this patch-series, on an Intel Sapphire Rapids server,
dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
823G SSD disk partition swap. Core frequency was fixed at 2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. There is no swap limit set for the cgroup. Following a
similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
series [2], 70 usemem processes were run, each allocating and writing 1G of
memory, and sleeping for 10 sec before exiting:

    usemem --init-time -w -O -s 10 -n 70 1g
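
As a rough sanity check of the memory pressure this setup creates, the
following sketch compares the total working set with the cgroup limit. It
assumes that "1g" means 1 GiB, that "40G" means 40 GiB and that base pages
are 4 KiB; none of these are stated explicitly above.

```python
GIB = 1 << 30
PAGE_SIZE = 4096              # assumption: 4 KiB base pages

processes = 70
bytes_per_process = 1 * GIB   # assumption: usemem "1g" means 1 GiB
memory_high = 40 * GIB        # assumption: "40G" means 40 GiB

working_set = processes * bytes_per_process
excess = max(0, working_set - memory_high)

print(f"working set: {working_set // GIB} GiB")
print(f"excess over memory.high: {excess // GIB} GiB "
      f"({excess // PAGE_SIZE:,} 4K pages)")
```

This is only a lower bound on the amount of memory that has to be
reclaimed. The swapout counters in the tables can differ from it because
the data is averaged across runs and because, with zswap, the compressed
pool is also charged to the cgroup.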

The vm/sysfs mTHP stats included with the performance data provide details
on the swapout activity to ZSWAP/swap.

Other kernel configuration parameters:

    ZSWAP Compressors : zstd, deflate-iaa
    ZSWAP Allocator   : zsmalloc
    SWAP page-cluster : 2

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence each IAA compression
will be decompressed internally by the "iaa_crypto" driver, and the CRCs
returned by the hardware will be compared, with errors reported in case of
mismatches. Thus "deflate-iaa" helps ensure better data integrity as
compared to the software compressors.

Throughput is derived by averaging the individual 70 processes' throughputs
reported by usemem. elapsed/sys times are measured with perf. All data
points per compressor/kernel/mTHP configuration are averaged across 3 runs.
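
One plausible reading of this averaging is sketched below: first the mean
of the per-process throughputs within a run, then the mean of those per-run
figures across the runs. The function names and the sample numbers are made
up for illustration and are not the scripts used to produce the data.

```python
from statistics import mean


def run_throughput(per_process_kbps):
    """Average the per-process throughputs (KB/s) of one usemem run."""
    return mean(per_process_kbps)


def reported_throughput(runs):
    """Average the per-run figures across all runs of one configuration."""
    return mean(run_throughput(r) for r in runs)


# Made-up numbers: 3 runs with 4 processes each (the real runs use 70).
example_runs = [
    [100.0, 110.0, 90.0, 100.0],
    [120.0, 100.0, 110.0, 110.0],
    [105.0, 95.0, 100.0, 100.0],
]
print(reported_throughput(example_runs))
```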

Case 1: Comparing zswap 4K vs. zswap mTHP
=========================================

In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
in 64K/2M (m)THP being split into 4K folios that get processed by zswap.

The "after" is CONFIG_THP_SWAP set to on together with this patch-series,
which results in 64K/2M (m)THP being processed by zswap without being split.

 64KB mTHP (cgroup memory.high set to 40G):
 ==========================================

 -------------------------------------------------------------------------------
                    mm-unstable 9-23-2024              zswap-mTHP     Change wrt
                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
                                 Baseline
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
 elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
 sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
 memcg_high          132,743      169,825     148,075     192,744
 memcg_swap_fail     639,067      841,553       2,204       2,215
 pswpin                    0            0           0           0
 pswpout                   0            0           0           0
 zswpin                  795          873         760         902
 zswpout          10,011,266   13,195,137  10,010,017  13,193,554
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 64kB-mthp_          639,065      841,553       2,204       2,215
  swpout_fallback
 pgmajfault            2,861        2,924       3,054       3,259
 ZSWPOUT-64kB            n/a          n/a     623,451     822,268
 SWPOUT-64kB               0            0           0           0
 -------------------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
 =======================================================

 -------------------------------------------------------------------------------
                    mm-unstable 9-23-2024              zswap-mTHP     Change wrt
                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
                                 Baseline
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   145,616      139,640     169,404     141,168   16%       1%
 elapsed time (sec)    25.05        23.85       23.02       23.37    8%       2%
 sys time (sec)       790.53       676.34      613.26      677.83   22%    -0.2%
 memcg_high           16,702       25,197      17,374      23,890
 memcg_swap_fail      21,485       27,814         114         144
 pswpin                    0            0           0           0
 pswpout                   0            0           0           0
 zswpin                  793          852         778         922
 zswpout          10,011,709   13,186,882  10,010,893  13,195,600
 thp_swpout                0            0           0           0
 thp_swpout_          21,485       27,814         114         144
  fallback
 2048kB-mthp_            n/a          n/a           0           0
  swpout_fallback
 pgmajfault            2,701        2,822       4,151       5,066
 ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
 SWPOUT-2048kB             0            0           0           0
 -------------------------------------------------------------------------------

We mostly see improvements in throughput, elapsed and sys time for zstd and
deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y). The
one exception is sys time for deflate-iaa with 2M THP, which is essentially
unchanged (-0.2%).
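
The "Change wrt Baseline" columns are consistent with a relative change
against the baseline, signed so that a positive value means an improvement
(higher throughput, lower time). The sketch below reproduces the zstd
column of the 64KB table above under that assumption. The formula is
inferred from the table values, and the helper name improvement_pct() is
hypothetical.

```python
def improvement_pct(baseline, after, higher_is_better):
    """Percent change wrt baseline, signed so that positive means better."""
    if higher_is_better:
        return (after - baseline) / baseline * 100.0
    return (baseline - after) / baseline * 100.0


# zstd, 64KB mTHP, Case 1 (values taken from the table above).
print(round(improvement_pct(143_323, 153_550, True)))   # throughput   -> 7
print(round(improvement_pct(24.97, 23.90, False)))      # elapsed time -> 4
print(round(improvement_pct(822.72, 757.70, False)))    # sys time     -> 8
```

With this sign convention, a time that gets worse shows up as a negative
percentage.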


Case 2: Comparing SSD swap mTHP vs. zswap mTHP
==============================================

In this scenario, CONFIG_THP_SWAP is enabled in both the "before" and
"after" experiments. The "before" represents zswap rejecting mTHP, and the
mTHP being stored by the 823G SSD swap. The "after" represents data with
this patch-series, which results in 64K/2M (m)THP being processed and
stored by zswap.

 64KB mTHP (cgroup memory.high set to 40G):
 ==========================================

 -------------------------------------------------------------------------------
                    mm-unstable 9-23-2024              zswap-mTHP     Change wrt
                        CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
                                 Baseline
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    20,265       20,696     153,550     129,609   658%    526%
 elapsed time (sec)    72.44        70.86       23.90       25.19    67%     64%
 sys time (sec)        77.95        77.99      757.70      731.13  -872%   -837%
 memcg_high          115,811      113,277     148,075     192,744
 memcg_swap_fail       2,386        2,425       2,204       2,215
 pswpin                   16           16           0           0
 pswpout           7,774,235    7,616,069           0           0
 zswpin                  728          749         760         902
 zswpout              38,424       39,022  10,010,017  13,193,554
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 64kB-mthp_            2,386        2,425       2,204       2,215
  swpout_fallback
 pgmajfault            2,757        2,860       3,054       3,259
 ZSWPOUT-64kB            n/a          n/a     623,451     822,268
 SWPOUT-64kB         485,890      476,004           0           0
 -------------------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
 =======================================================

 -------------------------------------------------------------------------------
                    mm-unstable 9-23-2024              zswap-mTHP     Change wrt
                        CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
                                 Baseline
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    24,347       35,971     169,404     141,168    596%   292%
 elapsed time (sec)    63.52        64.59       23.02       23.37     64%    64%
 sys time (sec)        27.91        27.01      613.26      677.83  -2098% -2410%
 memcg_high           13,576       13,467      17,374      23,890
 memcg_swap_fail         162          124         114         144
 pswpin                    0            0           0           0
 pswpout           7,003,307    7,168,853           0           0
 zswpin                  741          722         778         922
 zswpout              84,429       65,315  10,010,893  13,195,600
 thp_swpout           13,678       14,002           0           0
 thp_swpout_             162          124         114         144
  fallback
 2048kB-mthp_            n/a          n/a           0           0
  swpout_fallback
 pgmajfault            3,345        2,903       4,151       5,066
 ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
 SWPOUT-2048kB        13,678       14,002           0           0
 -------------------------------------------------------------------------------

We see significant improvements in throughput and elapsed time for zstd and
deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
sys time increases with mTHP-ZSWAP as expected, due to the CPU compression
time vs. asynchronous disk write times, as pointed out by Ying and Yosry.

In the "Before" scenario, when zswap does not store mTHP, only allocations
count towards the cgroup memory limit. However, in the "After" scenario,
with the introduction of zswap_store() of mTHP, both the allocations and
the zswap compressed pool usage from all 70 processes are counted towards
the memory limit. As a result, we see higher swapout activity in the
"After" data. Hence, more time is spent doing reclaim as the zswap cgroup
charge leads to more frequent memory.high breaches.

Summary:
========
The v7 data presented above comparing zswap-mTHP with a conventional 823G
SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
it seems reasonable for zswap_store to support (m)THP, so that further
performance improvements can be implemented.

Some of the ideas that have shown promise in our experiments are:

1) IAA compress/decompress batching.
2) Distributing compress jobs across all IAA devices on the socket.

In the experimental setup used in this patchset, we have enabled
IAA compress verification to ensure additional hardware data integrity CRC
checks not currently done by the software compressors. The tests run for
this patchset also use only 1 IAA device per core, making use of the 2
compress engines on that device. In our experiments with IAA batching, we
distribute compress jobs from all cores to the 8 compress engines available
per socket. We further compress the pages in each mTHP in parallel in the
accelerator. As a result, we improve compress latency and reclaim
throughput.

The following compares the same usemem workload characteristics between:

1) zstd (v7 experiments)
2) deflate-iaa "Fixed mode" (v7 experiments)
3) deflate-iaa with batching
4) deflate-iaa-canned "Canned mode" [3] with batching

vm.page-cluster is set to "2" for all runs.

64K mTHP ZSWAP:
===============

 -------------------------------------------------------------------------------
 ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
 compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
                                                               vs.    vs.  Batch
 64K mTHP                                                    Seqtl  Fixed    vs.
                                                                            ZSTD
 ------------------------------------------------------------------------------- 
 Throughput    153,550     129,609     156,215     166,975   21%     7%       9%
     (KB/s)
 elapsed time    23.90       25.19       22.46       21.38   11%     5%      11%
        (sec)
 sys time       757.70      731.13      715.62      648.83    2%     9%      14%
    (sec)
 memcg_high    148,075     192,744     197,548     181,734
 memcg_swap_     2,204       2,215       2,293       2,263
  fail
 pswpin              0           0           0           0 
 pswpout             0           0           0           0 
 zswpin            760         902         774         833
 zswpout    10,010,017  13,193,554  13,193,176  12,125,616
 thp_swpout          0           0           0           0 
 thp_swpout_         0           0           0           0 
  fallback
 64kB-mthp_      2,204       2,215       2,293       2,263
  swpout_
  fallback
 pgmajfault      3,054       3,259       3,545       3,516
 ZSWPOUT-64kB  623,451     822,268     822,176     755,480
 SWPOUT-64kB         0           0           0           0 
 swap_ra           146         161         152         159
 swap_ra_hit        64         121          68          88
 -------------------------------------------------------------------------------
				   

2M THP ZSWAP:
=============

 -------------------------------------------------------------------------------
 ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
 compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
                                                               vs.    vs.  Batch
 2M THP                                                      Seqtl  Fixed    vs.
                                                                            ZSTD
 ------------------------------------------------------------------------------- 
 Throughput    169,404     141,168     175,089     193,407     24%    10%    14%
     (KB/s)
 elapsed time    23.02       23.37       21.13       19.97     10%     5%    13%
        (sec)
 sys time       613.26      677.83      630.51      533.80      7%    15%    13%
    (sec)
 memcg_high     17,374      23,890      24,349      22,374
 memcg_swap_       114         144         102          88
  fail
 pswpin              0           0           0           0
 pswpout             0           0           0           0
 zswpin            778         922       6,492       6,642
 zswpout    10,010,893  13,195,600  13,199,907  12,132,265
 thp_swpout          0           0           0           0
 thp_swpout_       114         144         102          88
  fallback
 pgmajfault      4,151       5,066       5,032       4,999
 ZSWPOUT-2MB    19,442      25,615      25,666      23,594
 SWPOUT-2MB          0           0           0           0
 swap_ra             3           9       4,383       4,494
 swap_ra_hit         2           6       4,298       4,412
 -------------------------------------------------------------------------------


With ZSWAP IAA compress/decompress batching, we are able to demonstrate
significant performance improvements and memory savings in scalability
experiments under memory pressure, as compared to software compressors. We
hope to submit this work in subsequent patch series.

Thanks,
Kanchana

[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/


Kanchana P Sridhar (8):
  mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  mm: zswap: Modify zswap_compress() to accept a page instead of a
    folio.
  mm: zswap: Refactor code to store an entry in zswap xarray.
  mm: zswap: Refactor code to delete stored offsets in case of errors.
  mm: zswap: Compress and store a specific page in a folio.
  mm: zswap: Support mTHP swapout in zswap_store().
  mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
    stats.
  mm: Document the newly added mTHP zswpout stats, clarify swpout
    semantics.

 Documentation/admin-guide/mm/transhuge.rst |   8 +-
 include/linux/huge_mm.h                    |   1 +
 include/linux/memcontrol.h                 |   4 +
 mm/Kconfig                                 |   8 +
 mm/huge_memory.c                           |   3 +
 mm/page_io.c                               |   1 +
 mm/zswap.c                                 | 248 ++++++++++++++++-----
 7 files changed, 210 insertions(+), 63 deletions(-)


base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049

Comments

Yosry Ahmed Sept. 24, 2024, 7:34 p.m. UTC | #1
On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.

These are implementation details that are not very useful here, you
can just mention that the first few patches do refactoring prep work.

>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. When disabled, zswap will
> fall back to rejecting the mTHP folio, which is then processed by the
> backing swap device.

Why is this needed? Do we just not have enough confidence in the
feature yet, or are there some cases that regress from enabling mTHP
for zswapout?

Does generic mTHP swapout/swapin also use config options?

>
> This patch-series is a pre-requisite for ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
> helpful feedback, data reviews and suggestions!
>
> Co-development signoff request:
> ===============================
> I would like to request Ryan Roberts' co-developer signoff on patches
> 5 and 6 in this series. Thanks Ryan!
>
> Changes since v6:
> =================

Please put the changelog at the very end, I almost missed the
performance evaluation.

> 1) Rebased to mm-unstable as of 9-23-2024,
>    commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> 2) Refactored into smaller commits, as suggested by Yosry and
>    Chengming. Thanks both!
> 3) Reworded the commit log for patches 5 and 6 as per Yosry's
>    suggestion. Thanks Yosry!
> 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
>    partition. Also, all experiments are run with usemem --sleep 10, so that
>    the memory allocated by the 70 processes remains in memory
>    longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
>    their help with refining the performance characterization methodology.
> 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
>    Nhat. Thanks Nhat!
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 8/29/2024,
>    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
>    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
>    suggestion to add a knob by which users can enable/disable this
>    change. Nhat, I hope this is along the lines of what you were
>    thinking.
> 3) Added vm-scalability usemem data with 4K folios with
>    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
>    there is no regression with this change.
> 4) Added data with usemem with 64K and 2M THP for an alternate view of
>    before/after, as suggested by Yosry, so we can understand the impact
>    of when mTHPs are split into 4K folios in shrink_folio_list()
>    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
>    in zswap. Thanks Yosry for this suggestion.
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>    Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
>    robot; as per Nhat's and Michal's suggestion to not require a separate
>    patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>    suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters, and
>    invoke count_mthp_stat() after successful zswap_store()s; into a single
>    commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
>
> Regression Testing:
> ===================
> I ran vm-scalability usemem with 70 processes without mTHP, i.e., only 4K
> folios, with mm-unstable and with this patch-series. The main goal was
> to make sure that there is no functional or performance regression
> wrt the earlier zswap behavior for 4K folios when
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
> pages goes through the newly added code path [zswap_store(),
> zswap_store_page()].
>
> The data indicates there is no regression.
>
>  ------------------------------------------------------------------------------
>                      mm-unstable 8-28-2024                        zswap-mTHP v6
>                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
>                                                                      is not set
>  ------------------------------------------------------------------------------
>  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
>                                        iaa                                  iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    110,775      113,010               111,550        121,937
>  sys time (sec)      1,141.72       954.87              1,131.95         828.47
>  memcg_high           140,500      153,737               139,772        134,129
>  memcg_swap_high            0            0                     0              0
>  memcg_swap_fail            0            0                     0              0
>  pswpin                     0            0                     0              0
>  pswpout                    0            0                     0              0
>  zswpin                   675          690                   682            684
>  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
>  thp_swpout                 0            0                     0              0
>  thp_swpout_                0            0                     0              0
>   fallback
>  pgmajfault             3,453        3,468                 3,841          3,487
>  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
>  SWPOUT-64kB-mTHP           0            0                     0              0
>  ------------------------------------------------------------------------------

It's probably better to put the zstd columns next to each other, and
the deflate-iaa columns next to each other, for easier visual
comparisons.

>
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with mm-unstable as of 9-23-2024,
> commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
> without/with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> 823G SSD disk partition swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory, and sleeping for 10 sec before exiting:
>
>     usemem --init-time -w -O -s 10 -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
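
For reference, the per-order counters mentioned above can be collected with a
short script. This is a minimal sketch, not part of the patch-series: it
assumes the sysfs layout given in the cover letter
(hugepages-<size>kB/stats/zswpout under /sys/kernel/mm/transparent_hugepage).
The function name read_mthp_stat and its "root" parameter are hypothetical,
introduced here only so the code can be pointed at a test directory.

```python
import glob
import os
import re

# Location taken from the cover letter; adjust if the layout differs.
THP_SYSFS_ROOT = "/sys/kernel/mm/transparent_hugepage"


def read_mthp_stat(stat="zswpout", root=THP_SYSFS_ROOT):
    """Return {size_in_kB: counter} for every hugepages-<size>kB directory."""
    counters = {}
    pattern = os.path.join(root, "hugepages-*kB", "stats", stat)
    for path in glob.glob(pattern):
        # Two levels above the stat file is the hugepages-<size>kB directory.
        size_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        match = re.fullmatch(r"hugepages-(\d+)kB", size_dir)
        if match is None:
            continue
        with open(path) as f:
            counters[int(match.group(1))] = int(f.read().strip())
    return counters


if __name__ == "__main__":
    for size_kb, count in sorted(read_mthp_stat().items()):
        print(f"hugepages-{size_kb}kB zswpout: {count}")
```

Sampling the counters before and after a run and subtracting gives per-run
deltas comparable to the ZSWPOUT-64kB and ZSWPOUT-2048kB rows in the tables.
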
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressors : zstd, deflate-iaa
>     ZSWAP Allocator   : zsmalloc
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. elapsed/sys times are measured with perf. All data
> points per compressor/kernel/mTHP configuration are averaged across 3 runs.
>
> Case 1: Comparing zswap 4K vs. zswap mTHP
> =========================================
>
> In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results in
> 64K/2M (m)THP being split into 4K folios that get processed by zswap.
>
> The "after" is CONFIG_THP_SWAP set to on, together with this patch-series,
> which results in 64K/2M (m)THP being processed by zswap without being split.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>  memcg_high          132,743      169,825     148,075     192,744
>  memcg_swap_fail     639,067      841,553       2,204       2,215
>  pswpin                    0            0           0           0
>  pswpout                   0            0           0           0
>  zswpin                  795          873         760         902
>  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  64kB-mthp_          639,065      841,553       2,204       2,215
>   swpout_fallback
>  pgmajfault            2,861        2,924       3,054       3,259
>  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>  SWPOUT-64kB               0            0           0           0
>  -------------------------------------------------------------------------------
>
>
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   145,616      139,640     169,404     141,168   16%       1%
>  elapsed time (sec)    25.05        23.85       23.02       23.37    8%       2%
>  sys time (sec)       790.53       676.34      613.26      677.83   22%    -0.2%
>  memcg_high           16,702       25,197      17,374      23,890
>  memcg_swap_fail      21,485       27,814         114         144
>  pswpin                    0            0           0           0
>  pswpout                   0            0           0           0
>  zswpin                  793          852         778         922
>  zswpout          10,011,709   13,186,882  10,010,893  13,195,600
>  thp_swpout                0            0           0           0
>  thp_swpout_          21,485       27,814         114         144
>   fallback
>  2048kB-mthp_            n/a          n/a           0           0
>   swpout_fallback
>  pgmajfault            2,701        2,822       4,151       5,066
>  ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
>  SWPOUT-2048kB             0            0           0           0
>  -------------------------------------------------------------------------------
>
> We mostly see improvements in throughput, elapsed and sys time for zstd and
> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
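
The "Change wrt Baseline" columns are not defined explicitly in the cover
letter. The sketch below shows a convention that appears to reproduce the
Case 1 figures: throughput is reported as relative gain, while elapsed and sys
time are reported as relative reduction, so a positive number is an
improvement in both cases. This is inferred from the table values and is an
assumption; the function names are hypothetical.

```python
def throughput_change_pct(baseline, patched):
    """Relative throughput gain; positive when the patched kernel is faster."""
    return (patched - baseline) / baseline * 100.0


def time_change_pct(baseline, patched):
    """Relative time reduction; positive when the patched kernel takes less time."""
    return (baseline - patched) / baseline * 100.0


# 64KB mTHP, zstd, values copied from the Case 1 table.
print(round(throughput_change_pct(143_323, 153_550)))  # Throughput   -> 7
print(round(time_change_pct(24.97, 23.90)))            # elapsed time -> 4
print(round(time_change_pct(822.72, 757.70)))          # sys time     -> 8
```

Under the same convention, a negative sys-time figure means the patched
kernel spent more CPU time in the kernel than the baseline.
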
>
>
> Case 2: Comparing SSD swap mTHP vs. zswap mTHP
> ==============================================
>
> In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
> experiments. The "before" represents zswap rejecting mTHP, and the mTHP
> being stored by the 823G SSD swap. The "after" represents data with this
> patch-series, that results in 64K/2M (m)THP being processed and stored by
> zswap.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)    20,265       20,696     153,550     129,609   658%    526%
>  elapsed time (sec)    72.44        70.86       23.90       25.19    67%     64%
>  sys time (sec)        77.95        77.99      757.70      731.13  -872%   -837%
>  memcg_high          115,811      113,277     148,075     192,744
>  memcg_swap_fail       2,386        2,425       2,204       2,215
>  pswpin                   16           16           0           0
>  pswpout           7,774,235    7,616,069           0           0
>  zswpin                  728          749         760         902
>  zswpout              38,424       39,022  10,010,017  13,193,554
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  64kB-mthp_            2,386        2,425       2,204       2,215
>   swpout_fallback
>  pgmajfault            2,757        2,860       3,054       3,259
>  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>  SWPOUT-64kB         485,890      476,004           0           0
>  -------------------------------------------------------------------------------
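
One way to sanity-check the 64KB table above is to convert the per-order
folio counters into 4K pages and compare them with the page-granular vmstat
counters. The sketch below assumes 4 KiB base pages, that SWPOUT-64kB and
ZSWPOUT-64kB count folios, and that pswpout and zswpout count base pages;
these are assumptions for illustration, and the helper name pages_in_folios
is hypothetical.

```python
BASE_PAGE_KB = 4  # assumption: 4 KiB base pages


def pages_in_folios(folio_count, folio_kb, base_page_kb=BASE_PAGE_KB):
    """Number of base pages covered by folio_count folios of folio_kb each."""
    return folio_count * (folio_kb // base_page_kb)


# Baseline, zstd: 64K folios written to the SSD, to compare with pswpout.
ssd_pages = pages_in_folios(485_890, 64)
# zswap-mTHP, zstd: 64K folios stored in zswap, to compare with zswpout.
zswap_pages = pages_in_folios(623_451, 64)

print(ssd_pages)                            # 7774240 vs. pswpout 7,774,235
print(zswap_pages)                          # 9975216 vs. zswpout 10,010,017
print(f"{zswap_pages / 10_010_017:.1%}")    # share of zswpout stored as 64K folios
```

The small residuals are plausibly due to averaging across runs and to folios
that fell back to 4K processing, but the cover letter does not break this
down.
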
>
>
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)    24,347       35,971     169,404     141,168    596%   292%
>  elapsed time (sec)    63.52        64.59       23.02       23.37     64%    64%
>  sys time (sec)        27.91        27.01      613.26      677.83  -2098% -2410%
>  memcg_high           13,576       13,467      17,374      23,890
>  memcg_swap_fail         162          124         114         144
>  pswpin                    0            0           0           0
>  pswpout           7,003,307    7,168,853           0           0
>  zswpin                  741          722         778         922
>  zswpout              84,429       65,315  10,010,893  13,195,600
>  thp_swpout           13,678       14,002           0           0
>  thp_swpout_             162          124         114         144
>   fallback
>  2048kB-mthp_            n/a          n/a           0           0
>   swpout_fallback
>  pgmajfault            3,345        2,903       4,151       5,066
>  ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
>  SWPOUT-2048kB        13,678       14,002           0           0
>  -------------------------------------------------------------------------------
>
> We see significant improvements in throughput and elapsed time for zstd and
> deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
> sys time increases with mTHP-ZSWAP as expected, due to the CPU compression
> time vs. asynchronous disk write times, as pointed out by Ying and Yosry.
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() mTHP, both, allocations as well as
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
>
> Summary:
> ========
> The v7 data presented above comparing zswap-mTHP with a conventional 823G
> SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
> it seems reasonable for zswap_store to support (m)THP, so that further
> performance improvements can be implemented.
>
> Some of the ideas that have shown promise in our experiments are:
>
> 1) IAA compress/decompress batching.
> 2) Distributing compress jobs across all IAA devices on the socket.
>
> In the experimental setup used in this patchset, we have enabled
> IAA compress verification to ensure additional hardware data integrity CRC
> checks not currently done by the software compressors. The tests run for
> this patchset are also using only 1 IAA device per core, that avails of 2
> compress engines on the device. In our experiments with IAA batching, we
> distribute compress jobs from all cores to the 8 compress engines available
> per socket. We further compress the pages in each mTHP in parallel in the
> accelerator. As a result, we improve compress latency and reclaim
> throughput.
>
> The following compares the same usemem workload characteristics between:
>
> 1) zstd (v7 experiments)
> 2) deflate-iaa "Fixed mode" (v7 experiments)
> 3) deflate-iaa with batching
> 4) deflate-iaa-canned "Canned mode" [3] with batching
>
> vm.page-cluster is set to "2" for all runs.
>
> 64K mTHP ZSWAP:
> ===============
>
>  -------------------------------------------------------------------------------
>  ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
>  compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
>                                                                vs.    vs.  Batch
>  64K mTHP                                                    Seqtl  Fixed    vs.
>                                                                             ZSTD
>  -------------------------------------------------------------------------------
>  Throughput    153,550     129,609     156,215     166,975   21%     7%       9%
>      (KB/s)
>  elapsed time    23.90       25.19       22.46       21.38   11%     5%      11%
>         (sec)
>  sys time       757.70      731.13      715.62      648.83    2%     9%      14%
>     (sec)
>  memcg_high    148,075     192,744     197,548     181,734
>  memcg_swap_     2,204       2,215       2,293       2,263
>   fail
>  pswpin              0           0           0           0
>  pswpout             0           0           0           0
>  zswpin            760         902         774         833
>  zswpout    10,010,017  13,193,554  13,193,176  12,125,616
>  thp_swpout          0           0           0           0
>  thp_swpout_         0           0           0           0
>   fallback
>  64kB-mthp_      2,204       2,215       2,293       2,263
>   swpout_
>   fallback
>  pgmajfault      3,054       3,259       3,545       3,516
>  ZSWPOUT-64kB  623,451     822,268     822,176     755,480
>  SWPOUT-64kB         0           0           0           0
>  swap_ra           146         161         152         159
>  swap_ra_hit        64         121          68          88
>  -------------------------------------------------------------------------------
>
>
> 2M THP ZSWAP:
> =============
>
>  -------------------------------------------------------------------------------
>  ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
>  compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
>                                                                vs.    vs.  Batch
>  2M THP                                                      Seqtl  Fixed    vs.
>                                                                             ZSTD
>  -------------------------------------------------------------------------------
>  Throughput    169,404     141,168     175,089     193,407     24%    10%    14%
>      (KB/s)
>  elapsed time    23.02       23.37       21.13       19.97     10%     5%    13%
>         (sec)
>  sys time       613.26      677.83      630.51      533.80      7%    15%    13%
>     (sec)
>  memcg_high     17,374      23,890      24,349      22,374
>  memcg_swap_       114         144         102          88
>   fail
>  pswpin              0           0           0           0
>  pswpout             0           0           0           0
>  zswpin            778         922       6,492       6,642
>  zswpout    10,010,893  13,195,600  13,199,907  12,132,265
>  thp_swpout          0           0           0           0
>  thp_swpout_       114         144         102          88
>   fallback
>  pgmajfault      4,151       5,066       5,032       4,999
>  ZSWPOUT-2MB    19,442      25,615      25,666      23,594
>  SWPOUT-2MB          0           0           0           0
>  swap_ra             3           9       4,383       4,494
>  swap_ra_hit         2           6       4,298       4,412
>  -------------------------------------------------------------------------------
>
>
> With ZSWAP IAA compress/decompress batching, we are able to demonstrate
> significant performance improvements and memory savings in scalability
> experiments under memory pressure, as compared to software compressors. We
> hope to submit this work in subsequent patch series.

Honestly I would remove the detailed results of the followup series
for batching, it should be enough to mention a single figure for
further expected improvement from ongoing work that depends on this.

>
> Thanks,
> Kanchana
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> [3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
>
>
> Kanchana P Sridhar (8):
>   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
>   mm: zswap: Modify zswap_compress() to accept a page instead of a
>     folio.
>   mm: zswap: Refactor code to store an entry in zswap xarray.
>   mm: zswap: Refactor code to delete stored offsets in case of errors.
>   mm: zswap: Compress and store a specific page in a folio.
>   mm: zswap: Support mTHP swapout in zswap_store().
>   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
>     stats.
>   mm: Document the newly added mTHP zswpout stats, clarify swpout
>     semantics.
>
>  Documentation/admin-guide/mm/transhuge.rst |   8 +-
>  include/linux/huge_mm.h                    |   1 +
>  include/linux/memcontrol.h                 |   4 +
>  mm/Kconfig                                 |   8 +
>  mm/huge_memory.c                           |   3 +
>  mm/page_io.c                               |   1 +
>  mm/zswap.c                                 | 248 ++++++++++++++++-----
>  7 files changed, 210 insertions(+), 63 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> --
> 2.27.0
>
Kanchana P Sridhar Sept. 24, 2024, 10:50 p.m. UTC | #2
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> 
> These are implementation details that are not very useful here, you
> can just mention that the first few patches do refactoring prep work.

Thanks Yosry for the comments! Sure, I will reword this as you've
suggested in v8.

> 
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. When disabled, zswap will
> > fallback to rejecting the mTHP folio, to be processed by the backing
> > swap device.
> 
> Why is this needed? Do we just not have enough confidence in the
> feature yet, or are there some cases that regress from enabling mTHP
> for zswapout?
> 
> Does generic mTHP swapout/swapin also use config options?

As discussed in the other comments' follow-up, I will delete the config
option and runtime knob.

> 
> >
> > This patch-series is a pre-requisite for ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
> > helpful feedback, data reviews and suggestions!
> >
> > Co-development signoff request:
> > ===============================
> > I would like to request Ryan Roberts' co-developer signoff on patches
> > 5 and 6 in this series. Thanks Ryan!
> >
> > Changes since v6:
> > =================
> 
> Please put the changelog at the very end, I almost missed the
> performance evaluation.

Sure, will fix this.

> 
> > 1) Rebased to mm-unstable as of 9-23-2024,
> >    commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> > 2) Refactored into smaller commits, as suggested by Yosry and
> >    Chengming. Thanks both!
> > 3) Reworded the commit log for patches 5 and 6 as per Yosry's
> >    suggestion. Thanks Yosry!
> > 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
> >    partition. Also, all experiments are run with usemem --sleep 10, so that
> >    the memory allocated by the 70 processes remains in memory
> >    longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
> >    their help with refining the performance characterization methodology.
> > 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
> >    Nhat. Thanks Nhat!
> >
> > Changes since v5:
> > =================
> > 1) Rebased to mm-unstable as of 8/29/2024,
> >    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> >    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> >    suggestion to add a knob by which users can enable/disable this
> >    change. Nhat, I hope this is along the lines of what you were
> >    thinking.
> > 3) Added vm-scalability usemem data with 4K folios with
> >    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> >    there is no regression with this change.
> > 4) Added data with usemem with 64K and 2M THP for an alternate view of
> >    before/after, as suggested by Yosry, so we can understand the impact
> >    of when mTHPs are split into 4K folios in shrink_folio_list()
> >    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> >    in zswap. Thanks Yosry for this suggestion.
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >    robot; as per Nhat's and Michal's suggestion to not require a separate
> >    patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> >    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> >
> > Regression Testing:
> > ===================
> > I ran vm-scalability usemem with 70 processes without mTHP, i.e., only 4K
> > folios, with mm-unstable and with this patch-series. The main goal was
> > to make sure that there is no functional or performance regression
> > wrt the earlier zswap behavior for 4K folios when
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
> > pages goes through the newly added code path [zswap_store(),
> > zswap_store_page()].
> >
> > The data indicates there is no regression.
> >
> >  ------------------------------------------------------------------------------
> >                      mm-unstable 8-28-2024                        zswap-mTHP v6
> >                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
> >                                                                      is not set
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
> >                                        iaa                                  iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    110,775      113,010               111,550        121,937
> >  sys time (sec)      1,141.72       954.87              1,131.95         828.47
> >  memcg_high           140,500      153,737               139,772        134,129
> >  memcg_swap_high            0            0                     0              0
> >  memcg_swap_fail            0            0                     0              0
> >  pswpin                     0            0                     0              0
> >  pswpout                    0            0                     0              0
> >  zswpin                   675          690                   682            684
> >  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
> >  thp_swpout                 0            0                     0              0
> >  thp_swpout_                0            0                     0              0
> >   fallback
> >  pgmajfault             3,453        3,468                 3,841          3,487
> >  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
> >  SWPOUT-64kB-mTHP           0            0                     0              0
> >  ------------------------------------------------------------------------------
> 
> It's probably better to put the zstd columns next to each other, and
> the deflate-iaa columns next to each other, for easier visual
> comparisons.

Sure. Will change this accordingly, in v8.

> 
> >
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with mm-unstable as of 9-23-2024,
> > commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
> > without/with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> > 823G SSD disk partition swap. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory, and sleeping for 10 sec before exiting:
> >
> >     usemem --init-time -w -O -s 10 -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. elapsed/sys times are measured with perf. All data
> > points per compressor/kernel/mTHP configuration are averaged across 3 runs.
> >
> > Case 1: Comparing zswap 4K vs. zswap mTHP
> > =========================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> > in 64K/2M (m)THP to not be split, and processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
> >  memcg_high          132,743      169,825     148,075     192,744
> >  memcg_swap_fail     639,067      841,553       2,204       2,215
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  795          873         760         902
> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  64kB-mthp_          639,065      841,553       2,204       2,215
> >   swpout_fallback
> >  pgmajfault            2,861        2,924       3,054       3,259
> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >  SWPOUT-64kB               0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   145,616      139,640     169,404     141,168   16%       1%
> >  elapsed time (sec)    25.05        23.85       23.02       23.37    8%       2%
> >  sys time (sec)       790.53       676.34      613.26      677.83   22%    -0.2%
> >  memcg_high           16,702       25,197      17,374      23,890
> >  memcg_swap_fail      21,485       27,814         114         144
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  793          852         778         922
> >  zswpout          10,011,709   13,186,882  10,010,893  13,195,600
> >  thp_swpout                0            0           0           0
> >  thp_swpout_          21,485       27,814         114         144
> >   fallback
> >  2048kB-mthp_            n/a          n/a           0           0
> >   swpout_fallback
> >  pgmajfault            2,701        2,822       4,151       5,066
> >  ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
> >  SWPOUT-2048kB             0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> > We mostly see improvements in throughput, elapsed and sys time for zstd and
> > deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
> >
> >
> > Case 2: Comparing SSD swap mTHP vs. zswap mTHP
> > ==============================================
> >
> > In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
> > experiments. The "before" represents zswap rejecting mTHP, and the mTHP
> > being stored by the 823G SSD swap. The "after" represents data with this
> > patch-series, that results in 64K/2M (m)THP being processed and stored by
> > zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)    20,265       20,696     153,550     129,609   658%    526%
> >  elapsed time (sec)    72.44        70.86       23.90       25.19    67%     64%
> >  sys time (sec)        77.95        77.99      757.70      731.13  -872%   -837%
> >  memcg_high          115,811      113,277     148,075     192,744
> >  memcg_swap_fail       2,386        2,425       2,204       2,215
> >  pswpin                   16           16           0           0
> >  pswpout           7,774,235    7,616,069           0           0
> >  zswpin                  728          749         760         902
> >  zswpout              38,424       39,022  10,010,017  13,193,554
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  64kB-mthp_            2,386        2,425       2,204       2,215
> >   swpout_fallback
> >  pgmajfault            2,757        2,860       3,054       3,259
> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >  SWPOUT-64kB         485,890      476,004           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)    24,347       35,971     169,404     141,168    596%   292%
> >  elapsed time (sec)    63.52        64.59       23.02       23.37     64%    64%
> >  sys time (sec)        27.91        27.01      613.26      677.83  -2098% -2410%
> >  memcg_high           13,576       13,467      17,374      23,890
> >  memcg_swap_fail         162          124         114         144
> >  pswpin                    0            0           0           0
> >  pswpout           7,003,307    7,168,853           0           0
> >  zswpin                  741          722         778         922
> >  zswpout              84,429       65,315  10,010,893  13,195,600
> >  thp_swpout           13,678       14,002           0           0
> >  thp_swpout_             162          124         114         144
> >   fallback
> >  2048kB-mthp_            n/a          n/a           0           0
> >   swpout_fallback
> >  pgmajfault            3,345        2,903       4,151       5,066
> >  ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
> >  SWPOUT-2048kB        13,678       14,002           0           0
> >  -------------------------------------------------------------------------------
> >
> > We see significant improvements in throughput and elapsed time for zstd and
> > deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
> > sys time increases with mTHP-ZSWAP as expected, due to the CPU compression
> > time vs. asynchronous disk write times, as pointed out by Ying and Yosry.
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both, allocations as well as
> > the zswap compressed pool usage from all 70 processes are counted towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > Summary:
> > ========
> > The v7 data presented above comparing zswap-mTHP with a conventional 823G
> > SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
> > it seems reasonable for zswap_store to support (m)THP, so that further
> > performance improvements can be implemented.
> >
> > Some of the ideas that have shown promise in our experiments are:
> >
> > 1) IAA compress/decompress batching.
> > 2) Distributing compress jobs across all IAA devices on the socket.
> >
> > In the experimental setup used in this patchset, we have enabled
> > IAA compress verification to ensure additional hardware data integrity CRC
> > checks not currently done by the software compressors. The tests run for
> > this patchset are also using only 1 IAA device per core, that avails of 2
> > compress engines on the device. In our experiments with IAA batching, we
> > distribute compress jobs from all cores to the 8 compress engines available
> > per socket. We further compress the pages in each mTHP in parallel in the
> > accelerator. As a result, we improve compress latency and reclaim
> > throughput.
> >
> > The following compares the same usemem workload characteristics between:
> >
> > 1) zstd (v7 experiments)
> > 2) deflate-iaa "Fixed mode" (v7 experiments)
> > 3) deflate-iaa with batching
> > 4) deflate-iaa-canned "Canned mode" [3] with batching
> >
> > vm.page-cluster is set to "2" for all runs.
> >
> > 64K mTHP ZSWAP:
> > ===============
> >
> >  -------------------------------------------------------------------------------
> >  ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
> >  compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
> >                                                                vs.    vs.  Batch
> >  64K mTHP                                                    Seqtl  Fixed    vs.
> >                                                                             ZSTD
> >  -------------------------------------------------------------------------------
> >  Throughput    153,550     129,609     156,215     166,975   21%     7%       9%
> >      (KB/s)
> >  elapsed time    23.90       25.19       22.46       21.38   11%     5%      11%
> >         (sec)
> >  sys time       757.70      731.13      715.62      648.83    2%     9%      14%
> >     (sec)
> >  memcg_high    148,075     192,744     197,548     181,734
> >  memcg_swap_     2,204       2,215       2,293       2,263
> >   fail
> >  pswpin              0           0           0           0
> >  pswpout             0           0           0           0
> >  zswpin            760         902         774         833
> >  zswpout    10,010,017  13,193,554  13,193,176  12,125,616
> >  thp_swpout          0           0           0           0
> >  thp_swpout_         0           0           0           0
> >   fallback
> >  64kB-mthp_      2,204       2,215       2,293       2,263
> >   swpout_
> >   fallback
> >  pgmajfault      3,054       3,259       3,545       3,516
> >  ZSWPOUT-64kB  623,451     822,268     822,176     755,480
> >  SWPOUT-64kB         0           0           0           0
> >  swap_ra           146         161         152         159
> >  swap_ra_hit        64         121          68          88
> >  -------------------------------------------------------------------------------
> >
> >
> > 2M THP ZSWAP:
> > =============
> >
> >  -------------------------------------------------------------------------------
> >  ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
> >  compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
> >                                                                vs.    vs.  Batch
> >  2M THP                                                      Seqtl  Fixed    vs.
> >                                                                             ZSTD
> >  -------------------------------------------------------------------------------
> >  Throughput    169,404     141,168     175,089     193,407     24%    10%    14%
> >      (KB/s)
> >  elapsed time    23.02       23.37       21.13       19.97     10%     5%    13%
> >         (sec)
> >  sys time       613.26      677.83      630.51      533.80      7%    15%    13%
> >     (sec)
> >  memcg_high     17,374      23,890      24,349      22,374
> >  memcg_swap_       114         144         102          88
> >   fail
> >  pswpin              0           0           0           0
> >  pswpout             0           0           0           0
> >  zswpin            778         922       6,492       6,642
> >  zswpout    10,010,893  13,195,600  13,199,907  12,132,265
> >  thp_swpout          0           0           0           0
> >  thp_swpout_       114         144         102          88
> >   fallback
> >  pgmajfault      4,151       5,066       5,032       4,999
> >  ZSWPOUT-2MB    19,442      25,615      25,666      23,594
> >  SWPOUT-2MB          0           0           0           0
> >  swap_ra             3           9       4,383       4,494
> >  swap_ra_hit         2           6       4,298       4,412
> >  -------------------------------------------------------------------------------
> >
> >
> > With ZSWAP IAA compress/decompress batching, we are able to demonstrate
> > significant performance improvements and memory savings in scalability
> > experiments under memory pressure, as compared to software compressors. We
> > hope to submit this work in subsequent patch series.
> 
> Honestly I would remove the detailed results of the followup series
> for batching, it should be enough to mention a single figure for
> further expected improvement from ongoing work that depends on this.

Definitely, will summarize the results of batching in the cover letter for v8.

Thanks,
Kanchana

> 
> >
> > Thanks,
> > Kanchana
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> > [3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
> >
> >
> > Kanchana P Sridhar (8):
> >   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> >   mm: zswap: Modify zswap_compress() to accept a page instead of a
> >     folio.
> >   mm: zswap: Refactor code to store an entry in zswap xarray.
> >   mm: zswap: Refactor code to delete stored offsets in case of errors.
> >   mm: zswap: Compress and store a specific page in a folio.
> >   mm: zswap: Support mTHP swapout in zswap_store().
> >   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
> >     stats.
> >   mm: Document the newly added mTHP zswpout stats, clarify swpout
> >     semantics.
> >
> >  Documentation/admin-guide/mm/transhuge.rst |   8 +-
> >  include/linux/huge_mm.h                    |   1 +
> >  include/linux/memcontrol.h                 |   4 +
> >  mm/Kconfig                                 |   8 +
> >  mm/huge_memory.c                           |   3 +
> >  mm/page_io.c                               |   1 +
> >  mm/zswap.c                                 | 248 ++++++++++++++++-----
> >  7 files changed, 210 insertions(+), 63 deletions(-)
> >
> >
> > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> > --
> > 2.27.0
> >
Huang, Ying Sept. 25, 2024, 6:35 a.m. UTC | #3
Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:

[snip]

>
> Case 1: Comparing zswap 4K vs. zswap mTHP
> =========================================
>
> In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
>
> The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> in 64K/2M (m)THP to not be split, and processed by zswap.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>  memcg_high          132,743      169,825     148,075     192,744
>  memcg_swap_fail     639,067      841,553       2,204       2,215
>  pswpin                    0            0           0           0
>  pswpout                   0            0           0           0
>  zswpin                  795          873         760         902
>  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  64kB-mthp_          639,065      841,553       2,204       2,215
>   swpout_fallback
>  pgmajfault            2,861        2,924       3,054       3,259
>  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>  SWPOUT-64kB               0            0           0           0
>  -------------------------------------------------------------------------------
>

IIUC, the throughput is the sum of the throughputs of all usemem processes?

One possible issue with the usemem test case is "imbalance".  That
is, some usemem processes may swap-out/swap-in less, so their score is
very high, while other processes may swap-out/swap-in more, so their
score is very low.  Sometimes the total score decreases, but the scores
of the usemem processes are more balanced, so the performance should be
considered better.  In general, we should make the usemem scores
balanced among processes, for example by using a longer test time.  Can
you check this in your test results?
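
To make the concern concrete, here is a small sketch with made-up
per-process scores (not measured data): run B has a lower total than
run A, yet its scores are much more evenly spread. The balance metrics
used here (min/max ratio and coefficient of variation) and all names
are only one possible, illustrative choice.

```python
from statistics import mean, pstdev

# Hypothetical per-process usemem scores in KB/s; NOT measured data.
run_a = [190_000, 185_000, 120_000, 115_000]  # higher total, imbalanced
run_b = [150_000, 149_000, 148_000, 147_000]  # lower total, balanced


def balance(scores):
    """Return the total, min/max ratio and coefficient of variation."""
    return {
        "total": sum(scores),
        "min_max_ratio": min(scores) / max(scores),
        "cv": pstdev(scores) / mean(scores),
    }


a = balance(run_a)
b = balance(run_b)
for name, stats in (("A", a), ("B", b)):
    print(f"run {name}: total={stats['total']:,} "
          f"min/max={stats['min_max_ratio']:.2f} cv={stats['cv']:.3f}")
```

Judged by the total alone, run A looks better; judged by balance, run B
does, which is why the per-process distribution is worth checking.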

[snip]

--
Best Regards,
Huang, Ying
Kanchana P Sridhar Sept. 25, 2024, 6:39 p.m. UTC | #4
> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Tuesday, September 24, 2024 11:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> 
> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> 
> [snip]
> 
> >
> > Case 1: Comparing zswap 4K vs. zswap mTHP
> > =========================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> > in 64K/2M (m)THP to not be split, and processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
> >  memcg_high          132,743      169,825     148,075     192,744
> >  memcg_swap_fail     639,067      841,553       2,204       2,215
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  795          873         760         902
> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  64kB-mthp_          639,065      841,553       2,204       2,215
> >   swpout_fallback
> >  pgmajfault            2,861        2,924       3,054       3,259
> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >  SWPOUT-64kB               0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> 
> IIUC, the throughput is the sum of the throughputs of all usemem processes?
> 
> One possible issue with the usemem test case is "imbalance".  That
> is, some usemem processes may swap-out/swap-in less, so their score is
> very high, while other processes may swap-out/swap-in more, so their
> score is very low.  Sometimes the total score decreases, but the scores
> of the usemem processes are more balanced, so the performance should be
> considered better.  In general, we should make the usemem scores
> balanced among processes, for example by using a longer test time.  Can
> you check this in your test results?

Actually, the throughput data listed in the cover-letter is the average of
all the usemem processes. Your observation about the "imbalance" issue is
right. Some processes see a higher throughput than others. I have noticed
that the throughputs progressively reduce as the individual processes exit
and print their stats.

Listed below are the stats from two runs of the 70-process usemem test,
one with sleep 10 and one with sleep 30. Both were run with a cgroup
memory limit of 40G, using v7 with 64K folios enabled and zstd as the
zswap compressor.


 -----------------------------------------------
               sleep 10           sleep 30
      Throughput (KB/s)  Throughput (KB/s)
 -----------------------------------------------
                181,540            191,686
                179,651            191,459
                179,068            188,834
                177,244            187,568
                177,215            186,703
                176,565            185,584
                176,546            185,370
                176,470            185,021
                176,214            184,303
                176,128            184,040
                175,279            183,932
                174,745            180,831
                173,935            179,418
                161,546            168,014
                160,332            167,540
                160,122            167,364
                159,613            167,020
                159,546            166,590
                159,021            166,483
                158,845            166,418
                158,426            166,264
                158,396            166,066
                158,371            165,944
                158,298            165,866
                158,250            165,884
                158,057            165,533
                158,011            165,532
                157,899            165,457
                157,894            165,424
                157,839            165,410
                157,731            165,407
                157,629            165,273
                157,626            164,867
                157,581            164,636
                157,471            164,266
                157,430            164,225
                157,287            163,290
                156,289            153,597
                153,970            147,494
                148,244            147,102
                142,907            146,111
                142,811            145,789
                139,171            141,168
                136,314            140,714
                133,616            140,111
                132,881            139,636
                132,729            136,943
                132,680            136,844
                132,248            135,726
                132,027            135,384
                131,929            135,270
                131,766            134,748
                131,667            134,733
                131,576            134,582
                131,396            134,302
                131,351            134,160
                131,135            134,102
                130,885            134,097
                130,854            134,058
                130,767            134,006
                130,666            133,960
                130,647            133,894
                130,152            133,837
                130,006            133,747
                129,921            133,679
                129,856            133,666
                129,377            133,564
                128,366            133,331
                127,988            132,938
                126,903            132,746
 -----------------------------------------------
      sum    10,526,916         10,919,561
  average       150,385            155,994
   stddev        17,551             19,633
 -----------------------------------------------
    elapsed       24.40              43.66
 time (sec)
   sys time      806.25             766.05
      (sec)
    zswpout  10,008,713         10,008,407
  64K folio     623,463            623,629
     swpout
 -----------------------------------------------

As we increase the time for which allocations are maintained,
there seems to be a slight improvement in throughput, but the
variance increases as well. The processes with lower throughput
could be the ones that handle the memcg being over limit by
doing reclaim, possibly before they can allocate.

Interestingly, the longer test time does seem to reduce the amount
of reclaim (hence the lower sys time), but slightly more 64K large
folios seem to be reclaimed (623,629 vs. 623,463). Could this mean that
with the longer test time (sleep 30), more cold memory residing in large
folios is getting reclaimed, as opposed to memory just relinquished by
the exiting processes?
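
As a rough cross-check of the summary rows above, the sketch below
derives the relative spread (stddev / average) for each run, and the
share of zswpout pages that went out as whole 64K folios. It uses only
the totals from the table and assumes 4K base pages (16 pages per 64K
folio) and that zswpout counts pages; the names are illustrative.

```python
# Summary rows copied from the table above (v7, 64K folios, zstd).
runs = {
    "sleep 10": {"average": 150_385, "stddev": 17_551,
                 "zswpout": 10_008_713, "folios_64k": 623_463},
    "sleep 30": {"average": 155_994, "stddev": 19_633,
                 "zswpout": 10_008_407, "folios_64k": 623_629},
}

PAGES_PER_64K_FOLIO = 16  # assumption: 4K base pages


def summarize(run):
    """Return (relative spread, share of zswpout pages in 64K folios)."""
    rel_spread = run["stddev"] / run["average"]
    folio_share = run["folios_64k"] * PAGES_PER_64K_FOLIO / run["zswpout"]
    return rel_spread, folio_share


results = {name: summarize(run) for name, run in runs.items()}
for name, (rel_spread, folio_share) in results.items():
    print(f"{name}: stddev/average = {rel_spread:.1%}, "
          f"pages stored as 64K folios = {folio_share:.2%}")
```

Under these assumptions the relative spread is roughly 11.7% for
sleep 10 and 12.6% for sleep 30, and in both runs over 99% of the
zswpout pages are accounted for by 64K folios.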

Thanks,
Kanchana

> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying Sept. 26, 2024, 12:44 a.m. UTC | #5
"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

>> -----Original Message-----
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Tuesday, September 24, 2024 11:35 PM
>> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
>> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
>> chengming.zhou@linux.dev; usamaarif642@gmail.com;
>> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
>> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
>> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
>> <vinodh.gopal@intel.com>
>> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>> 
>> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
>> 
>> [snip]
>> 
>> >
>> > Case 1: Comparing zswap 4K vs. zswap mTHP
>> > =========================================
>> >
>> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
>> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
>> >
>> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
>> > in 64K/2M (m)THP to not be split, and processed by zswap.
>> >
>> >  64KB mTHP (cgroup memory.high set to 40G):
>> >  ==========================================
>> >
>> >  -------------------------------------------------------------------------------
>> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>> >                                  Baseline
>> >  -------------------------------------------------------------------------------
>> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>> >                                       iaa                     iaa            iaa
>> >  -------------------------------------------------------------------------------
>> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>> >  memcg_high          132,743      169,825     148,075     192,744
>> >  memcg_swap_fail     639,067      841,553       2,204       2,215
>> >  pswpin                    0            0           0           0
>> >  pswpout                   0            0           0           0
>> >  zswpin                  795          873         760         902
>> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>> >  thp_swpout                0            0           0           0
>> >  thp_swpout_               0            0           0           0
>> >   fallback
>> >  64kB-mthp_          639,065      841,553       2,204       2,215
>> >   swpout_fallback
>> >  pgmajfault            2,861        2,924       3,054       3,259
>> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>> >  SWPOUT-64kB               0            0           0           0
>> >  -------------------------------------------------------------------------------
>> >
>> 
>> IIUC, the throughput is the sum of throughput of all usemem processes?
>> 
>> One possible issue of usemem test case is the "imbalance" issue.  That
>> is, some usemem processes may swap-out/swap-in less, so the score is
>> very high; while some other processes may swap-out/swap-in more, so the
>> score is very low.  Sometimes, the total score decreases, but the scores
>> of usemem processes are more balanced, so that the performance should be
>> considered better.  And, in general, we should make usemem score
>> balanced among processes via say longer test time.  Can you check this
>> in your test results?
>
> Actually, the throughput data listed in the cover-letter is the average of
> all the usemem processes. Your observation about the "imbalance" issue is
> right. Some processes see a higher throughput than others. I have noticed
> that the throughputs progressively reduce as the individual processes exit
> and print their stats.
>
> Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> enabled, zswap uses zstd. 
>
>
> -----------------------------------------------
>                sleep 10           sleep 30
>       Throughput (KB/s)  Throughput (KB/s)
>  -----------------------------------------------
>                 181,540            191,686
>                 179,651            191,459
>                 179,068            188,834
>                 177,244            187,568
>                 177,215            186,703
>                 176,565            185,584
>                 176,546            185,370
>                 176,470            185,021
>                 176,214            184,303
>                 176,128            184,040
>                 175,279            183,932
>                 174,745            180,831
>                 173,935            179,418
>                 161,546            168,014
>                 160,332            167,540
>                 160,122            167,364
>                 159,613            167,020
>                 159,546            166,590
>                 159,021            166,483
>                 158,845            166,418
>                 158,426            166,264
>                 158,396            166,066
>                 158,371            165,944
>                 158,298            165,866
>                 158,250            165,884
>                 158,057            165,533
>                 158,011            165,532
>                 157,899            165,457
>                 157,894            165,424
>                 157,839            165,410
>                 157,731            165,407
>                 157,629            165,273
>                 157,626            164,867
>                 157,581            164,636
>                 157,471            164,266
>                 157,430            164,225
>                 157,287            163,290
>                 156,289            153,597
>                 153,970            147,494
>                 148,244            147,102
>                 142,907            146,111
>                 142,811            145,789
>                 139,171            141,168
>                 136,314            140,714
>                 133,616            140,111
>                 132,881            139,636
>                 132,729            136,943
>                 132,680            136,844
>                 132,248            135,726
>                 132,027            135,384
>                 131,929            135,270
>                 131,766            134,748
>                 131,667            134,733
>                 131,576            134,582
>                 131,396            134,302
>                 131,351            134,160
>                 131,135            134,102
>                 130,885            134,097
>                 130,854            134,058
>                 130,767            134,006
>                 130,666            133,960
>                 130,647            133,894
>                 130,152            133,837
>                 130,006            133,747
>                 129,921            133,679
>                 129,856            133,666
>                 129,377            133,564
>                 128,366            133,331
>                 127,988            132,938
>                 126,903            132,746
>  -----------------------------------------------
>       sum    10,526,916         10,919,561
>   average       150,385            155,994
>    stddev        17,551             19,633
>  -----------------------------------------------
>     elapsed       24.40              43.66
>  time (sec)
>    sys time      806.25             766.05
>       (sec)
>     zswpout  10,008,713         10,008,407
>   64K folio     623,463            623,629
>      swpout
>  -----------------------------------------------

Although there is some imbalance, I don't think it is excessive.  So, I
think the test result is reasonable.  Please pay attention to the
imbalance issue in future tests.

> As we increase the time for which allocations are maintained,
> there seems to be a slight improvement in throughput, but the
> variance increases as well. The processes with lower throughput
> could be the ones that handle the memcg being over limit by
> doing reclaim, possibly before they can allocate.
>
> Interestingly, the longer test time does seem to reduce the amount
> of reclaim (hence lower sys time), but more 64K large folios seem to
> be reclaimed. Could this mean that with longer test time (sleep 30),
> more cold memory residing in large folios is getting reclaimed, as
> against memory just relinquished by the exiting processes?

I don't think a longer sleep time in the test helps much with balance.
Can you try with fewer processes and a larger memory size per process?
I guess that this will improve balance.

--
Best Regards,
Huang, Ying
Kanchana P Sridhar Sept. 26, 2024, 3:48 a.m. UTC | #6
Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 5:45 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> 
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> 
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@intel.com>
> >> Sent: Tuesday, September 24, 2024 11:35 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> >> <vinodh.gopal@intel.com>
> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >>
> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
> >> > =========================================
> >> >
> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
> >> >
> >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> >> > in 64K/2M (m)THP to not be split, and processed by zswap.
> >> >
> >> >  64KB mTHP (cgroup memory.high set to 40G):
> >> >  ==========================================
> >> >
> >> >  -------------------------------------------------------------------------------
> >> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >> >                                  Baseline
> >> >  -------------------------------------------------------------------------------
> >> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >> >                                       iaa                     iaa            iaa
> >> >  -------------------------------------------------------------------------------
> >> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
> >> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
> >> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
> >> >  memcg_high          132,743      169,825     148,075     192,744
> >> >  memcg_swap_fail     639,067      841,553       2,204       2,215
> >> >  pswpin                    0            0           0           0
> >> >  pswpout                   0            0           0           0
> >> >  zswpin                  795          873         760         902
> >> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
> >> >  thp_swpout                0            0           0           0
> >> >  thp_swpout_               0            0           0           0
> >> >   fallback
> >> >  64kB-mthp_          639,065      841,553       2,204       2,215
> >> >   swpout_fallback
> >> >  pgmajfault            2,861        2,924       3,054       3,259
> >> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >> >  SWPOUT-64kB               0            0           0           0
> >> >  -------------------------------------------------------------------------------
> >> >
> >>
> >> IIUC, the throughput is the sum of throughput of all usemem processes?
> >>
> >> One possible issue of usemem test case is the "imbalance" issue.  That
> >> is, some usemem processes may swap-out/swap-in less, so the score is
> >> very high; while some other processes may swap-out/swap-in more, so the
> >> score is very low.  Sometimes, the total score decreases, but the scores
> >> of usemem processes are more balanced, so that the performance should be
> >> considered better.  And, in general, we should make usemem score
> >> balanced among processes via say longer test time.  Can you check this
> >> in your test results?
> >
> > Actually, the throughput data listed in the cover-letter is the average of
> > all the usemem processes. Your observation about the "imbalance" issue is
> > right. Some processes see a higher throughput than others. I have noticed
> > that the throughputs progressively reduce as the individual processes exit
> > and print their stats.
> >
> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> > enabled, zswap uses zstd.
> >
> >
> > -----------------------------------------------
> >                sleep 10           sleep 30
> >       Throughput (KB/s)  Throughput (KB/s)
> >  -----------------------------------------------
> >                 181,540            191,686
> >                 179,651            191,459
> >                 179,068            188,834
> >                 177,244            187,568
> >                 177,215            186,703
> >                 176,565            185,584
> >                 176,546            185,370
> >                 176,470            185,021
> >                 176,214            184,303
> >                 176,128            184,040
> >                 175,279            183,932
> >                 174,745            180,831
> >                 173,935            179,418
> >                 161,546            168,014
> >                 160,332            167,540
> >                 160,122            167,364
> >                 159,613            167,020
> >                 159,546            166,590
> >                 159,021            166,483
> >                 158,845            166,418
> >                 158,426            166,264
> >                 158,396            166,066
> >                 158,371            165,944
> >                 158,298            165,866
> >                 158,250            165,884
> >                 158,057            165,533
> >                 158,011            165,532
> >                 157,899            165,457
> >                 157,894            165,424
> >                 157,839            165,410
> >                 157,731            165,407
> >                 157,629            165,273
> >                 157,626            164,867
> >                 157,581            164,636
> >                 157,471            164,266
> >                 157,430            164,225
> >                 157,287            163,290
> >                 156,289            153,597
> >                 153,970            147,494
> >                 148,244            147,102
> >                 142,907            146,111
> >                 142,811            145,789
> >                 139,171            141,168
> >                 136,314            140,714
> >                 133,616            140,111
> >                 132,881            139,636
> >                 132,729            136,943
> >                 132,680            136,844
> >                 132,248            135,726
> >                 132,027            135,384
> >                 131,929            135,270
> >                 131,766            134,748
> >                 131,667            134,733
> >                 131,576            134,582
> >                 131,396            134,302
> >                 131,351            134,160
> >                 131,135            134,102
> >                 130,885            134,097
> >                 130,854            134,058
> >                 130,767            134,006
> >                 130,666            133,960
> >                 130,647            133,894
> >                 130,152            133,837
> >                 130,006            133,747
> >                 129,921            133,679
> >                 129,856            133,666
> >                 129,377            133,564
> >                 128,366            133,331
> >                 127,988            132,938
> >                 126,903            132,746
> >  -----------------------------------------------
> >       sum    10,526,916         10,919,561
> >   average       150,385            155,994
> >    stddev        17,551             19,633
> >  -----------------------------------------------
> >     elapsed       24.40              43.66
> >  time (sec)
> >    sys time      806.25             766.05
> >       (sec)
> >     zswpout  10,008,713         10,008,407
> >   64K folio     623,463            623,629
> >      swpout
> >  -----------------------------------------------
> 
> Although there are some imbalance, I don't find it's too much.  So, I
> think the test result is reasonable.  Please pay attention to the
> imbalance issue in the future tests.

Sure, will do so.

> 
> > As we increase the time for which allocations are maintained,
> > there seems to be a slight improvement in throughput, but the
> > variance increases as well. The processes with lower throughput
> > could be the ones that handle the memcg being over limit by
> > doing reclaim, possibly before they can allocate.
> >
> > Interestingly, the longer test time does seem to reduce the amount
> > of reclaim (hence lower sys time), but more 64K large folios seem to
> > be reclaimed. Could this mean that with longer test time (sleep 30),
> > more cold memory residing in large folios is getting reclaimed, as
> > against memory just relinquished by the exiting processes?
> 
> I don't think longer sleep time in test helps much to balance.  Can you
> try with less process, and larger memory size per process?  I guess that
> this will improve balance.

I tried this, and the data is listed below:

  usemem options:
  ---------------
  30 processes allocate 10G each
  cgroup memory limit = 150G
  sleep 10
  525Gi SSD disk swap partition
  64K large folios enabled      

  Throughput (KB/s) of each of the 30 processes:
 ---------------------------------------------------------------
                      mm-unstable    zswap_store of large folios
                        9-25-2024                v7
 zswap compressor:           zstd         zstd  deflate-iaa
 ---------------------------------------------------------------
                           38,393      234,485      374,427
                           37,283      215,528      314,225
                           37,156      214,942      304,413
                           37,143      213,073      304,146
                           36,814      212,904      290,186
                           36,277      212,304      288,212
                           36,104      212,207      285,682
                           36,000      210,173      270,661
                           35,994      208,487      256,960
                           35,979      207,788      248,313
                           35,967      207,714      235,338
                           35,966      207,703      229,335
                           35,835      207,690      221,697
                           35,793      207,418      221,600
                           35,692      206,160      219,346
                           35,682      206,128      219,162
                           35,681      205,817      219,155
                           35,678      205,546      214,862
                           35,678      205,523      214,710
                           35,677      204,951      214,282
                           35,677      204,283      213,441
                           35,677      203,348      213,011
                           35,675      203,028      212,923
                           35,673      201,922      212,492
                           35,672      201,660      212,225
                           35,672      200,724      211,808
                           35,672      200,324      211,420
                           35,671      199,686      211,413
                           35,667      198,858      211,346
                           35,667      197,590      211,209
 ---------------------------------------------------------------
 sum                     1,081,515    6,217,964    7,268,000
 average                    36,051      207,265      242,267
 stddev                        655        7,010       42,234
 elapsed time (sec)         343.70       107.40        84.34
 sys time (sec)             269.30     2,520.13     1,696.20
 memcg.high breaches       443,672      475,074      623,333
 zswpout                    22,605   48,931,249   54,777,100
 pswpout                40,004,528            0            0
 hugepages-64K zswpout           0    3,057,090    3,421,855
 hugepages-64K swpout    2,500,283            0            0
 ---------------------------------------------------------------
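
As a quick consistency check on the two v7 columns above, the sketch below estimates what fraction of the zswpout pages were stored as part of a 64K folio. It assumes that zswpout counts 4K pages (so one 64K folio contributes 16 pages); the dictionary keys and the helper name are illustrative only and are not taken from the patch-series.

```python
# Assumption: zswpout counts 4 KiB pages, so one 64 KiB folio is 16 pages.
PAGES_PER_64K_FOLIO = 64 // 4

# Counters copied from the v7 columns of the usemem30-10G table.
runs = {
    "zstd": {"zswpout": 48_931_249, "zswpout_64k": 3_057_090},
    "deflate-iaa": {"zswpout": 54_777_100, "zswpout_64k": 3_421_855},
}


def large_folio_share(zswpout, zswpout_64k):
    """Fraction of zswpout pages that were stored as part of a 64K folio."""
    return zswpout_64k * PAGES_PER_64K_FOLIO / zswpout


shares = {name: large_folio_share(**counters) for name, counters in runs.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%} of zswpout pages came from 64K folios")
```

Under that assumption, nearly all of the pages stored in zswap in these two runs arrived as whole 64K folios rather than as split 4K pages.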

As you can see, this is quite a memory-constrained scenario: the cgroup
in which the 30 processes run is limited to 50% of the total memory they
require (150G of 300G). This causes significantly more reclaim activity
than the setup I was using thus far (70 processes, 1G each, 40G limit).
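
The arithmetic behind the two setups can be sketched as follows. The helper name is hypothetical; the process counts, per-process sizes and cgroup limits are the ones quoted in this thread.

```python
def memory_limit_fraction(num_procs, gib_per_proc, limit_gib):
    """Fraction of the total requested memory that the cgroup limit allows."""
    return limit_gib / (num_procs * gib_per_proc)


# usemem70: 70 processes x 1G each, 40G cgroup limit.
usemem70 = memory_limit_fraction(70, 1, 40)
# usemem30-10G: 30 processes x 10G each, 150G cgroup limit.
usemem30 = memory_limit_fraction(30, 10, 150)

print(f"usemem70:     limit covers {usemem70:.0%} of requested memory")
print(f"usemem30-10G: limit covers {usemem30:.0%} of requested memory")
```

The relative limits are fairly close, but the absolute amount that has to be reclaimed is much larger in the new setup (150G out of 300G, versus 30G out of 70G).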

The variance or "imbalance" reduces somewhat for zstd, but not for IAA.
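
One way to compare "imbalance" across runs with very different averages is the coefficient of variation (stddev divided by average). This is only a sketch using the summary rows already posted in this thread; the metric choice and the names are my assumption, not something usemem reports.

```python
def coefficient_of_variation(average, stddev):
    """Relative spread of per-process throughput: stddev / average."""
    return stddev / average


# (average KB/s, stddev KB/s) taken from the summary rows in this thread.
runs = {
    "usemem70 v7 zstd (sleep 10)": (150_385, 17_551),
    "usemem30 mm-unstable zstd": (36_051, 655),
    "usemem30 v7 zstd": (207_265, 7_010),
    "usemem30 v7 deflate-iaa": (242_267, 42_234),
}

cv = {name: coefficient_of_variation(avg, sd) for name, (avg, sd) in runs.items()}
for name, value in cv.items():
    print(f"{name:30s} CV = {value:.1%}")
```

By this measure the v7 zstd run is more balanced with 30 processes than with 70, while the deflate-iaa run has the widest relative spread of the four.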

With zswap_store of large folios, IAA shows really good improvement
over zstd in throughput (17%), elapsed time (21%) and sys time (33%).
These are the memory-constrained scenarios in which IAA typically
does really well. IAA verify_compress is enabled, so we also get the
added benefit of data integrity checks with IAA.
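
For reference, the percentages quoted here can be reproduced from the v7 zstd and deflate-iaa columns of the table with the sketch below. The function names are illustrative; throughput is treated as higher-is-better and the two time metrics as lower-is-better.

```python
def gain_higher_is_better(baseline, new):
    """Relative improvement for metrics such as throughput."""
    return new / baseline - 1


def gain_lower_is_better(baseline, new):
    """Relative improvement for metrics such as elapsed or sys time."""
    return (baseline - new) / baseline


# v7 zstd is the baseline, v7 deflate-iaa is the comparison point.
throughput_gain = gain_higher_is_better(207_265, 242_267)
elapsed_gain = gain_lower_is_better(107.40, 84.34)
sys_gain = gain_lower_is_better(2_520.13, 1_696.20)

print(f"throughput:   {throughput_gain:.0%}")
print(f"elapsed time: {elapsed_gain:.0%}")
print(f"sys time:     {sys_gain:.0%}")
```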

I would like to get your and the maintainers' feedback on whether
I should switch to this "usemem30-10G" setup for v8?

Thanks,
Kanchana

> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying Sept. 26, 2024, 6:47 a.m. UTC | #7
"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

> Hi Ying,
>
>> -----Original Message-----
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Wednesday, September 25, 2024 5:45 PM
>> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
>> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
>> chengming.zhou@linux.dev; usamaarif642@gmail.com;
>> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
>> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
>> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
>> <vinodh.gopal@intel.com>
>> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>> 
>> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
>> 
>> >> -----Original Message-----
>> >> From: Huang, Ying <ying.huang@intel.com>
>> >> Sent: Tuesday, September 24, 2024 11:35 PM
>> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
>> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
>> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
>> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
>> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
>> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
>> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
>> >> <vinodh.gopal@intel.com>
>> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>> >>
>> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
>> >>
>> >> [snip]
>> >>
>> >> >
>> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
>> >> > =========================================
>> >> >
>> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
>> >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
>> >> >
>> >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
>> >> > in 64K/2M (m)THP to not be split, and processed by zswap.
>> >> >
>> >> >  64KB mTHP (cgroup memory.high set to 40G):
>> >> >  ==========================================
>> >> >
>> >> >  -------------------------------------------------------------------------------
>> >> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>> >> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>> >> >                                  Baseline
>> >> >  -------------------------------------------------------------------------------
>> >> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>> >> >                                       iaa                     iaa            iaa
>> >> >  -------------------------------------------------------------------------------
>> >> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>> >> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>> >> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>> >> >  memcg_high          132,743      169,825     148,075     192,744
>> >> >  memcg_swap_fail     639,067      841,553       2,204       2,215
>> >> >  pswpin                    0            0           0           0
>> >> >  pswpout                   0            0           0           0
>> >> >  zswpin                  795          873         760         902
>> >> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>> >> >  thp_swpout                0            0           0           0
>> >> >  thp_swpout_               0            0           0           0
>> >> >   fallback
>> >> >  64kB-mthp_          639,065      841,553       2,204       2,215
>> >> >   swpout_fallback
>> >> >  pgmajfault            2,861        2,924       3,054       3,259
>> >> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>> >> >  SWPOUT-64kB               0            0           0           0
>> >> >  -------------------------------------------------------------------------------
>> >> >
>> >>
>> >> IIUC, the throughput is the sum of throughput of all usemem processes?
>> >>
>> >> One possible issue of usemem test case is the "imbalance" issue.  That
>> >> is, some usemem processes may swap-out/swap-in less, so the score is
>> >> very high; while some other processes may swap-out/swap-in more, so the
>> >> score is very low.  Sometimes, the total score decreases, but the scores
>> >> of usemem processes are more balanced, so that the performance should be
>> >> considered better.  And, in general, we should make usemem score
>> >> balanced among processes via say longer test time.  Can you check this
>> >> in your test results?
>> >
>> > Actually, the throughput data listed in the cover-letter is the average of
>> > all the usemem processes. Your observation about the "imbalance" issue is
>> > right. Some processes see a higher throughput than others. I have noticed
>> > that the throughputs progressively reduce as the individual processes exit
>> > and print their stats.
>> >
>> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
>> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
>> > enabled, zswap uses zstd.
>> >
>> >
>> > -----------------------------------------------
>> >                sleep 10           sleep 30
>> >       Throughput (KB/s)  Throughput (KB/s)
>> >  -----------------------------------------------
>> >                 181,540            191,686
>> >                 179,651            191,459
>> >                 179,068            188,834
>> >                 177,244            187,568
>> >                 177,215            186,703
>> >                 176,565            185,584
>> >                 176,546            185,370
>> >                 176,470            185,021
>> >                 176,214            184,303
>> >                 176,128            184,040
>> >                 175,279            183,932
>> >                 174,745            180,831
>> >                 173,935            179,418
>> >                 161,546            168,014
>> >                 160,332            167,540
>> >                 160,122            167,364
>> >                 159,613            167,020
>> >                 159,546            166,590
>> >                 159,021            166,483
>> >                 158,845            166,418
>> >                 158,426            166,264
>> >                 158,396            166,066
>> >                 158,371            165,944
>> >                 158,298            165,866
>> >                 158,250            165,884
>> >                 158,057            165,533
>> >                 158,011            165,532
>> >                 157,899            165,457
>> >                 157,894            165,424
>> >                 157,839            165,410
>> >                 157,731            165,407
>> >                 157,629            165,273
>> >                 157,626            164,867
>> >                 157,581            164,636
>> >                 157,471            164,266
>> >                 157,430            164,225
>> >                 157,287            163,290
>> >                 156,289            153,597
>> >                 153,970            147,494
>> >                 148,244            147,102
>> >                 142,907            146,111
>> >                 142,811            145,789
>> >                 139,171            141,168
>> >                 136,314            140,714
>> >                 133,616            140,111
>> >                 132,881            139,636
>> >                 132,729            136,943
>> >                 132,680            136,844
>> >                 132,248            135,726
>> >                 132,027            135,384
>> >                 131,929            135,270
>> >                 131,766            134,748
>> >                 131,667            134,733
>> >                 131,576            134,582
>> >                 131,396            134,302
>> >                 131,351            134,160
>> >                 131,135            134,102
>> >                 130,885            134,097
>> >                 130,854            134,058
>> >                 130,767            134,006
>> >                 130,666            133,960
>> >                 130,647            133,894
>> >                 130,152            133,837
>> >                 130,006            133,747
>> >                 129,921            133,679
>> >                 129,856            133,666
>> >                 129,377            133,564
>> >                 128,366            133,331
>> >                 127,988            132,938
>> >                 126,903            132,746
>> >  -----------------------------------------------
>> >       sum    10,526,916         10,919,561
>> >   average       150,385            155,994
>> >    stddev        17,551             19,633
>> >  -----------------------------------------------
>> >     elapsed       24.40              43.66
>> >  time (sec)
>> >    sys time      806.25             766.05
>> >       (sec)
>> >     zswpout  10,008,713         10,008,407
>> >   64K folio     623,463            623,629
>> >      swpout
>> >  -----------------------------------------------
>> 
>> Although there are some imbalance, I don't find it's too much.  So, I
>> think the test result is reasonable.  Please pay attention to the
>> imbalance issue in the future tests.
>
> Sure, will do so.
>
>> 
>> > As we increase the time for which allocations are maintained,
>> > there seems to be a slight improvement in throughput, but the
>> > variance increases as well. The processes with lower throughput
>> > could be the ones that handle the memcg being over limit by
>> > doing reclaim, possibly before they can allocate.
>> >
>> > Interestingly, the longer test time does seem to reduce the amount
>> > of reclaim (hence lower sys time), but more 64K large folios seem to
>> > be reclaimed. Could this mean that with longer test time (sleep 30),
>> > more cold memory residing in large folios is getting reclaimed, as
>> > against memory just relinquished by the exiting processes?
>> 
>> I don't think a longer sleep time in the test helps much with balance.  Can
>> you try with fewer processes, and a larger memory size per process?  I guess
>> that this will improve balance.
>
> I tried this, and the data is listed below:
>
>   usemem options:
>   ---------------
>   30 processes allocate 10G each
>   cgroup memory limit = 150G
>   sleep 10
>   525Gi SSD disk swap partition
>   64K large folios enabled      
>
>   Throughput (KB/s) of each of the 30 processes:
>  ---------------------------------------------------------------
>                       mm-unstable    zswap_store of large folios
>                         9-25-2024                v7
>  zswap compressor:           zstd         zstd  deflate-iaa
>  ---------------------------------------------------------------
>                            38,393      234,485      374,427
>                            37,283      215,528      314,225
>                            37,156      214,942      304,413
>                            37,143      213,073      304,146
>                            36,814      212,904      290,186
>                            36,277      212,304      288,212
>                            36,104      212,207      285,682
>                            36,000      210,173      270,661
>                            35,994      208,487      256,960
>                            35,979      207,788      248,313
>                            35,967      207,714      235,338
>                            35,966      207,703      229,335
>                            35,835      207,690      221,697
>                            35,793      207,418      221,600
>                            35,692      206,160      219,346
>                            35,682      206,128      219,162
>                            35,681      205,817      219,155
>                            35,678      205,546      214,862
>                            35,678      205,523      214,710
>                            35,677      204,951      214,282
>                            35,677      204,283      213,441
>                            35,677      203,348      213,011
>                            35,675      203,028      212,923
>                            35,673      201,922      212,492
>                            35,672      201,660      212,225
>                            35,672      200,724      211,808
>                            35,672      200,324      211,420
>                            35,671      199,686      211,413
>                            35,667      198,858      211,346
>                            35,667      197,590      211,209
>  ---------------------------------------------------------------
>  sum                     1,081,515    6,217,964    7,268,000
>  average                    36,051      207,265      242,267
>  stddev                        655        7,010       42,234
>  elapsed time (sec)         343.70       107.40        84.34
>  sys time (sec)             269.30     2,520.13     1,696.20
>  memcg.high breaches       443,672      475,074      623,333
>  zswpout                    22,605   48,931,249   54,777,100
>  pswpout                40,004,528            0            0
>  hugepages-64K zswpout           0    3,057,090    3,421,855
>  hugepages-64K swpout    2,500,283            0            0
>  ---------------------------------------------------------------
>
> As you can see, this is quite a memory-constrained scenario, where we
> are giving 50% of the total memory required as the memory limit for the
> cgroup in which the 30 processes are run. This causes significantly more
> reclaim activity than the setup I was using thus far (70 processes, 1G,
> 40G limit).
>
> The variance or "imbalance" reduces somewhat for zstd, but not for IAA.
>
> IAA shows really good throughput (17%) and elapsed time (21%) and
> sys time (33%) improvement wrt zstd with zswap_store of large folios.
> These are the memory-constrained scenarios in which IAA typically
> does really well. IAA verify_compress is enabled, so we also get the
> benefit of added data integrity checks with IAA.
>
> I would like to get your and the maintainers' feedback on whether
> I should switch to this "usemem30-10G" setup for v8?

The results look good to me.  I suggest you use it.

--
Best Regards,
Huang, Ying
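
One way to put a number on the "imbalance" discussed above is the
coefficient of variation (stddev divided by average) of the per-process
throughput. The sketch below is not part of the original thread: it only
reuses the "average" and "stddev" rows from the quoted tables, and the run
labels and the helper name coefficient_of_variation are made up for
illustration.

```python
# Summary rows (average, stddev) in KB/s, copied from the tables quoted
# in this thread. The dictionary keys are illustrative labels only.
runs = {
    "usemem70, v7 zstd, sleep 10": (150_385, 17_551),
    "usemem70, v7 zstd, sleep 30": (155_994, 19_633),
    "usemem30-10G, mm-unstable zstd": (36_051, 655),
    "usemem30-10G, v7 zstd": (207_265, 7_010),
    "usemem30-10G, v7 deflate-iaa": (242_267, 42_234),
}


def coefficient_of_variation(average, stddev):
    """Return stddev as a percentage of the average (hypothetical helper)."""
    return 100.0 * stddev / average


# Lower values mean the per-process throughputs are more balanced.
cv = {name: coefficient_of_variation(avg, sd) for name, (avg, sd) in runs.items()}

for name, value in cv.items():
    print(f"{name:32s} {value:5.1f}%")
```

With these inputs the ratio comes out at roughly 12% for the two
70-process zstd runs, about 3% for usemem30-10G with zstd, and about 17%
for usemem30-10G with deflate-iaa, which is consistent with the remark
that the imbalance reduces for zstd but not for IAA.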
Kanchana P Sridhar Sept. 26, 2024, 9:44 p.m. UTC | #8
> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 11:48 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> 
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> 
> > Hi Ying,
> >
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@intel.com>
> >> Sent: Wednesday, September 25, 2024 5:45 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> >> <vinodh.gopal@intel.com>
> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >>
> >> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> >>
> >> >> -----Original Message-----
> >> >> From: Huang, Ying <ying.huang@intel.com>
> >> >> Sent: Tuesday, September 24, 2024 11:35 PM
> >> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> >> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> >> >> <vinodh.gopal@intel.com>
> >> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >> >>
> >> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> >> >>
> >> >> [snip]
> >> >>
> >> >> >
> >> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
> >> >> > =========================================
> >> >> >
> >> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> >> >> > in 64K/2M (m)THP being split into 4K folios that get processed by zswap.
> >> >> >
> >> >> > The "after" is CONFIG_THP_SWAP set to on, plus this patch-series, which
> >> >> > results in 64K/2M (m)THP not being split, and being processed by zswap.
> >> >> >
> >> >> >  64KB mTHP (cgroup memory.high set to 40G):
> >> >> >  ==========================================
> >> >> >
> >> >> >  -------------------------------------------------------------------------------
> >> >> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >> >> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >> >> >                                  Baseline
> >> >> >  -------------------------------------------------------------------------------
> >> >> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >> >> >                                       iaa                     iaa            iaa
> >> >> >  -------------------------------------------------------------------------------
> >> >> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
> >> >> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
> >> >> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
> >> >> >  memcg_high          132,743      169,825     148,075     192,744
> >> >> >  memcg_swap_fail     639,067      841,553       2,204       2,215
> >> >> >  pswpin                    0            0           0           0
> >> >> >  pswpout                   0            0           0           0
> >> >> >  zswpin                  795          873         760         902
> >> >> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
> >> >> >  thp_swpout                0            0           0           0
> >> >> >  thp_swpout_               0            0           0           0
> >> >> >   fallback
> >> >> >  64kB-mthp_          639,065      841,553       2,204       2,215
> >> >> >   swpout_fallback
> >> >> >  pgmajfault            2,861        2,924       3,054       3,259
> >> >> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >> >> >  SWPOUT-64kB               0            0           0           0
> >> >> >  -------------------------------------------------------------------------------
> >> >> >
> >> >>
> >> >> IIUC, the throughput is the sum of throughput of all usemem processes?
> >> >>
> >> >> One possible issue of usemem test case is the "imbalance" issue.  That
> >> >> is, some usemem processes may swap-out/swap-in less, so the score is
> >> >> very high; while some other processes may swap-out/swap-in more, so the
> >> >> score is very low.  Sometimes, the total score decreases, but the scores
> >> >> of usemem processes are more balanced, so that the performance should be
> >> >> considered better.  And, in general, we should make usemem score
> >> >> balanced among processes via say longer test time.  Can you check this
> >> >> in your test results?
> >> >
> >> > Actually, the throughput data listed in the cover-letter is the average of
> >> > all the usemem processes. Your observation about the "imbalance" issue is
> >> > right. Some processes see a higher throughput than others. I have noticed
> >> > that the throughputs progressively reduce as the individual processes exit
> >> > and print their stats.
> >> >
> >> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> >> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> >> > enabled, zswap uses zstd.
> >> >
> >> >
> >> > -----------------------------------------------
> >> >                sleep 10           sleep 30
> >> >       Throughput (KB/s)  Throughput (KB/s)
> >> >  -----------------------------------------------
> >> >                 181,540            191,686
> >> >                 179,651            191,459
> >> >                 179,068            188,834
> >> >                 177,244            187,568
> >> >                 177,215            186,703
> >> >                 176,565            185,584
> >> >                 176,546            185,370
> >> >                 176,470            185,021
> >> >                 176,214            184,303
> >> >                 176,128            184,040
> >> >                 175,279            183,932
> >> >                 174,745            180,831
> >> >                 173,935            179,418
> >> >                 161,546            168,014
> >> >                 160,332            167,540
> >> >                 160,122            167,364
> >> >                 159,613            167,020
> >> >                 159,546            166,590
> >> >                 159,021            166,483
> >> >                 158,845            166,418
> >> >                 158,426            166,264
> >> >                 158,396            166,066
> >> >                 158,371            165,944
> >> >                 158,298            165,866
> >> >                 158,250            165,884
> >> >                 158,057            165,533
> >> >                 158,011            165,532
> >> >                 157,899            165,457
> >> >                 157,894            165,424
> >> >                 157,839            165,410
> >> >                 157,731            165,407
> >> >                 157,629            165,273
> >> >                 157,626            164,867
> >> >                 157,581            164,636
> >> >                 157,471            164,266
> >> >                 157,430            164,225
> >> >                 157,287            163,290
> >> >                 156,289            153,597
> >> >                 153,970            147,494
> >> >                 148,244            147,102
> >> >                 142,907            146,111
> >> >                 142,811            145,789
> >> >                 139,171            141,168
> >> >                 136,314            140,714
> >> >                 133,616            140,111
> >> >                 132,881            139,636
> >> >                 132,729            136,943
> >> >                 132,680            136,844
> >> >                 132,248            135,726
> >> >                 132,027            135,384
> >> >                 131,929            135,270
> >> >                 131,766            134,748
> >> >                 131,667            134,733
> >> >                 131,576            134,582
> >> >                 131,396            134,302
> >> >                 131,351            134,160
> >> >                 131,135            134,102
> >> >                 130,885            134,097
> >> >                 130,854            134,058
> >> >                 130,767            134,006
> >> >                 130,666            133,960
> >> >                 130,647            133,894
> >> >                 130,152            133,837
> >> >                 130,006            133,747
> >> >                 129,921            133,679
> >> >                 129,856            133,666
> >> >                 129,377            133,564
> >> >                 128,366            133,331
> >> >                 127,988            132,938
> >> >                 126,903            132,746
> >> >  -----------------------------------------------
> >> >       sum    10,526,916         10,919,561
> >> >   average       150,385            155,994
> >> >    stddev        17,551             19,633
> >> >  -----------------------------------------------
> >> >     elapsed       24.40              43.66
> >> >  time (sec)
> >> >    sys time      806.25             766.05
> >> >       (sec)
> >> >     zswpout  10,008,713         10,008,407
> >> >   64K folio     623,463            623,629
> >> >      swpout
> >> >  -----------------------------------------------
> >>
> >> Although there is some imbalance, I don't find it excessive.  So, I
> >> think the test result is reasonable.  Please pay attention to the
> >> imbalance issue in future tests.
> >
> > Sure, will do so.
> >
> >>
> >> > As we increase the time for which allocations are maintained,
> >> > there seems to be a slight improvement in throughput, but the
> >> > variance increases as well. The processes with lower throughput
> >> > could be the ones that handle the memcg being over limit by
> >> > doing reclaim, possibly before they can allocate.
> >> >
> >> > Interestingly, the longer test time does seem to reduce the amount
> >> > of reclaim (hence lower sys time), but more 64K large folios seem to
> >> > be reclaimed. Could this mean that with longer test time (sleep 30),
> >> > more cold memory residing in large folios is getting reclaimed, as
> >> > against memory just relinquished by the exiting processes?
> >>
> >> I don't think a longer sleep time in the test helps much with balance.  Can
> >> you try with fewer processes, and a larger memory size per process?  I guess
> >> that this will improve balance.
> >
> > I tried this, and the data is listed below:
> >
> >   usemem options:
> >   ---------------
> >   30 processes allocate 10G each
> >   cgroup memory limit = 150G
> >   sleep 10
> >   525Gi SSD disk swap partition
> >   64K large folios enabled
> >
> >   Throughput (KB/s) of each of the 30 processes:
> >  ---------------------------------------------------------------
> >                       mm-unstable    zswap_store of large folios
> >                         9-25-2024                v7
> >  zswap compressor:           zstd         zstd  deflate-iaa
> >  ---------------------------------------------------------------
> >                            38,393      234,485      374,427
> >                            37,283      215,528      314,225
> >                            37,156      214,942      304,413
> >                            37,143      213,073      304,146
> >                            36,814      212,904      290,186
> >                            36,277      212,304      288,212
> >                            36,104      212,207      285,682
> >                            36,000      210,173      270,661
> >                            35,994      208,487      256,960
> >                            35,979      207,788      248,313
> >                            35,967      207,714      235,338
> >                            35,966      207,703      229,335
> >                            35,835      207,690      221,697
> >                            35,793      207,418      221,600
> >                            35,692      206,160      219,346
> >                            35,682      206,128      219,162
> >                            35,681      205,817      219,155
> >                            35,678      205,546      214,862
> >                            35,678      205,523      214,710
> >                            35,677      204,951      214,282
> >                            35,677      204,283      213,441
> >                            35,677      203,348      213,011
> >                            35,675      203,028      212,923
> >                            35,673      201,922      212,492
> >                            35,672      201,660      212,225
> >                            35,672      200,724      211,808
> >                            35,672      200,324      211,420
> >                            35,671      199,686      211,413
> >                            35,667      198,858      211,346
> >                            35,667      197,590      211,209
> >  ---------------------------------------------------------------
> >  sum                     1,081,515    6,217,964    7,268,000
> >  average                    36,051      207,265      242,267
> >  stddev                        655        7,010       42,234
> >  elapsed time (sec)         343.70       107.40        84.34
> >  sys time (sec)             269.30     2,520.13     1,696.20
> >  memcg.high breaches       443,672      475,074      623,333
> >  zswpout                    22,605   48,931,249   54,777,100
> >  pswpout                40,004,528            0            0
> >  hugepages-64K zswpout           0    3,057,090    3,421,855
> >  hugepages-64K swpout    2,500,283            0            0
> >  ---------------------------------------------------------------
> >
> > As you can see, this is quite a memory-constrained scenario, where we
> > are giving 50% of the total memory required as the memory limit for the
> > cgroup in which the 30 processes are run. This causes significantly more
> > reclaim activity than the setup I was using thus far (70 processes, 1G,
> > 40G limit).
> >
> > The variance or "imbalance" reduces somewhat for zstd, but not for IAA.
> >
> > IAA shows really good throughput (17%) and elapsed time (21%) and
> > sys time (33%) improvement wrt zstd with zswap_store of large folios.
> > These are the memory-constrained scenarios in which IAA typically
> > does really well. IAA verify_compress is enabled, so we also get the
> > benefit of added data integrity checks with IAA.
> >
> > I would like to get your and the maintainers' feedback on whether
> > I should switch to this "usemem30-10G" setup for v8?
> 
> The results look good to me.  I suggest you use it.

Ok, sure, thanks Ying.

Thanks,
Kanchana

> 
> --
> Best Regards,
> Huang, Ying
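
The IAA vs. zstd percentages quoted above (17% throughput, 21% elapsed
time, 33% sys time) and the "50% of total memory required" figure can be
reproduced from the usemem30-10G setup and summary rows. This is a minimal
sketch and not part of the original thread: the dictionary layout and the
function names pct_increase and pct_reduction are illustrative
assumptions, and it assumes the percentages are plain relative changes
with zstd as the reference.

```python
# Summary rows for "zswap_store of large folios v7" from the usemem30-10G
# table quoted in this thread.
zstd = {"throughput_kbps": 207_265, "elapsed_sec": 107.40, "sys_sec": 2520.13}
iaa = {"throughput_kbps": 242_267, "elapsed_sec": 84.34, "sys_sec": 1696.20}


def pct_increase(new, old):
    """Relative gain of `new` over `old`, in percent (higher is better)."""
    return 100.0 * (new - old) / old


def pct_reduction(new, old):
    """Relative drop from `old` to `new`, in percent (lower is better)."""
    return 100.0 * (old - new) / old


improvements = {
    "throughput": pct_increase(iaa["throughput_kbps"], zstd["throughput_kbps"]),
    "elapsed time": pct_reduction(iaa["elapsed_sec"], zstd["elapsed_sec"]),
    "sys time": pct_reduction(iaa["sys_sec"], zstd["sys_sec"]),
}

# Memory pressure: 30 processes allocating 10G each, under a 150G limit.
processes, gib_per_process, limit_gib = 30, 10, 150
limit_fraction = 100.0 * limit_gib / (processes * gib_per_process)

for metric, value in improvements.items():
    print(f"{metric:14s} {value:5.1f}% better with deflate-iaa")
print(f"cgroup limit is {limit_fraction:.0f}% of the memory requested")
```

Rounded to whole percentages, these inputs give the 17%, 21% and 33%
figures quoted in the thread, and the limit works out to 50% of the 300G
that the 30 processes request in total.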