Message ID | 20240819021621.29125-1-kanchana.p.sridhar@intel.com (mailing list archive) |
---|---|
Series | mm: ZSWAP swap-out of mTHP folios |
Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:

[snip]

>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> ZSWAP. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed. Following a similar methodology as in Ryan Roberts'
> "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> run, each allocating and writing 1G of memory:
>
>     usemem --init-time -w -O -n 70 1g
>
> Since I was constrained to get the 70 usemem processes to generate
> swapout activity with the 4G SSD, I ended up using different cgroup
> memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
>
> 64K mTHP experiments: cgroup memory fixed at 60G
> 2M THP experiments  : cgroup memory fixed at 55G
>
> The vm/sysfs stats included after the performance data provide details
> on the swapout activity to SSD/ZSWAP.
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressor  : LZ4, DEFLATE-IAA
>     ZSWAP Allocator   : ZSMALLOC
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput reported by usemem and perf sys time for running the test
> are as follows, averaged across 3 runs:
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
>  ------------------------------------------------------------------
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> |                    |                   |       KB/s |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |

zswap throughput is worse than ssd swap?  This doesn't look right.

> |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> |                    |                   |        sec |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
>  ------------------------------------------------------------------
>
>  -----------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> |                              |   mainline |       Store |       Store |
> |                              |            |         lz4 | deflate-iaa |
> |-----------------------------------------------------------------------|
> | pswpin                       |          0 |           0 |           0 |
> | pswpout                      |    174,432 |           0 |           0 |
> | zswpin                       |        703 |         534 |         721 |
> | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |

It appears that the number of swapped pages for zswap is much larger
than that of SSD swap.  Why?  I guess this is why zswap throughput is
worse.

> |-----------------------------------------------------------------------|
> | thp_swpout                   |          0 |           0 |           0 |
> | thp_swpout_fallback          |          0 |           0 |           0 |
> | pgmajfault                   |      3,364 |       3,650 |       3,431 |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
>  -----------------------------------------------------------------------
>

[snip]

--
Best Regards,
Huang, Ying
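The "Improvement" column in the quoted tables can be recomputed from the raw numbers. The sketch below is illustrative only: `improvement_pct` is a hypothetical helper, and it assumes the convention that higher throughput and lower sys time both count as improvements.

```python
def improvement_pct(baseline: float, value: float, higher_is_better: bool) -> float:
    """Percent improvement of `value` relative to `baseline`."""
    change = (value - baseline) / baseline * 100.0
    # For "lower is better" metrics (sys time), an increase is a regression.
    return change if higher_is_better else -change


# Numbers copied from the 64KB mTHP tables quoted above.
throughput_kbps = {"ssd": 335_346, "lz4": 271_558, "deflate-iaa": 388_154}
sys_time_sec = {"ssd": 91.37, "lz4": 265.43, "deflate-iaa": 235.60}

for name in ("lz4", "deflate-iaa"):
    t = improvement_pct(throughput_kbps["ssd"], throughput_kbps[name], True)
    s = improvement_pct(sys_time_sec["ssd"], sys_time_sec[name], False)
    print(f"{name}: throughput {t:+.1f}%, sys time {s:+.1f}%")
```

The results should agree with the table entries (-19%, 16%, -191%, -158%) up to rounding.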
Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Sunday, August 18, 2024 8:17 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
>
> [snip]
>
> > |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> > |                    |                   |       KB/s |            |
> > |--------------------|-------------------|------------|------------|
> > |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> > |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
>
> zswap throughput is worse than ssd swap?  This doesn't look right.

I realize it might look that way; however, this is not an apples-to-apples
comparison, as explained in the latter part of my analysis (after the 2M THP
data tables). The primary reason is that the test runs under a fixed cgroup
memory limit.

In the "Before" scenario, mTHPs get swapped out to SSD. However, disk swap
usage is not counted when checking whether the cgroup's memory limit has been
exceeded. Hence there are relatively fewer swap-outs, resulting mainly from
the 1G allocations from each of the 70 usemem processes working with a 60G
memory limit on the parent cgroup.

However, the picture changes in the "After" scenario. mTHPs now get stored in
zswap, which is accounted for in the cgroup's memory.current and counts
towards the fixed memory limit in effect for the parent cgroup. As a result,
when mTHPs get stored in zswap, the compressed mTHP data in the zswap zpool
now counts towards the cgroup's active memory and memory limit. This is in
addition to the 1G allocations from each of the 70 processes.

As you can see, this creates more memory pressure on the cgroup, resulting in
more swap-outs. With lz4 as the zswap compressor, this results in lower
throughput than "Before".

However, with IAA as the zswap compressor, the throughput with zswap mTHP is
better than "Before" because of better hardware compress latencies, which
handle the higher swap-out activity without compromising on throughput.

> [snip]
>
> > | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> > |                              |   mainline |       Store |       Store |
> > |                              |            |         lz4 | deflate-iaa |
> > |-----------------------------------------------------------------------|
> > | pswpin                       |          0 |           0 |           0 |
> > | pswpout                      |    174,432 |           0 |           0 |
> > | zswpin                       |        703 |         534 |         721 |
> > | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
>
> It appears that the number of swapped pages for zswap is much larger
> than that of SSD swap.  Why?  I guess this is why zswap throughput is
> worse.

Your observation is correct. I hope the above explanation helps as to the
reasoning behind this.

Thanks,
Kanchana

> [snip]
>
> --
> Best Regards,
> Huang, Ying

"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

> Hi Ying,
>
> [snip]
>
>> > | pswpout                      |    174,432 |           0 |           0 |
>> > | zswpin                       |        703 |         534 |         721 |
>> > | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
>>
>> It appears that the number of swapped pages for zswap is much larger
>> than that of SSD swap.  Why?  I guess this is why zswap throughput is
>> worse.
>
> Your observation is correct. I hope the above explanation helps as to the
> reasoning behind this.

Before:
(174432 + 1501) * 4 / 1024 = 687.2 MB

After:
1491654 * 4.0 / 1024 = 5826.8 MB

From your previous words, 10GB memory should be swapped out.

Even if the average compression ratio is 0, the swap-out count of zswap
should be about 100% more than that of SSD.  However, the ratio here
appears unreasonable.

--
Best Regards,
Huang, Ying

> Thanks,
> Kanchana
>
> [snip]
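The page-count conversion above can be reproduced in a few lines, together with one simple way to reason about the expected swap-out count. This is an illustrative sketch, not necessarily the model the reviewer had in mind: it assumes 4 KiB pages, and `pages_needed` is a hypothetical helper built on the simplifying assumption that a page written to SSD frees a full page of cgroup-charged memory, while a page stored in zswap frees only (page size - compressed size).

```python
PAGE_BYTES = 4096  # assumed base page size


def pages_to_mib(pages: int) -> float:
    """Convert a count of 4 KiB pages to MiB."""
    return pages * PAGE_BYTES / (1024 * 1024)


def pages_needed(bytes_to_free: int, compressed_bytes_per_page: int) -> int:
    """Pages that must be swapped out to free `bytes_to_free` of charged memory.

    compressed_bytes_per_page = 0 models SSD swap (nothing stays charged);
    a non-zero value models zswap, where the compressed copy stays charged.
    """
    net_freed = PAGE_BYTES - compressed_bytes_per_page
    if net_freed <= 0:
        raise ValueError("compressed page must be smaller than a page")
    return -(-bytes_to_free // net_freed)  # ceiling division


before_mib = pages_to_mib(174_432 + 1_501)  # pswpout + zswpout, mainline
after_mib = pages_to_mib(1_491_654)         # zswpout, zswap-mTHP lz4
print(f"before: {before_mib:.1f} MiB, after: {after_mib:.1f} MiB")

# Under this toy model, pages compressing to half their size need twice as
# many swap-outs as SSD swap to free the same amount of charged memory.
target = 1024 * PAGE_BYTES
print(pages_needed(target, 0), pages_needed(target, PAGE_BYTES // 2))
```

Under this model a 2x increase would be the expected order of magnitude for a 50% compressed size, which is why the much larger measured ratio prompted the follow-up investigation.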
Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Sunday, August 18, 2024 10:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
>
> [snip]
>
> Before:
> (174432 + 1501) * 4 / 1024 = 687.2 MB
>
> After:
> 1491654 * 4.0 / 1024 = 5826.8 MB
>
> From your previous words, 10GB memory should be swapped out.
>
> Even if the average compression ratio is 0, the swap-out count of zswap
> should be about 100% more than that of SSD.  However, the ratio here
> appears unreasonable.

Excellent point!
In order to understand this better myself, I ran usemem with 1 process that
tries to allocate 58G:

cgroup memory.high = 60,000,000,000
usemem --init-time -w -O -n 1 58g

 usemem -n 1 58g         Before          After
 ----------------------------------------------
 pswpout                586,352              0
 zswpout                  1,005      1,042,963
 ----------------------------------------------
 Total swapout          587,357      1,042,963
 ----------------------------------------------

In the case where the cgroup has only 1 process, your rationale above applies
(more or less). This shows the stats collected every 100 micro-seconds from
the critical section of the workload right before the memory limit is reached
(Before and After):

===========================================================================
BEFORE zswap_store mTHP:
===========================================================================
  cgroup_memory    cgroup_memory      zswap_pool    zram_compr
                       w/o zswap     _total_size    _data_size
---------------------------------------------------------------------------
 59,999,600,640   59,999,600,640               0            74
 59,999,911,936   59,999,911,936               0    14,139,441
 60,000,083,968   59,997,634,560       2,449,408    53,448,205
 59,999,952,896   59,997,503,488       2,449,408    93,477,490
 60,000,083,968   59,997,634,560       2,449,408   133,152,754
 60,000,083,968   59,997,634,560       2,449,408   172,628,328
 59,999,952,896   59,997,503,488       2,449,408   212,760,840
 60,000,083,968   59,997,634,560       2,449,408   251,999,675
 60,000,083,968   59,997,634,560       2,449,408   291,058,130
 60,000,083,968   59,997,634,560       2,449,408   329,655,206
 59,999,793,152   59,997,343,744       2,449,408   368,938,904
 59,999,924,224   59,997,474,816       2,449,408   408,652,723
 59,999,924,224   59,997,474,816       2,449,408   447,830,071
 60,000,055,296   59,997,605,888       2,449,408   487,776,082
 59,999,924,224   59,997,474,816       2,449,408   526,826,360
 60,000,055,296   59,997,605,888       2,449,408   566,193,520
 60,000,055,296   59,997,605,888       2,449,408   604,625,879
 60,000,055,296   59,997,605,888       2,449,408   642,545,706
 59,999,924,224   59,997,474,816       2,449,408   681,958,173
 59,999,924,224   59,997,474,816       2,449,408   721,908,162
 59,999,924,224   59,997,474,816       2,449,408   761,935,307
 59,999,924,224   59,997,474,816       2,449,408   802,014,594
 59,999,924,224   59,997,474,816       2,449,408   842,087,656
 59,999,924,224   59,997,474,816       2,449,408   883,889,588
 59,999,924,224   59,997,474,816       2,449,408   804,458,184
 59,999,793,152   59,997,343,744       2,449,408    94,150,548
 54,938,513,408   54,936,064,000       2,449,408       172,644
 29,492,523,008   29,490,073,600       2,449,408       172,644
  3,465,621,504    3,463,172,096       2,449,408       131,457
---------------------------------------------------------------------------

===========================================================================
AFTER zswap_store mTHP:
===========================================================================
  cgroup_memory    cgroup_memory      zswap_pool
                       w/o zswap     _total_size
---------------------------------------------------------------------------
 55,578,234,880   55,578,234,880               0
 56,104,095,744   56,104,095,744               0
 56,644,898,816   56,644,898,816               0
 57,184,653,312   57,184,653,312               0
 57,706,057,728   57,706,057,728               0
 58,226,937,856   58,226,937,856               0
 58,747,293,696   58,747,293,696               0
 59,275,776,000   59,275,776,000               0
 59,793,772,544   59,793,772,544               0
 60,000,141,312   60,000,141,312               0
 59,999,956,992   59,999,956,992               0
 60,000,169,984   60,000,169,984               0
 59,999,907,840   59,951,226,880      48,680,960
 60,000,169,984   59,900,010,496     100,159,488
 60,000,169,984   59,848,007,680     152,162,304
 60,000,169,984   59,795,513,344     204,656,640
 59,999,907,840   59,743,477,760     256,430,080
 60,000,038,912   59,692,097,536     307,941,376
 60,000,169,984   59,641,208,832     358,961,152
 60,000,038,912   59,589,992,448     410,046,464
 60,000,169,984   59,539,005,440     461,164,544
 60,000,169,984   59,487,657,984     512,512,000
 60,000,038,912   59,434,868,736     565,170,176
 60,000,038,912   59,383,259,136     616,779,776
 60,000,169,984   59,331,518,464     668,651,520
 60,000,169,984   59,279,843,328     720,326,656
 60,000,169,984   59,228,626,944     771,543,040
 59,999,907,840   59,176,984,576     822,923,264
 60,000,038,912   59,124,326,400     875,712,512
 60,000,169,984   59,072,454,656     927,715,328
 60,000,169,984   59,020,156,928     980,013,056
 60,000,038,912   58,966,974,464   1,033,064,448
 60,000,038,912   58,913,628,160   1,086,410,752
 60,000,038,912   58,858,840,064   1,141,198,848
 60,000,169,984   58,804,314,112   1,195,855,872
 59,999,907,840   58,748,936,192   1,250,971,648
 60,000,169,984   58,695,131,136   1,305,038,848
 60,000,169,984   58,642,800,640   1,357,369,344
 60,000,169,984   58,589,782,016   1,410,387,968
 60,000,038,912   58,535,124,992   1,464,913,920
 60,000,169,984   58,482,925,568   1,517,244,416
 60,000,169,984   58,429,775,872   1,570,394,112
 60,000,038,912   58,376,658,944   1,623,379,968
 60,000,169,984   58,323,247,104   1,676,922,880
 60,000,038,912   58,271,113,216   1,728,925,696
 60,000,038,912   58,216,292,352   1,783,746,560
 60,000,038,912   58,164,289,536   1,835,749,376
 60,000,038,912   58,112,090,112   1,887,948,800
 60,000,038,912   58,058,350,592   1,941,688,320
 59,999,907,840   58,004,971,520   1,994,936,320
 60,000,169,984   57,953,165,312   2,047,004,672
 59,999,907,840   57,900,277,760   2,099,630,080
 60,000,038,912   57,847,586,816   2,152,452,096
 60,000,169,984   57,793,421,312   2,206,748,672
 59,999,907,840   57,741,582,336   2,258,325,504
 60,012,826,624   57,734,840,320   2,277,986,304
 60,098,793,472   57,820,348,416   2,278,445,056
 60,176,334,848   57,897,889,792   2,278,445,056
 60,269,826,048   57,991,380,992   2,278,445,056
 59,687,481,344   57,851,977,728   1,835,503,616
 59,049,836,544   57,888,108,544   1,161,728,000
 58,406,068,224   57,929,551,872     476,516,352
 43,837,923,328   43,837,919,232           4,096
 18,124,546,048   18,124,541,952           4,096
      2,846,720        2,842,624           4,096
---------------------------------------------------------------------------

I have also attached plots of the memory pressure reported by PSI. Both these
sets of data should give a sense of the added memory pressure on the cgroup
because of zswap mTHP stores. The data shows that the cgroup is over the
limit much more frequently in the "After" than in "Before". However, the
rationale that you suggested seems more reasonable and apparent in the
1 process case.
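The single-process numbers above can be cross-checked with a short script. This is only a sanity-check sketch: the variable names are illustrative, and the sample rows are copied from the tables above, assuming the first column is the cgroup's total charged memory, the second is the same figure without zswap, and the third is the zswap pool size.

```python
# Swap-out counters from the 1-process usemem run (in pages).
before = {"pswpout": 586_352, "zswpout": 1_005}
after = {"pswpout": 0, "zswpout": 1_042_963}

total_before = sum(before.values())
total_after = sum(after.values())
ratio = total_after / total_before
print(f"total swapout: before={total_before:,} after={total_after:,} "
      f"ratio={ratio:.2f}x")

# (cgroup_memory, cgroup_memory w/o zswap, zswap_pool_total_size) in bytes.
samples = [
    (60_000_083_968, 59_997_634_560, 2_449_408),      # BEFORE, steady state
    (59_999_907_840, 59_951_226_880, 48_680_960),     # AFTER, pool starts growing
    (60_269_826_048, 57_991_380_992, 2_278_445_056),  # AFTER, largest sample
]
for total, without_zswap, pool in samples:
    # The zswap pool is the difference between the two cgroup columns.
    assert total - without_zswap == pool

MEMORY_HIGH = 60_000_000_000
peak_overage = max(total for total, _, _ in samples) - MEMORY_HIGH
print(f"largest sampled overage above memory.high: {peak_overage:,} bytes")
```

In the single-process case the total swap-out count grows by well under 2x, in line with the statement that the reviewer's rationale applies "more or less" there.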
However, with 70 processes trying to allocate 1G, things get more complicated.
These are the functions that should provide more clarity:

[1] mm/memcontrol.c: mem_cgroup_handle_over_high().
[2] mm/memcontrol.c: try_charge_memcg().
[3] include/linux/resume_user_mode.h: resume_user_mode_work().

At a high level, when zswap mTHP compressed pool usage starts counting towards
the cgroup's memory.current, there are two inter-related effects occurring
that ultimately cause more reclaim to happen:

1) When each process reclaims a folio and zswap_store() writes out each page
   in the folio, it charges the compressed size to the memcg
   "obj_cgroup_charge_zswap(objcg, entry->length);". This calls [2] and sets
   current->memcg_nr_pages_over_high if the limit is exceeded. The comments
   towards the end of [2] are relevant.
2) When each of the processes returns from a page-fault, it checks if the
   cgroup memory usage is over the limit in [3], and if so, it will trigger
   reclaim.

I confirmed that in the case of usemem, all calls to [1] occur from the code
path in [3]. However, my takeaway from this is that the more reclaim results
in zswap_store(), e.g., from mTHP folios, the higher the likelihood of overage
recorded per-process in current->memcg_nr_pages_over_high, which could
potentially be causing each process to reclaim memory, even if it is possible
that the swapout from a few of the 70 processes could have brought the parent
cgroup under the limit.

Please do let me know if you have any other questions. Appreciate your
feedback and comments.
Thanks,
Kanchana

> --
> Best Regards,
> Huang, Ying

[snip]
On Mon, Aug 19, 2024 at 11:01 PM Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> wrote:
>
> Hi Ying,
>
> I confirmed that in the case of usemem, all calls to [1] occur from the code path in [3].
> However, my takeaway from this is that the more reclaim that results in zswap_store(),
> for e.g., from mTHP folios, there is higher likelihood of overage recorded per-process in
> current->memcg_nr_pages_over_high, which could potentially be causing each
> process to reclaim memory, even if it is possible that the swapout from a few of
> the 70 processes could have brought the parent cgroup under the limit.

Yeah IIUC, the memory increase from zswap store happens immediately/synchronously (swap_writepage() -> zswap_store() -> obj_cgroup_charge_zswap()), before the memory saving kicks in. This is a non-issue for swap - the memory saving doesn't happen right away, but it also doesn't increase memory usage (well, as you pointed out, obj_cgroup_charge_zswap() doesn't even happen).

And yes, this is compounded a) if you're in a high concurrency regime, where all tasks in the same cgroup, under memory pressure, go into reclaim, and b) for larger folios, where we compress multiple pages before the saving happens. I wonder how bad the effect is tho - could you quantify the reclamation amount that happens per zswap store somehow with tracing magic?

Also, I wonder if there is a "charge delta" mechanism, where we directly uncharge by (page size - zswap object size), to avoid the temporary double charging... Sort of like what folio migration is doing now vs. what it used to do. Seems complicated - not even sure if it's possible TBH.

>
> Please do let me know if you have any other questions. Appreciate your feedback
> and comments.
>
> Thanks,
> Kanchana
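The ordering described above (compressed objects are charged synchronously, the saving arrives later) can be sketched with a toy model. This is only an illustration, not kernel code: the 4 KiB page size, the 2:1 compression ratio, the rule that the uncompressed pages are released only after the whole folio has been stored, and the function name `peak_transient_charge` are all assumptions made for the sketch.

```python
PAGE_SIZE = 4096  # bytes; assumed base page size


def peak_transient_charge(pages_per_folio, compressed_size):
    """Toy model of the temporary double charge for one folio.

    Each page's compressed object is charged as soon as it is stored.
    The uncompressed pages are assumed to be uncharged only after the
    whole folio has been stored, so the extra charge keeps growing
    until then.

    Returns (peak_extra_bytes, eventual_saving_bytes).
    """
    charged = 0
    peak = 0
    for _ in range(pages_per_folio):
        charged += compressed_size  # synchronous charge per stored page
        peak = max(peak, charged)
    eventual_saving = pages_per_folio * PAGE_SIZE - charged
    return peak, eventual_saving


for name, nr_pages in (("4K page", 1), ("64K mTHP", 16), ("2M THP", 512)):
    peak, saving = peak_transient_charge(nr_pages, compressed_size=2048)
    print(f"{name}: peak extra charge {peak} B, eventual saving {saving} B")
```

Under these assumptions the transient overcharge grows linearly with folio size, which is the compounding effect in point b). A "charge delta" scheme would, in this model, correspond to never letting the extra charge accumulate in the first place.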
Hi Nhat, > -----Original Message----- > From: Nhat Pham <nphamcs@gmail.com> > Sent: Tuesday, August 20, 2024 2:14 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: Huang, Ying <ying.huang@intel.com>; linux-kernel@vger.kernel.org; linux- > mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com; > ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios > > On Mon, Aug 19, 2024 at 11:01 PM Sridhar, Kanchana P > <kanchana.p.sridhar@intel.com> wrote: > > > > Hi Ying, > > > > I confirmed that in the case of usemem, all calls to [1] occur from the code > path in [3]. > > However, my takeaway from this is that the more reclaim that results in > zswap_store(), > > for e.g., from mTHP folios, there is higher likelihood of overage recorded > per-process in > > current->memcg_nr_pages_over_high, which could potentially be causing > each > > process to reclaim memory, even if it is possible that the swapout from a > few of > > the 70 processes could have brought the parent cgroup under the limit. > > Yeah IIUC, the memory increase from zswap store happens > immediately/synchronously (swap_writepage() -> zswap_store() -> > obj_cgroup_charge_zswap()), before the memory saving kicks in. This is > a non-issue for swap - the memory saving doesn't happen right away, > but it also doesn't increase memory usage (well, as you pointed out, > obj_cgroup_charge_zswap() doesn't even happen). > > And yes, this is compounded a) if you're in a high concurrency regime, > where all tasks in the same cgroup, under memory pressure, all go into > reclaim. and b) for larger folios, where we compress multiple pages > before the saving happens. I wonder how bad the effect is tho - could > you quantify the reclamation amount that happens per zswap store > somehow with tracing magic? 
Thanks very much for the detailed comments and explanations! Sure, I will gather data on the reclamation amount that happens per zswap store and share. > > Also, I wonder if there is a "charge delta" mechanism, where we > directly uncharge by (page size - zswap object size), to avoid the > temporary double charging... Sort of like what folio migration is > doing now v.s what it used to do. Seems complicated - not even sure if > it's possible TBH. Yes, this is a very interesting idea. I will also look into the feasibility of doing this in the shrink_folio_list()->swap_writepage()->zswap_store() path. Thanks again for the discussion, really appreciate it. Thanks, Kanchana > > > > > Please do let me know if you have any other questions. Appreciate your > feedback > > and comments. > > > > Thanks, > > Kanchana
On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Hi All, > > This patch-series enables zswap_store() to accept and store mTHP > folios. The most significant contribution in this series is from the > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been > migrated to v6.11-rc3 in patch 2/4 of this series. > > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u > > Additionally, there is an attempt to modularize some of the functionality > in zswap_store(), to make it more amenable to supporting any-order > mTHPs. > > For instance, the determination of whether a folio is same-filled is > based on mapping an index into the folio to derive the page. Likewise, > there is a function "zswap_store_entry" added to store a zswap_entry in > the xarray. > > For accounting purposes, the patch-series adds per-order mTHP sysfs > "zswpout" counters that get incremented upon successful zswap_store of > an mTHP folio: > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > This patch-series is a precursor to ZSWAP compress batching of mTHP > swap-out and decompress batching of swap-ins based on swapin_readahead(), > using Intel IAA hardware acceleration, which we would like to submit in > subsequent RFC patch-series, with performance improvement data. > > Thanks to Ying Huang for pre-posting review feedback and suggestions! > > Changes since v3: > ================= > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444. > Thanks to Barry for suggesting aligning with Ryan Roberts' latest > changes to count_mthp_stat() so that it's always defined, even when THP > is disabled. Barry, I have also made one other change in page_io.c > where count_mthp_stat() is called by count_swpout_vm_event(). I would > appreciate it if you can review this. Thanks! 
> Hopefully this should resolve the kernel robot build errors. > > Changes since v2: > ================= > 1) Gathered usemem data using SSD as the backing swap device for zswap, > as suggested by Ying Huang. Ying, I would appreciate it if you can > review the latest data. Thanks! > 2) Generated the base commit info in the patches to attempt to address > the kernel test robot build errors. > 3) No code changes to the individual patches themselves. > > Changes since RFC v1: > ===================== > > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion. > Thanks Barry! > 2) Addressed some of the code review comments that Nhat Pham provided in > Ryan's initial RFC [1]: > - Added a comment about the cgroup zswap limit checks occurring once per > folio at the beginning of zswap_store(). > Nhat, Ryan, please do let me know if the comments convey the summary > from the RFC discussion. Thanks! > - Posted data on running the cgroup suite's zswap kselftest. > 3) Rebased to v6.11-rc3. > 4) Gathered performance data with usemem and the rebased patch-series. > > Performance Testing: > ==================== > Testing of this patch-series was done with the v6.11-rc3 mainline, without > and with this patch-series, on an Intel Sapphire Rapids server, > dual-socket 56 cores per socket, 4 IAA devices per socket. > > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for > ZSWAP. Core frequency was fixed at 2500MHz. > > The vm-scalability "usemem" test was run in a cgroup whose memory.high > was fixed.
Following a similar methodology as in Ryan Roberts' > "Swap-out mTHP without splitting" series [2], 70 usemem processes were > run, each allocating and writing 1G of memory: > > usemem --init-time -w -O -n 70 1g > > Since I was constrained to get the 70 usemem processes to generate > swapout activity with the 4G SSD, I ended up using different cgroup > memory.high fixed limits for the experiments with 64K mTHP and 2M THP: > > 64K mTHP experiments: cgroup memory fixed at 60G > 2M THP experiments : cgroup memory fixed at 55G > > The vm/sysfs stats included after the performance data provide details > on the swapout activity to SSD/ZSWAP. > > Other kernel configuration parameters: > > ZSWAP Compressor : LZ4, DEFLATE-IAA > ZSWAP Allocator : ZSMALLOC > SWAP page-cluster : 2 > > In the experiments where "deflate-iaa" is used as the ZSWAP compressor, > IAA "compression verification" is enabled. Hence each IAA compression > will be decompressed internally by the "iaa_crypto" driver, the crc-s > returned by the hardware will be compared and errors reported in case of > mismatches. Thus "deflate-iaa" helps ensure better data integrity as > compared to the software compressors. 
>
> Throughput reported by usemem and perf sys time for running the test
> are as follows, averaged across 3 runs:
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
>  ------------------------------------------------------------------
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> |                    |                   |       KB/s |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> |                    |                   |        sec |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
>  ------------------------------------------------------------------

Yeah no, this is not good. That throughput regression is concerning...

Is this tied to lz4 only, or do you observe similar trends in other compressors that are not deflate-iaa?
>
>  -----------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> |                              |   mainline |       Store |       Store |
> |                              |            |         lz4 | deflate-iaa |
> |-----------------------------------------------------------------------|
> | pswpin                       |          0 |           0 |           0 |
> | pswpout                      |    174,432 |           0 |           0 |
> | zswpin                       |        703 |         534 |         721 |
> | zswpout                      |      1,501 |   1,491,654 |   1,398,805 |
> |-----------------------------------------------------------------------|
> | thp_swpout                   |          0 |           0 |           0 |
> | thp_swpout_fallback          |          0 |           0 |           0 |
> | pgmajfault                   |      3,364 |       3,650 |       3,431 |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/zswpout |            |      63,200 |      63,244 |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/swpout  |     10,902 |           0 |           0 |
>  -----------------------------------------------------------------------
>

Yeah this is not good. Something fishy is going on, if we see this ginormous jump from 175000 (z)swpout pages to almost 1.5 million pages. That's a massive jump.

Either it's:

1. Your theory - zswap store keeps banging on the limit (which suggests incompatibility between the way zswap currently behaves and our reclaim logic)

2. The data here is ridiculously incompressible. We're needing to zswpout roughly 8.5 times the number of pages, so the saving is 8.5 times less => we only save 11.76% of memory for each page??? That's not right...

3. There's an outright bug somewhere.

Very suspicious.
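The "roughly 8.5 times" and "11.76%" figures above can be reproduced from the quoted counters. The sketch below is only a worked check of that arithmetic; the helper names are made up here, and the 16-pages-per-folio factor assumes 4 KiB base pages.

```python
# Counters copied from the 64K mTHP vmstat table quoted above.
MAINLINE_SSD = {"pswpout": 174_432, "zswpout": 1_501}
ZSWAP_LZ4 = {"pswpout": 0, "zswpout": 1_491_654}
MTHP_64K_ZSWPOUT_LZ4 = 63_200   # hugepages-64kB/stats/zswpout, counted per folio
PAGES_PER_64K_FOLIO = 16        # assumption: 4 KiB base pages


def total_swapout_pages(stats):
    """Pages swapped out to either the SSD (pswpout) or zswap (zswpout)."""
    return stats["pswpout"] + stats["zswpout"]


def implied_saving_pct(ratio):
    """Per-page saving implied if `ratio` times as many pages must be
    swapped out to free the same amount of memory (the reasoning in 2.)."""
    return 100 / ratio


ratio = total_swapout_pages(ZSWAP_LZ4) / total_swapout_pages(MAINLINE_SSD)
pages_from_64k_folios = MTHP_64K_ZSWPOUT_LZ4 * PAGES_PER_64K_FOLIO

print(f"swap-out ratio (lz4 vs. SSD): {ratio:.2f}x")
print(f"implied saving per page at 8.5x: {implied_saving_pct(8.5):.2f}%")
print(f"zswpout pages that came from 64K folios: {pages_from_64k_folios:,}")
```

The last line is an extra cross-check, not something claimed in the thread: it only converts the per-folio 64K counter into base pages so it can be compared against the global zswpout count.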
>
> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> =======================================================
>  ------------------------------------------------------------------
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> |                    |                   |       KB/s |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |    190,827 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |     32,026 |       -83% |
> |zswap-mTHP-Store    | ZSWAP deflate-iaa |    203,772 |         7% |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> |                    |                   |        sec |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |      27.23 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |     156.52 |      -475% |
> |zswap-mTHP-Store    | ZSWAP deflate-iaa |     171.45 |      -530% |
>  ------------------------------------------------------------------

I'm confused. This is a *regression*, right? A massive one at that - sys time is *more* than 5 times the old value?

>
>  -------------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> |                                |   mainline |       Store |       Store |
> |                                |            |         lz4 | deflate-iaa |
> |-------------------------------------------------------------------------|
> | pswpin                         |          0 |           0 |           0 |
> | pswpout                        |    797,184 |           0 |           0 |
> | zswpin                         |        690 |         649 |         669 |
> | zswpout                        |      1,465 |   1,596,382 |   1,540,766 |
> |-------------------------------------------------------------------------|
> | thp_swpout                     |      1,557 |           0 |           0 |
> | thp_swpout_fallback            |          0 |       3,248 |       3,752 |

This is also increased, but I suppose we're just doing more (z)swapping out in general...

> | pgmajfault                     |      3,726 |       6,470 |       5,691 |
> |-------------------------------------------------------------------------|
> | hugepages-2048kB/stats/zswpout |            |       2,416 |       2,261 |
> |-------------------------------------------------------------------------|
> | hugepages-2048kB/stats/swpout  |      1,557 |           0 |           0 |
>  -------------------------------------------------------------------------
>

I'm not trying to delay this patch - I fully believe in supporting zswap for larger pages (both mTHP and THP - whatever the memory reclaim subsystem throws at us).

But we need to get to the bottom of this :) These are very suspicious and concerning data. If this is something urgent, I can live with a gate to enable/disable this, but I'd much prefer we understand what's going on here.
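For reference, the "Improvement" column in the 2M tables and the "more than 5 times" remark can be checked with a few lines of arithmetic. This is a sketch under an assumption: the formula below (signed change relative to the SSD baseline, with the sign flipped for sys time because lower is better) is inferred from the numbers, not taken from the patch series, and `improvement_pct` is a name made up here.

```python
def improvement_pct(baseline, value, higher_is_better):
    """Signed percentage change relative to baseline; positive means better."""
    change = (value - baseline) / baseline * 100
    return round(change if higher_is_better else -change)


# Values copied from the 2MB PMD-THP tables quoted above.
throughput_kbps = {"SSD": 190_827, "ZSWAP lz4": 32_026, "ZSWAP deflate-iaa": 203_772}
sys_time_sec = {"SSD": 27.23, "ZSWAP lz4": 156.52, "ZSWAP deflate-iaa": 171.45}

for name in ("ZSWAP lz4", "ZSWAP deflate-iaa"):
    tput = improvement_pct(throughput_kbps["SSD"], throughput_kbps[name], True)
    sys_pct = improvement_pct(sys_time_sec["SSD"], sys_time_sec[name], False)
    slowdown = sys_time_sec[name] / sys_time_sec["SSD"]
    print(f"{name}: throughput {tput}%, sys time {sys_pct}% ({slowdown:.2f}x baseline)")
```

With this reading, a sys time "improvement" of -475% means the lz4 run spent about 5.75 times the baseline sys time, so the large negative percentages are regressions.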
Hi Nhat, > -----Original Message----- > From: Nhat Pham <nphamcs@gmail.com> > Sent: Wednesday, August 21, 2024 7:43 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios > > On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Hi All, > > > > This patch-series enables zswap_store() to accept and store mTHP > > folios. The most significant contribution in this series is from the > > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been > > migrated to v6.11-rc3 in patch 2/4 of this series. > > > > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting > > https://lore.kernel.org/linux-mm/20231019110543.3284654-1- > ryan.roberts@arm.com/T/#u > > > > Additionally, there is an attempt to modularize some of the functionality > > in zswap_store(), to make it more amenable to supporting any-order > > mTHPs. > > > > For instance, the determination of whether a folio is same-filled is > > based on mapping an index into the folio to derive the page. Likewise, > > there is a function "zswap_store_entry" added to store a zswap_entry in > > the xarray. 
> > > > For accounting purposes, the patch-series adds per-order mTHP sysfs > > "zswpout" counters that get incremented upon successful zswap_store of > > an mTHP folio: > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > This patch-series is a precursor to ZSWAP compress batching of mTHP > > swap-out and decompress batching of swap-ins based on > swapin_readahead(), > > using Intel IAA hardware acceleration, which we would like to submit in > > subsequent RFC patch-series, with performance improvement data. > > > > Thanks to Ying Huang for pre-posting review feedback and suggestions! > > > > Changes since v3: > > ================= > > 1) Rebased to mm-unstable commit > 8c0b4f7b65fd1ca7af01267f491e815a40d77444. > > Thanks to Barry for suggesting aligning with Ryan Roberts' latest > > changes to count_mthp_stat() so that it's always defined, even when THP > > is disabled. Barry, I have also made one other change in page_io.c > > where count_mthp_stat() is called by count_swpout_vm_event(). I would > > appreciate it if you can review this. Thanks! > > Hopefully this should resolve the kernel robot build errors. > > > > Changes since v2: > > ================= > > 1) Gathered usemem data using SSD as the backing swap device for zswap, > > as suggested by Ying Huang. Ying, I would appreciate it if you can > > review the latest data. Thanks! > > 2) Generated the base commit info in the patches to attempt to address > > the kernel test robot build errors. > > 3) No code changes to the individual patches themselves. > > > > Changes since RFC v1: > > ===================== > > > > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion. > > Thanks Barry! > > 2) Addressed some of the code review comments that Nhat Pham provided > in > > Ryan's initial RFC [1]: > > - Added a comment about the cgroup zswap limit checks occuring once > per > > folio at the beginning of zswap_store(). 
> > Nhat, Ryan, please do let me know if the comments convey the summary > > from the RFC discussion. Thanks! > > - Posted data on running the cgroup suite's zswap kselftest. > > 3) Rebased to v6.11-rc3. > > 4) Gathered performance data with usemem and the rebased patch-series. > > > > Performance Testing: > > ==================== > > Testing of this patch-series was done with the v6.11-rc3 mainline, without > > and with this patch-series, on an Intel Sapphire Rapids server, > > dual-socket 56 cores per socket, 4 IAA devices per socket. > > > > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for > > ZSWAP. Core frequency was fixed at 2500MHz. > > > > The vm-scalability "usemem" test was run in a cgroup whose memory.high > > was fixed. Following a similar methodology as in Ryan Roberts' > > "Swap-out mTHP without splitting" series [2], 70 usemem processes were > > run, each allocating and writing 1G of memory: > > > > usemem --init-time -w -O -n 70 1g > > > > Since I was constrained to get the 70 usemem processes to generate > > swapout activity with the 4G SSD, I ended up using different cgroup > > memory.high fixed limits for the experiments with 64K mTHP and 2M THP: > > > > 64K mTHP experiments: cgroup memory fixed at 60G > > 2M THP experiments : cgroup memory fixed at 55G > > > > The vm/sysfs stats included after the performance data provide details > > on the swapout activity to SSD/ZSWAP. > > > > Other kernel configuration parameters: > > > > ZSWAP Compressor : LZ4, DEFLATE-IAA > > ZSWAP Allocator : ZSMALLOC > > SWAP page-cluster : 2 > > > > In the experiments where "deflate-iaa" is used as the ZSWAP compressor, > > IAA "compression verification" is enabled. Hence each IAA compression > > will be decompressed internally by the "iaa_crypto" driver, the crc-s > > returned by the hardware will be compared and errors reported in case of > > mismatches. 
Thus "deflate-iaa" helps ensure better data integrity as > > compared to the software compressors. > > > > Throughput reported by usemem and perf sys time for running the test > > are as follows, averaged across 3 runs: > > > > 64KB mTHP (cgroup memory.high set to 60G): > > ========================================== > > ------------------------------------------------------------------ > > | | | | | > > |Kernel | mTHP SWAP-OUT | Throughput | Improvement| > > | | | KB/s | | > > |--------------------|-------------------|------------|------------| > > |v6.11-rc3 mainline | SSD | 335,346 | Baseline | > > |zswap-mTHP-Store | ZSWAP lz4 | 271,558 | -19% | > > |zswap-mTHP-Store | ZSWAP deflate-iaa | 388,154 | 16% | > > |------------------------------------------------------------------| > > | | | | | > > |Kernel | mTHP SWAP-OUT | Sys time | Improvement| > > | | | sec | | > > |--------------------|-------------------|------------|------------| > > |v6.11-rc3 mainline | SSD | 91.37 | Baseline | > > |zswap-mTHP=Store | ZSWAP lz4 | 265.43 | -191% | > > |zswap-mTHP-Store | ZSWAP deflate-iaa | 235.60 | -158% | > > ------------------------------------------------------------------ > > Yeah no, this is not good. That throughput regression is concerning... > > Is this tied to lz4 only, or do you observe similar trends in other > compressors that are not deflate-iaa? Let me gather data with other software compressors. 
> > > > > > ----------------------------------------------------------------------- > > | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | zswap- > mTHP | > > | | mainline | Store | Store | > > | | | lz4 | deflate-iaa | > > |-----------------------------------------------------------------------| > > | pswpin | 0 | 0 | 0 | > > | pswpout | 174,432 | 0 | 0 | > > | zswpin | 703 | 534 | 721 | > > | zswpout | 1,501 | 1,491,654 | 1,398,805 | > > |-----------------------------------------------------------------------| > > | thp_swpout | 0 | 0 | 0 | > > | thp_swpout_fallback | 0 | 0 | 0 | > > | pgmajfault | 3,364 | 3,650 | 3,431 | > > |-----------------------------------------------------------------------| > > | hugepages-64kB/stats/zswpout | | 63,200 | 63,244 | > > |-----------------------------------------------------------------------| > > | hugepages-64kB/stats/swpout | 10,902 | 0 | 0 | > > ----------------------------------------------------------------------- > > > > Yeah this is not good. Something fishy is going on, if we see this > ginormous jump from 175000 (z)swpout pages to almost 1.5 million > pages. That's a massive jump. > > Either it's: > > 1.Your theory - zswap store keeps banging on the limit (which suggests > incompatibility between the way zswap currently behaves and our > reclaim logic) > > 2. The data here is ridiculously incompressible. We're needing to > zswpout roughly 8.5 times the number of pages, so the saving is 8.5 > less => we only save 11.76% of memory for each page??? That's not > right... > > 3. There's an outright bug somewhere. > > Very suspicious. 
> > > > > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G): > > ======================================================= > > ------------------------------------------------------------------ > > | | | | | > > |Kernel | mTHP SWAP-OUT | Throughput | Improvement| > > | | | KB/s | | > > |--------------------|-------------------|------------|------------| > > |v6.11-rc3 mainline | SSD | 190,827 | Baseline | > > |zswap-mTHP-Store | ZSWAP lz4 | 32,026 | -83% | > > |zswap-mTHP-Store | ZSWAP deflate-iaa | 203,772 | 7% | > > |------------------------------------------------------------------| > > | | | | | > > |Kernel | mTHP SWAP-OUT | Sys time | Improvement| > > | | | sec | | > > |--------------------|-------------------|------------|------------| > > |v6.11-rc3 mainline | SSD | 27.23 | Baseline | > > |zswap-mTHP-Store | ZSWAP lz4 | 156.52 | -475% | > > |zswap-mTHP-Store | ZSWAP deflate-iaa | 171.45 | -530% | > > ------------------------------------------------------------------ > > I'm confused. This is a *regression* right? A massive one that is - > sys time is *more* than 5 times the old value? > > > > > ------------------------------------------------------------------------- > > | VMSTATS, mTHP ZSWAP/SSD stats | v6.11-rc3 | zswap-mTHP | zswap- > mTHP | > > | | mainline | Store | Store | > > | | | lz4 | deflate-iaa | > > |-------------------------------------------------------------------------| > > | pswpin | 0 | 0 | 0 | > > | pswpout | 797,184 | 0 | 0 | > > | zswpin | 690 | 649 | 669 | > > | zswpout | 1,465 | 1,596,382 | 1,540,766 | > > |-------------------------------------------------------------------------| > > | thp_swpout | 1,557 | 0 | 0 | > > | thp_swpout_fallback | 0 | 3,248 | 3,752 | > > This is also increased, but I supposed we're just doing more > (z)swapping out in general... 
> > > | pgmajfault | 3,726 | 6,470 | 5,691 | > > |-------------------------------------------------------------------------| > > | hugepages-2048kB/stats/zswpout | | 2,416 | 2,261 | > > |-------------------------------------------------------------------------| > > | hugepages-2048kB/stats/swpout | 1,557 | 0 | 0 | > > ------------------------------------------------------------------------- > > > > I'm not trying to delay this patch - I fully believe in supporting > zswap for larger pages (both mTHP and THP - whatever the memory > reclaim subsystem throws at us). > > But we need to get to the bottom of this :) These are very suspicious > and concerning data. If this is something urgent, I can live with a > gate to enable/disable this, but I'd much prefer we understand what's > going on here. Thanks for this analysis. I will debug this some more, so we can better understand these results. Thanks, Kanchana
[..]
>
> I'm not trying to delay this patch - I fully believe in supporting
> zswap for larger pages (both mTHP and THP - whatever the memory
> reclaim subsystem throws at us).
>
> But we need to get to the bottom of this :) These are very suspicious
> and concerning data. If this is something urgent, I can live with a
> gate to enable/disable this, but I'd much prefer we understand what's
> going on here.

Agreed. I don't think merging this support is urgent, so I think we should better understand what is happening here. If there is a problem with how we charge compressed memory today (temporary double charges), we need to sort this out before adding the mTHP support, as it will only make things worse.

I have to admit I didn't take a deep look at the discussion and data, so there may be other problems that I didn't notice. It seems to me like Kanchana is doing more debugging to understand what is happening, so that's great!

As for the patches, we should sort out the impact on a higher level before discussing implementation details. From a quick look though, it seems like the first patch can be dropped after Usama's patches that remove the same-filled handling from zswap land, and the last two patches can be squashed.
Hi Nhat, > -----Original Message----- > From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Sent: Wednesday, August 21, 2024 12:08 PM > To: Nhat Pham <nphamcs@gmail.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>; > Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios > > Hi Nhat, > > > -----Original Message----- > > From: Nhat Pham <nphamcs@gmail.com> > > Sent: Wednesday, August 21, 2024 7:43 AM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com; > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios > > > > On Sun, Aug 18, 2024 at 10:16 PM Kanchana P Sridhar > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > Hi All, > > > > > > This patch-series enables zswap_store() to accept and store mTHP > > > folios. The most significant contribution in this series is from the > > > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been > > > migrated to v6.11-rc3 in patch 2/4 of this series. > > > > > > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting > > > https://lore.kernel.org/linux-mm/20231019110543.3284654-1- > > ryan.roberts@arm.com/T/#u > > > > > > Additionally, there is an attempt to modularize some of the functionality > > > in zswap_store(), to make it more amenable to supporting any-order > > > mTHPs. 
> > > > > > For instance, the determination of whether a folio is same-filled is > > > based on mapping an index into the folio to derive the page. Likewise, > > > there is a function "zswap_store_entry" added to store a zswap_entry in > > > the xarray. > > > > > > For accounting purposes, the patch-series adds per-order mTHP sysfs > > > "zswpout" counters that get incremented upon successful zswap_store of > > > an mTHP folio: > > > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > > > This patch-series is a precursor to ZSWAP compress batching of mTHP > > > swap-out and decompress batching of swap-ins based on > > swapin_readahead(), > > > using Intel IAA hardware acceleration, which we would like to submit in > > > subsequent RFC patch-series, with performance improvement data. > > > > > > Thanks to Ying Huang for pre-posting review feedback and suggestions! > > > > > > Changes since v3: > > > ================= > > > 1) Rebased to mm-unstable commit > > 8c0b4f7b65fd1ca7af01267f491e815a40d77444. > > > Thanks to Barry for suggesting aligning with Ryan Roberts' latest > > > changes to count_mthp_stat() so that it's always defined, even when > THP > > > is disabled. Barry, I have also made one other change in page_io.c > > > where count_mthp_stat() is called by count_swpout_vm_event(). I > would > > > appreciate it if you can review this. Thanks! > > > Hopefully this should resolve the kernel robot build errors. > > > > > > Changes since v2: > > > ================= > > > 1) Gathered usemem data using SSD as the backing swap device for > zswap, > > > as suggested by Ying Huang. Ying, I would appreciate it if you can > > > review the latest data. Thanks! > > > 2) Generated the base commit info in the patches to attempt to address > > > the kernel test robot build errors. > > > 3) No code changes to the individual patches themselves. 
> > > > > > Changes since RFC v1: > > > ===================== > > > > > > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion. > > > Thanks Barry! > > > 2) Addressed some of the code review comments that Nhat Pham > provided > > in > > > Ryan's initial RFC [1]: > > > - Added a comment about the cgroup zswap limit checks occuring once > > per > > > folio at the beginning of zswap_store(). > > > Nhat, Ryan, please do let me know if the comments convey the > summary > > > from the RFC discussion. Thanks! > > > - Posted data on running the cgroup suite's zswap kselftest. > > > 3) Rebased to v6.11-rc3. > > > 4) Gathered performance data with usemem and the rebased patch- > series. > > > > > > Performance Testing: > > > ==================== > > > Testing of this patch-series was done with the v6.11-rc3 mainline, without > > > and with this patch-series, on an Intel Sapphire Rapids server, > > > dual-socket 56 cores per socket, 4 IAA devices per socket. > > > > > > The system has 503 GiB RAM, with a 4G SSD as the backing swap device > for > > > ZSWAP. Core frequency was fixed at 2500MHz. > > > > > > The vm-scalability "usemem" test was run in a cgroup whose > memory.high > > > was fixed. Following a similar methodology as in Ryan Roberts' > > > "Swap-out mTHP without splitting" series [2], 70 usemem processes were > > > run, each allocating and writing 1G of memory: > > > > > > usemem --init-time -w -O -n 70 1g > > > > > > Since I was constrained to get the 70 usemem processes to generate > > > swapout activity with the 4G SSD, I ended up using different cgroup > > > memory.high fixed limits for the experiments with 64K mTHP and 2M THP: > > > > > > 64K mTHP experiments: cgroup memory fixed at 60G > > > 2M THP experiments : cgroup memory fixed at 55G > > > > > > The vm/sysfs stats included after the performance data provide details > > > on the swapout activity to SSD/ZSWAP. 
> > >
> > > Other kernel configuration parameters:
> > >
> > > ZSWAP Compressor  : LZ4, DEFLATE-IAA
> > > ZSWAP Allocator   : ZSMALLOC
> > > SWAP page-cluster : 2
> > >
> > > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > > IAA "compression verification" is enabled. Hence each IAA compression
> > > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > > returned by the hardware will be compared and errors reported in case of
> > > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > > compared to the software compressors.
> > >
> > > Throughput reported by usemem and perf sys time for running the test
> > > are as follows, averaged across 3 runs:
> > >
> > > 64KB mTHP (cgroup memory.high set to 60G):
> > > ==========================================
> > > --------------------------------------------------------------------
> > > |                    |                   |            |            |
> > > |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> > > |                    |                   |       KB/s |            |
> > > |--------------------|-------------------|------------|------------|
> > > |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> > > |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> > > |zswap-mTHP-Store    | ZSWAP deflate-iaa |    388,154 |        16% |
> > > |------------------------------------------------------------------|
> > > |                    |                   |            |            |
> > > |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> > > |                    |                   |        sec |            |
> > > |--------------------|-------------------|------------|------------|
> > > |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> > > |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> > > |zswap-mTHP-Store    | ZSWAP deflate-iaa |     235.60 |      -158% |
> > > --------------------------------------------------------------------
> >
> > Yeah no, this is not good. That throughput regression is concerning...
> >
> > Is this tied to lz4 only, or do you observe similar trends in other
> > compressors that are not deflate-iaa?
>
> Let me gather data with other software compressors.
>
> > >
> > > -------------------------------------------------------------------------
> > > | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 | zswap-mTHP |   zswap-mTHP |
> > > |                              |   mainline |      Store |        Store |
> > > |                              |            |        lz4 |  deflate-iaa |
> > > |-----------------------------------------------------------------------|
> > > | pswpin                       |          0 |          0 |            0 |
> > > | pswpout                      |    174,432 |          0 |            0 |
> > > | zswpin                       |        703 |        534 |          721 |
> > > | zswpout                      |      1,501 |  1,491,654 |    1,398,805 |
> > > |-----------------------------------------------------------------------|
> > > | thp_swpout                   |          0 |          0 |            0 |
> > > | thp_swpout_fallback          |          0 |          0 |            0 |
> > > | pgmajfault                   |      3,364 |      3,650 |        3,431 |
> > > |-----------------------------------------------------------------------|
> > > | hugepages-64kB/stats/zswpout |            |     63,200 |       63,244 |
> > > |-----------------------------------------------------------------------|
> > > | hugepages-64kB/stats/swpout  |     10,902 |          0 |            0 |
> > > -------------------------------------------------------------------------
> > >
> >
> > Yeah this is not good. Something fishy is going on, if we see this
> > ginormous jump from 175000 (z)swpout pages to almost 1.5 million
> > pages. That's a massive jump.
> >
> > Either it's:
> >
> > 1. Your theory - zswap store keeps banging on the limit (which suggests
> > incompatibility between the way zswap currently behaves and our
> > reclaim logic)
> >
> > 2. The data here is ridiculously incompressible. We're needing to
> > zswpout roughly 8.5 times the number of pages, so the saving is 8.5
> > less => we only save 11.76% of memory for each page??? That's not
> > right...
> >
> > 3. There's an outright bug somewhere.
> >
> > Very suspicious.
> >
> > >
> > > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 55G):
> > > =======================================================
> > > --------------------------------------------------------------------
> > > |                    |                   |            |            |
> > > |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> > > |                    |                   |       KB/s |            |
> > > |--------------------|-------------------|------------|------------|
> > > |v6.11-rc3 mainline  | SSD               |    190,827 |   Baseline |
> > > |zswap-mTHP-Store    | ZSWAP lz4         |     32,026 |       -83% |
> > > |zswap-mTHP-Store    | ZSWAP deflate-iaa |    203,772 |         7% |
> > > |------------------------------------------------------------------|
> > > |                    |                   |            |            |
> > > |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> > > |                    |                   |        sec |            |
> > > |--------------------|-------------------|------------|------------|
> > > |v6.11-rc3 mainline  | SSD               |      27.23 |   Baseline |
> > > |zswap-mTHP-Store    | ZSWAP lz4         |     156.52 |      -475% |
> > > |zswap-mTHP-Store    | ZSWAP deflate-iaa |     171.45 |      -530% |
> > > --------------------------------------------------------------------
> >
> > I'm confused. This is a *regression* right? A massive one that is -
> > sys time is *more* than 5 times the old value?
> >
> > >
> > > ---------------------------------------------------------------------------
> > > | VMSTATS, mTHP ZSWAP/SSD stats  |  v6.11-rc3 | zswap-mTHP |   zswap-mTHP |
> > > |                                |   mainline |      Store |        Store |
> > > |                                |            |        lz4 |  deflate-iaa |
> > > |-------------------------------------------------------------------------|
> > > | pswpin                         |          0 |          0 |            0 |
> > > | pswpout                        |    797,184 |          0 |            0 |
> > > | zswpin                         |        690 |        649 |          669 |
> > > | zswpout                        |      1,465 |  1,596,382 |    1,540,766 |
> > > |-------------------------------------------------------------------------|
> > > | thp_swpout                     |      1,557 |          0 |            0 |
> > > | thp_swpout_fallback            |          0 |      3,248 |        3,752 |
> >
> > This is also increased, but I suppose we're just doing more
> > (z)swapping out in general...
> >
> > > | pgmajfault                     |      3,726 |      6,470 |        5,691 |
> > > |-------------------------------------------------------------------------|
> > > | hugepages-2048kB/stats/zswpout |            |      2,416 |        2,261 |
> > > |-------------------------------------------------------------------------|
> > > | hugepages-2048kB/stats/swpout  |      1,557 |          0 |            0 |
> > > ---------------------------------------------------------------------------
> > >
> >
> > I'm not trying to delay this patch - I fully believe in supporting
> > zswap for larger pages (both mTHP and THP - whatever the memory
> > reclaim subsystem throws at us).
> >
> > But we need to get to the bottom of this :) These are very suspicious
> > and concerning data. If this is something urgent, I can live with a
> > gate to enable/disable this, but I'd much prefer we understand what's
> > going on here.

I started out with 2 main hypotheses to explain why zswap incurs more
reclaim wrt SSD:

1) The cgroup zswap charge, that hastens the memory.high limit to be
   breached, and adds to the reclaim being triggered in
   mem_cgroup_handle_over_high().

2) Does a faster reclaim path somehow cause fewer allocation stalls; thereby
   causing more breaches of memory.high, hence more reclaim -- and does this
   cycle repeat, potentially leading to higher swapout activity with zswap?

I focused on gathering data with lz4 for this debug, under the reasonable
assumption that results with deflate-iaa will be better. Once we figure out
an overall direction on next steps, I will publish results with zswap lz4,
deflate-iaa, etc.

All experiments except "Exp 1.A" are run with
usemem --init-time -w -O -n 70 1g.

General settings for all data presented in this patch-series:

vm.swappiness = 100
zswap shrinker_enabled = N

Experiment 1 - impact of not doing cgroup zswap charge:
-------------------------------------------------------

I wanted to first understand by how much we improve without the cgroup
zswap charge.
I commented out the calls to both obj_cgroup_charge_zswap()
and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
We improve throughput by quite a bit with this change, and are now better
than mTHP getting swapped out to SSD. We have also slightly improved on the
sys time, though this is still a regression as compared to SSD. If you
recall, we were worse on throughput and sys time with v4.

Averages over 3 runs are summarized in each case.

Exp 1.A: usemem -n 1 58g:
-------------------------

64KB mTHP (cgroup memory.high set to 60G):
==========================================

                SSD mTHP    zswap mTHP v4   zswap mTHP no_charge
----------------------------------------------------------------
pswpout          586,352                0                      0
zswpout            1,005        1,042,963                587,181
----------------------------------------------------------------
Total swapout    587,357        1,042,963                587,181
----------------------------------------------------------------

Without the zswap charge to cgroup, the total swapout activity for
zswap-mTHP is on par with that of SSD-mTHP for the single process case.
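As a quick cross-check of the Exp 1.A table, the sketch below expresses each
zswap run's total swapout as a multiple of the SSD run. The helper names
(swapout_ratio, print_exp_1a_ratios) are made up for illustration and are not
part of the patch-set; the page counts are copied from the table.

```c
#include <stdio.h>

/* Hypothetical helper: total swapout of a run relative to a baseline run. */
double swapout_ratio(unsigned long total, unsigned long baseline)
{
	return (double)total / (double)baseline;
}

/* Prints the Exp 1.A ratios; page counts are copied from the table above. */
void print_exp_1a_ratios(void)
{
	const unsigned long ssd = 586352UL + 1005UL;	/* pswpout + zswpout */
	const unsigned long v4 = 1042963UL;		/* zswap mTHP v4 */
	const unsigned long no_charge = 587181UL;	/* zswap mTHP no_charge */

	printf("v4 vs SSD:        %.2fx\n", swapout_ratio(v4, ssd));
	printf("no_charge vs SSD: %.2fx\n", swapout_ratio(no_charge, ssd));
}
```

With these inputs, v4 with the charge swaps out roughly 1.78 times as many
pages as the SSD run, while the no_charge run lands at roughly 1.00.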
Exp 1.B: usemem -n 70 1g:
-------------------------
v4 results with cgroup zswap charge:
------------------------------------

64KB mTHP (cgroup memory.high set to 60G):
==========================================
--------------------------------------------------------------------
|                    |                   |            |            |
|Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
|                    |                   |       KB/s |            |
|--------------------|-------------------|------------|------------|
|v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
|zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
|------------------------------------------------------------------|
|                    |                   |            |            |
|Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
|                    |                   |        sec |            |
|--------------------|-------------------|------------|------------|
|v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
|zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
--------------------------------------------------------------------

-------------------------------------------------------------------------
| VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 | zswap-mTHP |   zswap-mTHP |
|                              |   mainline |      Store |        Store |
|                              |            |        lz4 |  deflate-iaa |
|-----------------------------------------------------------------------|
| pswpout                      |    174,432 |          0 |            0 |
| zswpout                      |      1,501 |  1,491,654 |    1,398,805 |
|-----------------------------------------------------------------------|
| hugepages-64kB/stats/zswpout |            |     63,200 |       63,244 |
|-----------------------------------------------------------------------|
| hugepages-64kB/stats/swpout  |     10,902 |          0 |            0 |
-------------------------------------------------------------------------

Debug results without cgroup zswap charge in both, "Before" and "After":
------------------------------------------------------------------------

64KB mTHP (cgroup memory.high set to 60G):
==========================================
--------------------------------------------------------------------
|                    |                   |            |            |
|Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
|                    |                   |       KB/s |            |
|--------------------|-------------------|------------|------------|
|v6.11-rc3 mainline  | SSD               |    300,565 |   Baseline |
|zswap-mTHP-Store    | ZSWAP lz4         |    420,125 |        40% |
|------------------------------------------------------------------|
|                    |                   |            |            |
|Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
|                    |                   |        sec |            |
|--------------------|-------------------|------------|------------|
|v6.11-rc3 mainline  | SSD               |      90.76 |   Baseline |
|zswap-mTHP-Store    | ZSWAP lz4         |     213.09 |      -135% |
--------------------------------------------------------------------

---------------------------------------------------------
| VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 | zswap-mTHP |
|                              |   mainline |      Store |
|                              |            |        lz4 |
|-------------------------------------------------------|
| pswpout                      |    330,640 |          0 |
| zswpout                      |      1,527 |  1,384,725 |
|-------------------------------------------------------|
| hugepages-64kB/stats/zswpout |            |     63,335 |
|-------------------------------------------------------|
| hugepages-64kB/stats/swpout  |     18,242 |          0 |
---------------------------------------------------------

Based on these results, I kept the cgroup zswap charging commented out in
subsequent debug steps, so as to not place zswap at a disadvantage when
trying to determine further causes for hypothesis (1).
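For anyone re-deriving the "Change" columns in the Exp 1.B tables, here is a
minimal sketch of the sign convention they appear to follow: throughput is
treated as higher-is-better and sys time as lower-is-better, so a longer sys
time shows up as a negative change. The helper change_pct is hypothetical and
only illustrates the arithmetic.

```c
/*
 * Hypothetical helper, not from the patch-set: signed "Change" in percent.
 * For higher-is-better metrics (throughput) a gain is positive; for
 * lower-is-better metrics (sys time) an increase shows up as negative.
 */
double change_pct(double base, double val, int higher_is_better)
{
	double delta = higher_is_better ? (val - base) : (base - val);

	return 100.0 * delta / base;
}
```

For example, change_pct(335346.0, 271558.0, 1) gives about -19.0 and
change_pct(300565.0, 420125.0, 1) about 39.8, in line with the -19% and 40%
table entries. For sys time, change_pct(91.37, 265.43, 0) gives about -190.5
against the reported -191%; the small gap presumably comes from rounding in
the averaged inputs.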
Experiment 2 - swap latency/reclamation with 64K mTHP:
------------------------------------------------------

        Number of swap_writepage   Total swap_writepage   Average swap_writepage
        calls from all cores       Latency (millisec)     Latency (microsec)
---------------------------------------------------------------------------
SSD                      21,373              165,434.9                    7,740
zswap                   344,109               55,446.8                      161
---------------------------------------------------------------------------

Reclamation analysis: 64k mTHP swapout:
---------------------------------------
"Before":
  Total SSD compressed data size   = 1,362,296,832 bytes
  Total SSD write IO latency       = 887,861 milliseconds

  Average SSD compressed data size = 1,089,837 bytes
  Average SSD write IO latency     = 710,289 microseconds

"After":
  Total ZSWAP compressed pool size = 2,610,657,430 bytes
  Total ZSWAP compress latency     = 55,984 milliseconds

  Average ZSWAP compress length    = 2,055 bytes
  Average ZSWAP compress latency   = 44 microseconds

  zswap-LZ4 mTHP compression ratio = 1.99
  All moderately compressible pages. 0 zswap_store errors.
  84% of pages compress to 2056 bytes.

Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
------------------------------------------------------------

I wanted to take a step back and understand how the mainline v6.11-rc3
handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
swapped out to ZSWAP. Interestingly, higher swapout activity is observed
with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
cgroup).
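Going back to the Experiment 2 figures above, the averages and the compression
ratio can be re-derived from the totals. This is only a sketch: the helper
names are hypothetical, and the ratio assumes each compressed object
corresponds to one 4096-byte page, which is an assumption on my part rather
than something stated with the data.

```c
/* Hypothetical helpers for re-deriving the Experiment 2 figures. */

/* Average latency per call in microseconds, from a total in milliseconds. */
double avg_latency_us(double total_ms, unsigned long calls)
{
	return total_ms * 1000.0 / (double)calls;
}

/* Compression ratio, assuming each compressed object covers one page. */
double compression_ratio(unsigned long page_bytes,
			 unsigned long avg_compressed_bytes)
{
	return (double)page_bytes / (double)avg_compressed_bytes;
}
```

avg_latency_us(165434.9, 21373) comes out near 7,740 microseconds and
avg_latency_us(55446.8, 344109) near 161 microseconds, matching the table;
compression_ratio(4096, 2055) is about 1.99.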
v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:

-------------------------------------------------------------------
SSD (CONFIG_ZSWAP is OFF)       ZSWAP              lz4    lzo-rle
-------------------------------------------------------------------
cgroup memory.events:           cgroup memory.events:

low                    0        low                  0          0
high               5,068        high           321,923    375,116
max                    0        max                  0          0
oom                    0        oom                  0          0
oom_kill               0        oom_kill             0          0
oom_group_kill         0        oom_group_kill       0          0
-------------------------------------------------------------------

SSD (CONFIG_ZSWAP is OFF):
--------------------------
pswpout              415,709
sys time (sec)        301.02
Throughput KB/s      155,970
memcg_high events      5,068
--------------------------

ZSWAP                    lz4        lz4        lz4    lzo-rle
--------------------------------------------------------------
zswpout            1,598,550  1,515,151  1,449,432  1,493,917
sys time (sec)        889.36     481.21     581.22     635.75
Throughput KB/s       35,176     14,765     20,253     21,407
memcg_high events    321,923    412,733    369,976    375,116
--------------------------------------------------------------

This shows that there is a sys time regression of -60% to -195% with
zswap as compared to SSD with 4K folios. The higher swapout activity with
zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).

I verified this to be the case even with the v6.7 kernel, which also
showed a 2.3X throughput improvement when we don't charge zswap:

ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
--------------------------------------------------------------------
zswpout              1,419,802      1,398,620
sys time (sec)           535.4         613.41
Throughput KB/s          8,671         20,045
memcg_high events      574,046        451,859
--------------------------------------------------------------------

Summary from the debug:
-----------------------
1) Excess reclaim is exacerbated by zswap charge to cgroup. Without the
   charge, reclaim is on par with SSD for mTHP in the single process
   case. The multiple process excess reclaim seems to be most likely
   resulting from over-reclaim done by the cores, in their respective calls
   to mem_cgroup_handle_over_high().

2) The higher swapout activity with zswap as compared to SSD does not
   appear to be specific to mTHP. Higher reclaim activity and sys time
   regression with zswap (as compared to a setup where there is only SSD
   configured as swap) exists with 4K pages as far back as v6.7.

3) The debug indicates the hypothesis (2) is worth more investigation:
   Does a faster reclaim path somehow cause fewer allocation stalls; thereby
   causing more breaches of memory.high, hence more reclaim -- and does this
   cycle repeat, potentially leading to higher swapout activity with zswap?
   Any advice on this being a possibility, and suggestions/pointers to
   verify this, would be greatly appreciated.

4) Interestingly, the number of memcg_high events drops significantly with
   64K mTHP as compared to the above 4K high events data, when tested with
   v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This
   potentially indicates something to do with allocation efficiency
   countering the higher reclaim that seems to be caused by swapout
   efficiency.

5) Nhat, Yosry: would it be possible for you to run the 4K folios
   usemem -n 70 1g (with 60G memory.high) experiment with 4G and some
   higher value SSD configuration in your setup and say, v6.11-rc3. I would
   like to rule out the memory constrained 4G SSD in my setup somehow
   skewing the behavior of zswap vis-a-vis
   allocation/memcg_handle_over_high/reclaim. I realize your time is
   valuable, however I think an independent confirmation of what I have
   been observing, would be really helpful for us to figure out potential
   root-causes and solutions.

6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high() to
   break out of the loop if we have reclaimed a total of at least
   "nr_pages":

        nr_reclaimed = reclaim_high(memcg,
                                    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
                                    gfp_mask);

+       nr_reclaimed_total += nr_reclaimed;
+
+       if (nr_reclaimed_total >= nr_pages)
+               goto out;

   This was only for debug purposes, and did seem to mitigate the higher
   reclaim behavior for 4K folios:

ZSWAP                    lz4        lz4        lz4
----------------------------------------------------------
zswpout            1,305,367  1,349,195  1,529,235
sys time (sec)        472.06     507.76     646.39
Throughput KB/s       55,144     21,811     88,310
memcg_high events    257,890    343,213    172,351
----------------------------------------------------------

   On average, this change results in 17% improvement in sys time, 2.35X
   improvement in throughput and 30% fewer memcg_high events.

I look forward to further inputs on next steps.

Thanks,
Kanchana

>
> Thanks for this analysis. I will debug this some more, so we can better
> understand these results.
>
> Thanks,
> Kanchana
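The averages quoted for the debug change in item 6 can be reproduced as
follows. This is a sketch under one assumption: that the baseline is the mean
of the three lz4 columns in the Experiment 3 table (leaving out the lzo-rle
column). The helper names are hypothetical.

```c
/* Hypothetical helper: arithmetic mean of three runs. */
double mean3(double a, double b, double c)
{
	return (a + b + c) / 3.0;
}

/* Percent reduction from "before" to "after" (positive means lower after). */
double reduction_pct(double before, double after)
{
	return 100.0 * (before - after) / before;
}
```

Under that assumption, reduction_pct(mean3(889.36, 481.21, 581.22),
mean3(472.06, 507.76, 646.39)) is about 16.7 for sys time,
mean3(55144, 21811, 88310) / mean3(35176, 14765, 20253) is about 2.35 for
throughput, and the same reduction_pct over the memcg_high rows is about 30.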
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, August 23, 2024 8:10 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; Zou, Nanhai
> <nanhai.zou@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> [..]
> >
> > I'm not trying to delay this patch - I fully believe in supporting
> > zswap for larger pages (both mTHP and THP - whatever the memory
> > reclaim subsystem throws at us).
> >
> > But we need to get to the bottom of this :) These are very suspicious
> > and concerning data. If this is something urgent, I can live with a
> > gate to enable/disable this, but I'd much prefer we understand what's
> > going on here.
>
> Agreed. I don't think merging this support is urgent, so I think we
> should better understand what is happening here. If there is a problem
> with how we charge compressed memory today (temporary double charges),
> we need to sort this out before the mTHP support, as it will only
> make things worse.
>
> I have to admit I didn't take a deep look at the discussion and data,
> so there may be other problems that I didn't notice. It seems to me
> like Kanchana is doing more debugging to understand what is happening,
> so that's great!

This sounds good. I just shared the data and my learnings from some
debugging experiments. I would appreciate it if you can review this and
suggest next steps.

>
> As for the patches, we should sort out the impact on a higher level
> before discussing implementation details. From a quick look though it
> seems like the first patch can be dropped after Usama's patches that
> remove the same-filled handling from zswap land, and the last two
> patches can be squashed.

Sure, this sounds good.

Thanks,
Kanchana
On Fri, Aug 23, 2024 at 11:21 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Nhat,
>
>
> I started out with 2 main hypotheses to explain why zswap incurs more
> reclaim wrt SSD:
>
> 1) The cgroup zswap charge, that hastens the memory.high limit to be
>    breached, and adds to the reclaim being triggered in
>    mem_cgroup_handle_over_high().
>
> 2) Does a faster reclaim path somehow cause fewer allocation stalls; thereby
>    causing more breaches of memory.high, hence more reclaim -- and does this
>    cycle repeat, potentially leading to higher swapout activity with zswap?

By faster reclaim path, do you mean zswap has a lower reclaim latency?

>
> I focused on gathering data with lz4 for this debug, under the reasonable
> assumption that results with deflate-iaa will be better. Once we figure out
> an overall direction on next steps, I will publish results with zswap lz4,
> deflate-iaa, etc.
>
> All experiments except "Exp 1.A" are run with
> usemem --init-time -w -O -n 70 1g.
>
> General settings for all data presented in this patch-series:
>
> vm.swappiness = 100
> zswap shrinker_enabled = N
>
> Experiment 1 - impact of not doing cgroup zswap charge:
> -------------------------------------------------------
>
> I wanted to first understand by how much we improve without the cgroup
> zswap charge. I commented out the calls to both obj_cgroup_charge_zswap()
> and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set.
> We improve throughput by quite a bit with this change, and are now better
> than mTHP getting swapped out to SSD. We have also slightly improved on the
> sys time, though this is still a regression as compared to SSD. If you
> recall, we were worse on throughput and sys time with v4.

I'm not 100% sure about the validity of this pair of experiments.

The thing is, you cannot ignore zswap's memory footprint altogether.
That's the whole point of the trade-off.
It's probably gigabytes worth of unaccounted memory usage - I see that
your SSD size is 4G, and since compression ratio is less than 2, that's
potentially 2G worth of memory give or take you are not charging to the
cgroup, which can altogether alter the memory pressure and reclaim
dynamics.

The zswap charging itself is not the problem - that's fair and healthy.
It might be the overreaction by the memory reclaim subsystem that seems
anomalous?

>
> Averages over 3 runs are summarized in each case.
>
> Exp 1.A: usemem -n 1 58g:
> -------------------------
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
>
>                 SSD mTHP    zswap mTHP v4   zswap mTHP no_charge
> ----------------------------------------------------------------
> pswpout          586,352                0                      0
> zswpout            1,005        1,042,963                587,181
> ----------------------------------------------------------------
> Total swapout    587,357        1,042,963                587,181
> ----------------------------------------------------------------
>
> Without the zswap charge to cgroup, the total swapout activity for
> zswap-mTHP is on par with that of SSD-mTHP for the single process case.
>
>
> Exp 1.B: usemem -n 70 1g:
> -------------------------
> v4 results with cgroup zswap charge:
> ------------------------------------
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
> --------------------------------------------------------------------
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
> |                    |                   |       KB/s |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |    335,346 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |    271,558 |       -19% |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
> |                    |                   |        sec |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |      91.37 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |     265.43 |      -191% |
> --------------------------------------------------------------------
>
> -------------------------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 | zswap-mTHP |   zswap-mTHP |
> |                              |   mainline |      Store |        Store |
> |                              |            |        lz4 |  deflate-iaa |
> |-----------------------------------------------------------------------|
> | pswpout                      |    174,432 |          0 |            0 |
> | zswpout                      |      1,501 |  1,491,654 |    1,398,805 |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/zswpout |            |     63,200 |       63,244 |
> |-----------------------------------------------------------------------|
> | hugepages-64kB/stats/swpout  |     10,902 |          0 |            0 |
> -------------------------------------------------------------------------
>
> Debug results without cgroup zswap charge in both, "Before" and "After":
> ------------------------------------------------------------------------
>
> 64KB mTHP (cgroup memory.high set to 60G):
> ==========================================
> --------------------------------------------------------------------
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     | Throughput |     Change |
> |                    |                   |       KB/s |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |    300,565 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |    420,125 |        40% |
> |------------------------------------------------------------------|
> |                    |                   |            |            |
> |Kernel              | mTHP SWAP-OUT     |   Sys time |     Change |
> |                    |                   |        sec |            |
> |--------------------|-------------------|------------|------------|
> |v6.11-rc3 mainline  | SSD               |      90.76 |   Baseline |
> |zswap-mTHP-Store    | ZSWAP lz4         |     213.09 |      -135% |
> --------------------------------------------------------------------
>
> ---------------------------------------------------------
> | VMSTATS, mTHP ZSWAP/SSD stats|  v6.11-rc3 | zswap-mTHP |
> |                              |   mainline |      Store |
> |                              |            |        lz4 |
> |-------------------------------------------------------|
> | pswpout                      |    330,640 |          0 |
> | zswpout                      |      1,527 |  1,384,725 |
> |-------------------------------------------------------|
> | hugepages-64kB/stats/zswpout |            |     63,335 |
> |-------------------------------------------------------|
> | hugepages-64kB/stats/swpout  |     18,242 |          0 |
> ---------------------------------------------------------
>

Hmm, in the 70 processes case, it looks like we're still seeing latency
regression, and that same pattern of overreclaiming, even without zswap
cgroup charging?

That seems like a hint - concurrency exacerbates the problem?

>
> Based on these results, I kept the cgroup zswap charging commented out in
> subsequent debug steps, so as to not place zswap at a disadvantage when
> trying to determine further causes for hypothesis (1).
>
>
> Experiment 2 - swap latency/reclamation with 64K mTHP:
> ------------------------------------------------------
>
>         Number of swap_writepage   Total swap_writepage   Average swap_writepage
>         calls from all cores       Latency (millisec)     Latency (microsec)
> ---------------------------------------------------------------------------
> SSD                      21,373              165,434.9                    7,740
> zswap                   344,109               55,446.8                      161
> ---------------------------------------------------------------------------
>
>
> Reclamation analysis: 64k mTHP swapout:
> ---------------------------------------
> "Before":
>   Total SSD compressed data size   = 1,362,296,832 bytes
>   Total SSD write IO latency       = 887,861 milliseconds
>
>   Average SSD compressed data size = 1,089,837 bytes
>   Average SSD write IO latency     = 710,289 microseconds
>
> "After":
>   Total ZSWAP compressed pool size = 2,610,657,430 bytes
>   Total ZSWAP compress latency     = 55,984 milliseconds
>
>   Average ZSWAP compress length    = 2,055 bytes
>   Average ZSWAP compress latency   = 44 microseconds
>
>   zswap-LZ4 mTHP compression ratio = 1.99
>   All moderately compressible pages. 0 zswap_store errors.
>   84% of pages compress to 2056 bytes.

Hmm this ratio isn't very good indeed - it is less than 2-to-1 memory
saving...

Internally, we often see 1-3 or 1-4 saving ratio (or even more).

Probably does not explain everything, but worth double checking -
could you check with zstd to see if the ratio improves?

>
>
> Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> ------------------------------------------------------------
>
> I wanted to take a step back and understand how the mainline v6.11-rc3
> handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
> swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> cgroup).
>
> v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
>
> -------------------------------------------------------------------
> SSD (CONFIG_ZSWAP is OFF)       ZSWAP              lz4    lzo-rle
> -------------------------------------------------------------------
> cgroup memory.events:           cgroup memory.events:
>
> low                    0        low                  0          0
> high               5,068        high           321,923    375,116
> max                    0        max                  0          0
> oom                    0        oom                  0          0
> oom_kill               0        oom_kill             0          0
> oom_group_kill         0        oom_group_kill       0          0
> -------------------------------------------------------------------
>
> SSD (CONFIG_ZSWAP is OFF):
> --------------------------
> pswpout              415,709
> sys time (sec)        301.02
> Throughput KB/s      155,970
> memcg_high events      5,068
> --------------------------
>
>
> ZSWAP                    lz4        lz4        lz4    lzo-rle
> --------------------------------------------------------------
> zswpout            1,598,550  1,515,151  1,449,432  1,493,917
> sys time (sec)        889.36     481.21     581.22     635.75
> Throughput KB/s       35,176     14,765     20,253     21,407
> memcg_high events    321,923    412,733    369,976    375,116
> --------------------------------------------------------------
>
> This shows that there is a performance regression of -60% to -195% with
> zswap as compared to SSD with 4K folios. The higher swapout activity with
> zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
>
> I verified this to be the case even with the v6.7 kernel, which also
> showed a 2.3X throughput improvement when we don't charge zswap:
>
> ZSWAP lz4                 v6.7      v6.7 with no cgroup zswap charge
> --------------------------------------------------------------------
> zswpout              1,419,802      1,398,620
> sys time (sec)           535.4         613.41

systime increases without zswap cgroup charging? That's strange...

> Throughput KB/s          8,671         20,045
> memcg_high events      574,046        451,859

So, on 4k folio setup, even without cgroup charge, we are still seeing:

1. More zswpout (than observed in SSD)
2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
3. 100 times the amount of memcg_high events? This is perhaps the
*strangest* to me.
You're already removing zswap cgroup charging, then
where does this come from? How can we have memory.high violation when
zswap does *not* contribute to memory usage?

Is this due to swap limit charging? Do you have a cgroup swap limit?

    mem_high = page_counter_read(&memcg->memory) >
           READ_ONCE(memcg->memory.high);
    swap_high = page_counter_read(&memcg->swap) >
           READ_ONCE(memcg->swap.high);
    [...]

    if (mem_high || swap_high) {
        /*
         * The allocating tasks in this cgroup will need to do
         * reclaim or be throttled to prevent further growth
         * of the memory or swap footprints.
         *
         * Target some best-effort fairness between the tasks,
         * and distribute reclaim work and delay penalties
         * based on how much each task is actually allocating.
         */
        current->memcg_nr_pages_over_high += batch;
        set_notify_resume(current);
        break;
    }

> --------------------------------------------------------------------
>
>
> Summary from the debug:
> -----------------------
> 1) Excess reclaim is exacerbated by zswap charge to cgroup. Without the
>    charge, reclaim is on par with SSD for mTHP in the single process
>    case. The multiple process excess reclaim seems to be most likely
>    resulting from over-reclaim done by the cores, in their respective calls
>    to mem_cgroup_handle_over_high().

Exacerbate, yes. I'm not 100% sure it's the sole or even the main cause.

You still see a degree of overreclaiming without zswap cgroup charging in:

1. 70 processes, with mTHP
2. 70 processes, with 4K folios.

>
> 2) The higher swapout activity with zswap as compared to SSD does not
>    appear to be specific to mTHP. Higher reclaim activity and sys time
>    regression with zswap (as compared to a setup where there is only SSD
>    configured as swap) exists with 4K pages as far back as v6.7.

Yeah I can believe that without mthp, the same-ish workload would
cause the same regression.
> > 3) The debug indicates the hypothesis (2) is worth more investigation: > Does a faster reclaim path somehow cause less allocation stalls; thereby > causing more breaches of memory.high, hence more reclaim -- and does this > cycle repeat, potentially leading to higher swapout activity with zswap? > Any advise on this being a possibility, and suggestions/pointers to > verify this, would be greatly appreciated. Add stalls along the zswap path? :) > > 4) Interestingly, the # of memcg_high events reduces significantly with 64K > mTHP as compared to the above 4K high events data, when tested with v4 > and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This > potentially indicates something to do with allocation efficiency > countering the higher reclaim that seems to be caused by swapout > efficiency. > > 5) Nhat, Yosry: would it be possible for you to run the 4K folios > usemem -n 70 1g (with 60G memory.high) expmnt with 4G and some higher > value SSD configuration in your setup and say, v6.11-rc3. I would like > to rule out the memory constrained 4G SSD in my setup somehow skewing > the behavior of zswap vis-a-vis > allocation/memcg_handle_over_high/reclaim. I realize your time is > valuable, however I think an independent confirmation of what I have > been observing, would be really helpful for us to figure out potential > root-causes and solutions. It might take awhile for me to set up your benchmark, but yeah 4G swapfile seems small on a 64G host - of course it depends on the workload, but this has a lot memory usage. In fact the total memory usage (70G?) is slightly above memory.high + 4G swapfile - note that this is exarcebated by, once again, zswap's less-than-100% memory saving ratio. > > 6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high() to > break out of the loop if we have reclaimed a total of at least > "nr_pages": > > nr_reclaimed = reclaim_high(memcg, > in_retry ? 
SWAP_CLUSTER_MAX : nr_pages, > gfp_mask); > > + nr_reclaimed_total += nr_reclaimed; > + > + if (nr_reclaimed_total >= nr_pages) > + goto out; > > > This was only for debug purposes, and did seem to mitigate the higher > reclaim behavior for 4K folios: > > ZSWAP lz4 lz4 lz4 > ---------------------------------------------------------- > zswpout 1,305,367 1,349,195 1,529,235 > sys time (sec) 472.06 507.76 646.39 > Throughput KB/s 55,144 21,811 88,310 > memcg_high events 257,890 343,213 172,351 > ---------------------------------------------------------- > > On average, this change results in 17% improvement in sys time, 2.35X > improvement in throughput and 30% fewer memcg_high events. > > I look forward to further inputs on next steps. > > Thanks, > Kanchana > > > > > > Thanks for this analysis. I will debug this some more, so we can better > > understand these results. > > > > Thanks, > > Kanchana
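The averages quoted for the debug change in item 6 (17% better sys time, 2.35X throughput, 30% fewer memcg_high events) can be re-derived from the per-run numbers in the thread. The sketch below assumes the baseline is the three lz4 runs from the earlier "v6.11-rc3 with no zswap charge, only 4K folios" table; that pairing is an inference from the numbers, not something the thread states. The helper names (`mean`, `pct_reduction`, `speedup`) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Percent by which "after" is lower than "before" (positive = reduction). */
double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Ratio of "after" to "before", e.g. 2.0 means a 2X improvement. */
double speedup(double before, double after)
{
	assert(before > 0.0);
	return after / before;
}

/* Assumed baseline: the three lz4 runs, 4K folios, no zswap charge. */
const double base_sys[]  = { 889.36, 481.21, 581.22 };
const double base_tput[] = { 35176, 14765, 20253 };
const double base_high[] = { 321923, 412733, 369976 };

/* The three lz4 runs with the early-exit debug change from item 6. */
const double dbg_sys[]  = { 472.06, 507.76, 646.39 };
const double dbg_tput[] = { 55144, 21811, 88310 };
const double dbg_high[] = { 257890, 343213, 172351 };
```

With these inputs, `pct_reduction(mean(base_sys, 3), mean(dbg_sys, 3))` comes out a little under 17, `speedup(mean(base_tput, 3), mean(dbg_tput, 3))` a little over 2.35, and the memcg_high reduction very close to 30, which is consistent with the summary in item 6. The run-to-run spread (for example 21,811 vs 88,310 KB/s) is large relative to these averages, so three runs give only a rough estimate.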
Hi Nhat, > -----Original Message----- > From: Nhat Pham <nphamcs@gmail.com> > Sent: Monday, August 26, 2024 7:12 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios > > On Fri, Aug 23, 2024 at 11:21 PM Sridhar, Kanchana P > <kanchana.p.sridhar@intel.com> wrote: > > > > Hi Nhat, > > > > > > I started out with 2 main hypotheses to explain why zswap incurs more > > reclaim wrt SSD: > > > > 1) The cgroup zswap charge, that hastens the memory.high limit to be > > breached, and adds to the reclaim being triggered in > > mem_cgroup_handle_over_high(). > > > > 2) Does a faster reclaim path somehow cause less allocation stalls; thereby > > causing more breaches of memory.high, hence more reclaim -- and does > this > > cycle repeat, potentially leading to higher swapout activity with zswap? > > By faster reclaim path, do you mean zswap has a lower reclaim latency? Thanks for your follow-up comments/suggestions. Yes, I was characterizing lower zswap reclaim latency as faster reclaim path. > > > > > I focused on gathering data with lz4 for this debug, under the reasonable > > assumption that results with deflate-iaa will be better. Once we figure out > > an overall direction on next steps, I will publish results with zswap lz4, > > deflate-iaa, etc. > > > > All experiments except "Exp 1.A" are run with > > usemem --init-time -w -O -n 70 1g. 
> > > > General settings for all data presented in this patch-series: > > > > vm.swappiness = 100 > > zswap shrinker_enabled = N > > > > Experiment 1 - impact of not doing cgroup zswap charge: > > ------------------------------------------------------- > > > > I wanted to first understand by how much we improve without the cgroup > > zswap charge. I commented out both, the calls to > obj_cgroup_charge_zswap() > > and obj_cgroup_uncharge_zswap() in zswap.c in my patch-set. > > We improve throughput by quite a bit with this change, and are now better > > than mTHP getting swapped out to SSD. We have also slightly improved on > the > > sys time, though this is still a regression as compared to SSD. If you > > recall, we were worse on throughput and sys time with v4. > > I'm not 100% sure about the validity this pair of experiments. > > The thing is, you cannot ignore zswap's memory footprint altogether. > That's the whole point of the trade-off. It's probably gigabytes worth > of unaccounted memory usage - I see that your SSD size is 4G, and > since compression ratio is less than 2, that's potentially 2G worth of > memory give or take you are not charging to the cgroup, which can > altogether alter the memory pressure and reclaim dynamics. I agree, the zswap memory utilization charging to the cgroup is the right thing to do (assuming we solve the temporary double-charging, as Yosry and you have pointed out). I have summarized the zswap memory footprint with different compressors in the results towards the end of this email. > > The zswap charging itself is not the problem - that's fair and > healthy. It might be the overreaction by the memory reclaim subsystem > that seems anomalous? I think so too, about the anomalous behavior. > > > > > Averages over 3 runs are summarized in each case. 
> > > > Exp 1.A: usemem -n 1 58g: > > ------------------------- > > > > 64KB mTHP (cgroup memory.high set to 60G): > > ========================================== > > > > SSD mTHP zswap mTHP v4 zswap mTHP no_charge > > ---------------------------------------------------------------- > > pswpout 586,352 0 0 > > zswpout 1,005 1,042,963 587,181 > > ---------------------------------------------------------------- > > Total swapout 587,357 1,042,963 587,181 > > ---------------------------------------------------------------- > > > > Without the zswap charge to cgroup, the total swapout activity for > > zswap-mTHP is on par with that of SSD-mTHP for the single process case. > > > > > > Exp 1.B: usemem -n 70 1g: > > ------------------------- > > v4 results with cgroup zswap charge: > > ------------------------------------ > > > > 64KB mTHP (cgroup memory.high set to 60G): > > ========================================== > > ------------------------------------------------------------------ > > | | | | | > > |Kernel | mTHP SWAP-OUT | Throughput | Change | > > | | | KB/s | | > > |--------------------|-------------------|------------|------------| > > |v6.11-rc3 mainline | SSD | 335,346 | Baseline | > > |zswap-mTHP-Store | ZSWAP lz4 | 271,558 | -19% | > > |------------------------------------------------------------------| > > | | | | | > > |Kernel | mTHP SWAP-OUT | Sys time | Change | > > | | | sec | | > > |--------------------|-------------------|------------|------------| > > |v6.11-rc3 mainline | SSD | 91.37 | Baseline | > > |zswap-mTHP=Store | ZSWAP lz4 | 265.43 | -191% | > > ------------------------------------------------------------------ > > > > ----------------------------------------------------------------------- > > | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | zswap- > mTHP | > > | | mainline | Store | Store | > > | | | lz4 | deflate-iaa | > > |-----------------------------------------------------------------------| > > | pswpout | 174,432 | 0 | 0 | > 
> | zswpout | 1,501 | 1,491,654 | 1,398,805 | > > |-----------------------------------------------------------------------| > > | hugepages-64kB/stats/zswpout | | 63,200 | 63,244 | > > |-----------------------------------------------------------------------| > > | hugepages-64kB/stats/swpout | 10,902 | 0 | 0 | > > ----------------------------------------------------------------------- > > > > Debug results without cgroup zswap charge in both, "Before" and "After": > > ------------------------------------------------------------------------ > > > > 64KB mTHP (cgroup memory.high set to 60G): > > ========================================== > > ------------------------------------------------------------------ > > | | | | | > > |Kernel | mTHP SWAP-OUT | Throughput | Change | > > | | | KB/s | | > > |--------------------|-------------------|------------|------------| > > |v6.11-rc3 mainline | SSD | 300,565 | Baseline | > > |zswap-mTHP-Store | ZSWAP lz4 | 420,125 | 40% | > > |------------------------------------------------------------------| > > | | | | | > > |Kernel | mTHP SWAP-OUT | Sys time | Change | > > | | | sec | | > > |--------------------|-------------------|------------|------------| > > |v6.11-rc3 mainline | SSD | 90.76 | Baseline | > > |zswap-mTHP=Store | ZSWAP lz4 | 213.09 | -135% | > > ------------------------------------------------------------------ > > > > --------------------------------------------------------- > > | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | > > | | mainline | Store | > > | | | lz4 | > > |---------------------------------------------------------- > > | pswpout | 330,640 | 0 | > > | zswpout | 1,527 | 1,384,725 | > > |---------------------------------------------------------- > > | hugepages-64kB/stats/zswpout | | 63,335 | > > |---------------------------------------------------------- > > | hugepages-64kB/stats/swpout | 18,242 | 0 | > > --------------------------------------------------------- > > > > Hmm, in the 70 
processes case, it looks like we're still seeing > latency regression, and that same pattern of overreclaiming, even > without zswap cgroup charging? > > That seems like a hint - concurrency exacerbates the problem? Agreed, that was my conclusion as well. > > > > > Based on these results, I kept the cgroup zswap charging commented out in > > subsequent debug steps, so as to not place zswap at a disadvantage when > > trying to determine further causes for hypothesis (1). > > > > > > Experiment 2 - swap latency/reclamation with 64K mTHP: > > ------------------------------------------------------ > > > > Number of swap_writepage Total swap_writepage Average > swap_writepage > > calls from all cores Latency (millisec) Latency (microsec) > > --------------------------------------------------------------------------- > > SSD 21,373 165,434.9 7,740 > > zswap 344,109 55,446.8 161 > > --------------------------------------------------------------------------- > > > > > > Reclamation analysis: 64k mTHP swapout: > > --------------------------------------- > > "Before": > > Total SSD compressed data size = 1,362,296,832 bytes > > Total SSD write IO latency = 887,861 milliseconds > > > > Average SSD compressed data size = 1,089,837 bytes > > Average SSD write IO latency = 710,289 microseconds > > > > "After": > > Total ZSWAP compressed pool size = 2,610,657,430 bytes > > Total ZSWAP compress latency = 55,984 milliseconds > > > > Average ZSWAP compress length = 2,055 bytes > > Average ZSWAP compress latency = 44 microseconds > > > > zswap-LZ4 mTHP compression ratio = 1.99 > > All moderately compressible pages. 0 zswap_store errors. > > 84% of pages compress to 2056 bytes. > > Hmm this ratio isn't very good indeed - it is less than 2-to-1 memory saving... > > Internally, we often see 1-3 or 1-4 saving ratio (or even more). Agree with this as well. In our experiments with other workloads, we typically see much higher ratios. 
> > Probably does not explain everything, but worth double checking - > could you check with zstd to see if the ratio improves. Sure. I gathered ratio and compressed memory footprint data today with 64K mTHP, the 4G SSD swapfile and different zswap compressors. This patch-series and no zswap charging, 64K mTHP: --------------------------------------------------------------------------- Total Total Average Average Comp compressed compression compressed compression ratio length latency length latency bytes milliseconds bytes nanoseconds --------------------------------------------------------------------------- SSD (no zswap) 1,362,296,832 887,861 lz4 2,610,657,430 55,984 2,055 44,065 1.99 zstd 729,129,528 50,986 565 39,510 7.25 deflate-iaa 1,286,533,438 44,785 1,415 49,252 2.89 --------------------------------------------------------------------------- zstd does very well on ratio, as expected. > > > > > > > Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP: > > ------------------------------------------------------------ > > > > I wanted to take a step back and understand how the mainline v6.11-rc3 > > handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and > when > > swapped out to ZSWAP. Interestingly, higher swapout activity is observed > > with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to > > cgroup). 
> > > > v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP: > > > > ------------------------------------------------------------- > > SSD (CONFIG_ZSWAP is OFF) ZSWAP lz4 lzo-rle > > ------------------------------------------------------------- > > cgroup memory.events: cgroup memory.events: > > > > low 0 low 0 0 > > high 5,068 high 321,923 375,116 > > max 0 max 0 0 > > oom 0 oom 0 0 > > oom_kill 0 oom_kill 0 0 > > oom_group_kill 0 oom_group_kill 0 0 > > ------------------------------------------------------------- > > > > SSD (CONFIG_ZSWAP is OFF): > > -------------------------- > > pswpout 415,709 > > sys time (sec) 301.02 > > Throughput KB/s 155,970 > > memcg_high events 5,068 > > -------------------------- > > > > > > ZSWAP lz4 lz4 lz4 lzo-rle > > -------------------------------------------------------------- > > zswpout 1,598,550 1,515,151 1,449,432 1,493,917 > > sys time (sec) 889.36 481.21 581.22 635.75 > > Throughput KB/s 35,176 14,765 20,253 21,407 > > memcg_high events 321,923 412,733 369,976 375,116 > > -------------------------------------------------------------- > > > > This shows that there is a performance regression of -60% to -195% with > > zswap as compared to SSD with 4K folios. The higher swapout activity with > > zswap is seen here too (i.e., this doesn't appear to be mTHP-specific). > > > > I verified this to be the case even with the v6.7 kernel, which also > > showed a 2.3X throughput improvement when we don't charge zswap: > > > > ZSWAP lz4 v6.7 v6.7 with no cgroup zswap charge > > -------------------------------------------------------------------- > > zswpout 1,419,802 1,398,620 > > sys time (sec) 535.4 613.41 > > systime increases without zswap cgroup charging? That's strange... Additional data gathered with v6.11-rc3 (listed below) based on your suggestion to investigate potential swap.high breaches should hopefully provide some explanation. 
> > > Throughput KB/s 8,671 20,045 > > memcg_high events 574,046 451,859 > > So, on 4k folio setup, even without cgroup charge, we are still seeing: > > 1. More zswpout (than observed in SSD) > 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging. > 3. 100 times the amount of memcg_high events? This is perhaps the > *strangest* to me. You're already removing zswap cgroup charging, then > where does this comes from? How can we have memory.high violation when > zswap does *not* contribute to memory usage? > > Is this due to swap limit charging? Do you have a cgroup swap limit? > > mem_high = page_counter_read(&memcg->memory) > > READ_ONCE(memcg->memory.high); > swap_high = page_counter_read(&memcg->swap) > > READ_ONCE(memcg->swap.high); > [...] > > if (mem_high || swap_high) { > /* > * The allocating tasks in this cgroup will need to do > * reclaim or be throttled to prevent further growth > * of the memory or swap footprints. > * > * Target some best-effort fairness between the tasks, > * and distribute reclaim work and delay penalties > * based on how much each task is actually allocating. > */ > current->memcg_nr_pages_over_high += batch; > set_notify_resume(current); > break; > } > I don't have a swap.high limit set on the cgroup; it is set to "max". I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different zswap compressors to verify if swap.high is breached with the 4G SSD swapfile. 
SSD (CONFIG_ZSWAP is OFF): SSD SSD SSD ------------------------------------------------------------ pswpout 415,709 1,032,170 636,582 sys time (sec) 301.02 328.15 306.98 Throughput KB/s 155,970 89,621 122,219 memcg_high events 5,068 15,072 8,344 memcg_swap_high events 0 0 0 memcg_swap_fail events 0 0 0 ------------------------------------------------------------ ZSWAP zstd zstd zstd ---------------------------------------------------------------- zswpout 1,391,524 1,382,965 1,417,307 sys time (sec) 474.68 568.24 489.80 Throughput KB/s 26,099 23,404 111,115 memcg_high events 335,112 340,335 162,260 memcg_swap_high events 0 0 0 memcg_swap_fail events 1,226,899 5,742,153 (mem_cgroup_try_charge_swap) memcg_memory_stat_pgactivate 1,259,547 (shrink_folio_list) ---------------------------------------------------------------- ZSWAP lzo-rle lzo-rle lzo-rle ----------------------------------------------------------- zswpout 1,493,917 1,363,040 1,428,133 sys time (sec) 635.75 498.63 484.65 Throughput KB/s 21,407 23,827 20,237 memcg_high events 375,116 352,814 373,667 memcg_swap_high events 0 0 0 memcg_swap_fail events 715,211 ----------------------------------------------------------- ZSWAP lz4 lz4 lz4 lz4 --------------------------------------------------------------------- zswpout 1,378,781 1,598,550 1,515,151 1,449,432 sys time (sec) 495.45 889.36 481.21 581.22 Throughput KB/s 26,248 35,176 14,765 20,253 memcg_high events 347,209 321,923 412,733 369,976 memcg_swap_high events 0 0 0 0 memcg_swap_fail events 580,103 0 --------------------------------------------------------------------- ZSWAP deflate-iaa deflate-iaa deflate-iaa ---------------------------------------------------------------- zswpout 380,471 1,440,902 1,397,965 sys time (sec) 329.06 570.77 467.41 Throughput KB/s 283,867 28,403 190,600 memcg_high events 5,551 422,831 28,154 memcg_swap_high events 0 0 0 memcg_swap_fail events 0 2,686,758 438,562 ---------------------------------------------------------------- 
There are no swap.high memcg events recorded in any of the SSD/zswap
experiments. However, I do see a significant number of memcg_swap_fail
events in some of the zswap runs, for all 3 compressors. This is not
consistent, because there are some runs with 0 memcg_swap_fail for all
compressors.

There is a possible correlation between memcg_swap_fail events
(/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
events. The root-cause appears to be that there are no available swap
slots: memcg_swap_fail is incremented, add_to_swap() fails in
shrink_folio_list(), followed by "activate_locked:" for the folio.
The folio re-activation is recorded in cgroup memory.stat pgactivate
events. The failure to swap out folios due to lack of swap slots could
contribute towards memory.high breaches.

 swp_entry_t folio_alloc_swap(struct folio *folio)
 {
  ...
	get_swap_pages(1, &entry, 0);
 out:
	if (mem_cgroup_try_charge_swap(folio, entry)) {
		put_swap_folio(folio, entry);
		entry.val = 0;
	}
	return entry;
 }

 int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 {
  ...
	if (!entry.val) {
		WARN_ONCE(1, "__mem_cgroup_try_charge_swap: MEMCG_SWAP_FAIL entry.val is 0");
		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
		return 0;
	}
  ...
 }

This is the call stack (v6.11-rc3 mainline) as reference for the above
analysis:

[  109.130504] __mem_cgroup_try_charge_swap: MEMCG_SWAP_FAIL entry.val is 0
[  109.130515] WARNING: CPU: 143 PID: 5200 at mm/memcontrol.c:5011 __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3))
[  109.130652] RIP: 0010:__mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3))
[  109.130682] Call Trace:
[  109.130686]  <TASK>
[  109.130689] ? __warn (kernel/panic.c:735)
[  109.130695] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3))
[  109.130698] ? report_bug (lib/bug.c:201 lib/bug.c:219)
[  109.130705] ? prb_read_valid (kernel/printk/printk_ringbuffer.c:2183)
[  109.130710] ? handle_bug (arch/x86/kernel/traps.c:239)
[  109.130715] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1))
[  109.130718] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
[  109.130722] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3))
[  109.130725] ? __mem_cgroup_try_charge_swap (mm/memcontrol.c:5011 (discriminator 3))
[  109.130728] folio_alloc_swap (mm/swap_slots.c:348)
[  109.130734] add_to_swap (mm/swap_state.c:189)
[  109.130737] shrink_folio_list (mm/vmscan.c:1235)
[  109.130744] ? __mod_zone_page_state (mm/vmstat.c:367)
[  109.130748] ? isolate_lru_folios (mm/vmscan.c:1598 mm/vmscan.c:1736)
[  109.130753] shrink_inactive_list (./include/linux/spinlock.h:376 mm/vmscan.c:1961)
[  109.130758] shrink_lruvec (mm/vmscan.c:2194 mm/vmscan.c:5706)
[  109.130763] shrink_node (mm/vmscan.c:5910 mm/vmscan.c:5948)
[  109.130768] do_try_to_free_pages (mm/vmscan.c:6134 mm/vmscan.c:6254)
[  109.130772] try_to_free_mem_cgroup_pages (./include/linux/sched/mm.h:355 ./include/linux/sched/mm.h:456 mm/vmscan.c:6588)
[  109.130778] reclaim_high (mm/memcontrol.c:1906)
[  109.130783] mem_cgroup_handle_over_high (./include/linux/memcontrol.h:556 mm/memcontrol.c:2001 mm/memcontrol.c:2108)
[  109.130787] irqentry_exit_to_user_mode (./include/linux/resume_user_mode.h:60 kernel/entry/common.c:114 ./include/linux/entry-common.h:328 kernel/entry/common.c:231)
[  109.130792] asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)

However, this is probably not the only cause for either the high # of
memory.high breaches or the over-reclaim with zswap, as seen in the lz4
data where the # of memcg_high events is significant even in cases where
there are no memcg_swap_fails.

Some observations/questions based on the above 4K folios swapout data:

1) There are more memcg_high events as the swapout latency reduces
   (i.e. faster swap-write path). This is even without charging zswap
   utilization to the cgroup.

2) There appears to be a direct correlation between a higher # of
   memcg_swap_fail events, an increase in memcg_high breaches and a
   reduction in usemem throughput. This, combined with the observation in
   (1), suggests that with a faster compressor we need more swap slots,
   which increases the probability of running out of swap slots with the
   4G SSD backing device.

3) Could the data shared earlier on reduction in memcg_high breaches with
   64K mTHP swapout provide some more clues, if we agree with (1) and (2):

   "Interestingly, the # of memcg_high events reduces significantly with 64K
   mTHP as compared to the above 4K memcg_high events data, when tested
   with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."

4) In the case of each zswap compressor, there are some runs that go
   through with 0 memcg_swap_fail events. These runs generally have
   fewer memcg_high breaches and better sys time/throughput.

5) For a given swap setup, there is some amount of variance in
   sys time for this workload.

6) All this suggests that the primary root cause is the concurrency setup,
   where there could be randomness between runs as to the # of processes
   that observe the memory.high breach due to other factors such as
   availability of swap slots for alloc.

To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
running out of swap slots, and anomalous behavior with over-reclaim when 70
concurrent processes are working with the 60G memory limit while trying to
allocate 1G each; with randomness in processes reacting to the breach.

The cgroup zswap charging exacerbates this situation, but is not a problem
in and of itself.

Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
doesn't seem to indicate any specific problems to be solved, other than the
temporary cgroup zswap double-charging.

Would it be fair to evaluate this patch-series based on a more realistic
swapfile configuration based on 176G ZRAM, for which I had shared the data
in v2? There weren't any problems with swap slots availability or any
anomalies that I can think of with this setup, other than the fact that the
"Before" and "After" sys times could not be directly compared for 2 key
reasons:

 - ZRAM compressed data is not charged to the cgroup, similar to SSD.
 - ZSWAP compressed data is charged to the cgroup.

This disparity causes fewer swapouts and better sys time/throughput in the
"Before" experiments.

In the "After" experiments, this disparity causes more swapouts only with
zswap-lz4, due to the poorer compression ratio combined with the cgroup
charge; and hence a regression in sys time/throughput. However, the better
compression ratio with deflate-iaa results in a comparable # of swapouts as
"Before", with better sys time/throughput.

My main rationale for suggesting the v2 ZRAM swapfile data is that the
disparities are the same as with the 4G SSD swapfile, but there are no
anomalies, and there are reasonable explanations for the data.

I would appreciate everyone's thoughts on this. If this sounds Ok, then I
can submit a v5 with the changes suggested by Yosry.

I am listing here the v2 data with 176G ZRAM swapfile again, just for
reference.
v2 data with cgroup zswap charging: ----------------------------------- 64KB mTHP: ========== ------------------------------------------------------------------ | | | | | |Kernel | mTHP SWAP-OUT | Throughput | Change| | | | KB/s | | |--------------------|-------------------|------------|------------| |v6.11-rc3 mainline | ZRAM lzo-rle | 118,928 | Baseline | |zswap-mTHP-Store | ZSWAP lz4 | 82,665 | -30% | |zswap-mTHP-Store | ZSWAP deflate-iaa | 176,210 | 48% | |------------------------------------------------------------------| | | | | | |Kernel | mTHP SWAP-OUT | Sys time | Change| | | | sec | | |--------------------|-------------------|------------|------------| |v6.11-rc3 mainline | ZRAM lzo-rle | 1,032.20 | Baseline | |zswap-mTHP=Store | ZSWAP lz4 | 1,854.51 | -80% | |zswap-mTHP-Store | ZSWAP deflate-iaa | 582.71 | 44% | ------------------------------------------------------------------ ----------------------------------------------------------------------- | VMSTATS, mTHP ZSWAP stats, | v6.11-rc3 | zswap-mTHP | zswap-mTHP | | mTHP ZRAM stats: | mainline | Store | Store | | | | lz4 | deflate-iaa | |-----------------------------------------------------------------------| | pswpin | 16 | 0 | 0 | | pswpout | 7,770,720 | 0 | 0 | | zswpin | 547 | 695 | 579 | | zswpout | 1,394 | 15,462,778 | 7,284,554 | |-----------------------------------------------------------------------| | thp_swpout | 0 | 0 | 0 | | thp_swpout_fallback | 0 | 0 | 0 | | pgmajfault | 3,786 | 3,541 | 3,367 | |-----------------------------------------------------------------------| | hugepages-64kB/stats/zswpout | | 966,328 | 455,196 | |-----------------------------------------------------------------------| | hugepages-64kB/stats/swpout | 485,670 | 0 | 0 | ----------------------------------------------------------------------- 2MB PMD-THP/2048K mTHP: ======================= ------------------------------------------------------------------ | | | | | |Kernel | mTHP SWAP-OUT | Throughput | Change| | | | 
KB/s | | |--------------------|-------------------|------------|------------| |v6.11-rc3 mainline | ZRAM lzo-rle | 177,340 | Baseline | |zswap-mTHP-Store | ZSWAP lz4 | 84,030 | -53% | |zswap-mTHP-Store | ZSWAP deflate-iaa | 185,691 | 5% | |------------------------------------------------------------------| | | | | | |Kernel | mTHP SWAP-OUT | Sys time | Change| | | | sec | | |--------------------|-------------------|------------|------------| |v6.11-rc3 mainline | ZRAM lzo-rle | 876.29 | Baseline | |zswap-mTHP-Store | ZSWAP lz4 | 1,740.55 | -99% | |zswap-mTHP-Store | ZSWAP deflate-iaa | 650.33 | 26% | ------------------------------------------------------------------ ------------------------------------------------------------------------- | VMSTATS, mTHP ZSWAP stats, | v6.11-rc3 | zswap-mTHP | zswap-mTHP | | mTHP ZRAM stats: | mainline | Store | Store | | | | lz4 | deflate-iaa | |-------------------------------------------------------------------------| | pswpin | 0 | 0 | 0 | | pswpout | 8,628,224 | 0 | 0 | | zswpin | 678 | 22,733 | 1,641 | | zswpout | 1,481 | 14,828,597 | 9,404,937 | |-------------------------------------------------------------------------| | thp_swpout | 16,852 | 0 | 0 | | thp_swpout_fallback | 0 | 0 | 0 | | pgmajfault | 3,467 | 25,550 | 4,800 | |-------------------------------------------------------------------------| | hugepages-2048kB/stats/zswpout | | 28,924 | 18,366 | |-------------------------------------------------------------------------| | hugepages-2048kB/stats/swpout | 16,852 | 0 | 0 | ------------------------------------------------------------------------- > > > -------------------------------------------------------------------- > > > > > > Summary from the debug: > > ----------------------- > > 1) Excess reclaim is exacerbated by zswap charge to cgroup. Without the > > charge, reclaim is on par with SSD for mTHP in the single process > > case. 
The multiple process excess reclaim seems to be most likely > > resulting from over-reclaim done by the cores, in their respective calls > > to mem_cgroup_handle_over_high(). > > Exarcebate, yes. I'm not 100% it's the sole or even the main cause. > > You still see a degree of overreclaiming without zswap cgroup charging in: > > 1. 70 processes, with mTHP > 2. 70 processes, with 4K folios. That's correct, although the over-reclaiming is not as bad with mTHP. > > > > > 2) The higher swapout activity with zswap as compared to SSD does not > > appear to be specific to mTHP. Higher reclaim activity and sys time > > regression with zswap (as compared to a setup where there is only SSD > > configured as swap) exists with 4K pages as far back as v6.7. > > Yeah I can believe that without mthp, the same-ish workload would > cause the same regression. This makes sense. > > > > > 3) The debug indicates the hypothesis (2) is worth more investigation: > > Does a faster reclaim path somehow cause less allocation stalls; thereby > > causing more breaches of memory.high, hence more reclaim -- and does > this > > cycle repeat, potentially leading to higher swapout activity with zswap? > > Any advise on this being a possibility, and suggestions/pointers to > > verify this, would be greatly appreciated. > > Add stalls along the zswap path? :) Yes, possibly! Hopefully, the swap slots availability learning from today's experiments makes things a little clearer. > > > > > 4) Interestingly, the # of memcg_high events reduces significantly with 64K > > mTHP as compared to the above 4K high events data, when tested with v4 > > and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP). This > > potentially indicates something to do with allocation efficiency > > countering the higher reclaim that seems to be caused by swapout > > efficiency. 
> >
> > 5) Nhat, Yosry: would it be possible for you to run the 4K folios
> > usemem -n 70 1g (with 60G memory.high) experiment with 4G and some higher
> > value SSD configuration in your setup and say, v6.11-rc3. I would like
> > to rule out the memory constrained 4G SSD in my setup somehow skewing
> > the behavior of zswap vis-a-vis
> > allocation/memcg_handle_over_high/reclaim. I realize your time is
> > valuable, however I think an independent confirmation of what I have
> > been observing, would be really helpful for us to figure out potential
> > root-causes and solutions.
>
> It might take a while for me to set up your benchmark, but yeah 4G
> swapfile seems small on a 64G host - of course it depends on the
> workload, but this has a lot of memory usage. In fact the total memory
> usage (70G?) is slightly above memory.high + 4G swapfile - note that
> this is exacerbated by, once again, zswap's less-than-100% memory
> saving ratio.

I agree, this is somewhat of an unrealistic setup. Hopefully the data and
my learnings shared from the experiments I ran today, should provide some
insights into possible root-causes for the anomalous over-reclaim behavior.

Thanks,
Kanchana

> >
> > 6) I tried a small change in memcontrol.c::mem_cgroup_handle_over_high() to
> > break out of the loop if we have reclaimed a total of at least
> > "nr_pages":
> >
> >         nr_reclaimed = reclaim_high(memcg,
> >                                     in_retry ? SWAP_CLUSTER_MAX : nr_pages,
> >                                     gfp_mask);
> >
> > +       nr_reclaimed_total += nr_reclaimed;
> > +
> > +       if (nr_reclaimed_total >= nr_pages)
> > +               goto out;
> >
> >
> > This was only for debug purposes, and did seem to mitigate the higher
> > reclaim behavior for 4K folios:
> >
> > ZSWAP                      lz4         lz4         lz4
> > ----------------------------------------------------------
> > zswpout              1,305,367   1,349,195   1,529,235
> > sys time (sec)          472.06      507.76      646.39
> > Throughput KB/s         55,144      21,811      88,310
> > memcg_high events      257,890     343,213     172,351
> > ----------------------------------------------------------
> >
> > On average, this change results in 17% improvement in sys time, 2.35X
> > improvement in throughput and 30% fewer memcg_high events.
> >
> > I look forward to further inputs on next steps.
> >
> > Thanks,
> > Kanchana
> >
> >
> > >
> > > Thanks for this analysis. I will debug this some more, so we can better
> > > understand these results.
> > >
> > > Thanks,
> > > Kanchana
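The averages quoted for the debug change (17% sys time, 2.35X throughput, 30% fewer memcg_high events) can be reproduced from the per-run numbers in the thread. The sketch below is only an illustration: it assumes the baseline is the three lz4 runs from the "Experiment 3" table (v6.11-rc3, 4K folios, no zswap charge), which the email does not state explicitly, and the variable and function names are made up for this example.

```python
from statistics import mean

# Per-run values copied from the thread (4K folios, ZSWAP lz4, no zswap charge).
# ASSUMPTION: "baseline" is the three lz4 runs of the Experiment 3 table.
baseline = {
    "sys_time_sec": [889.36, 481.21, 581.22],
    "throughput_kbps": [35176, 14765, 20253],
    "memcg_high": [321923, 412733, 369976],
}
# Runs with the early break-out added to mem_cgroup_handle_over_high().
debug = {
    "sys_time_sec": [472.06, 507.76, 646.39],
    "throughput_kbps": [55144, 21811, 88310],
    "memcg_high": [257890, 343213, 172351],
}


def pct_reduction(before, after):
    """Percentage by which the mean of `after` is lower than the mean of `before`."""
    return 100.0 * (mean(before) - mean(after)) / mean(before)


sys_time_gain = pct_reduction(baseline["sys_time_sec"], debug["sys_time_sec"])
throughput_ratio = mean(debug["throughput_kbps"]) / mean(baseline["throughput_kbps"])
memcg_high_drop = pct_reduction(baseline["memcg_high"], debug["memcg_high"])

print(f"sys time improvement:    {sys_time_gain:.1f}%")
print(f"throughput improvement:  {throughput_ratio:.2f}X")
print(f"fewer memcg_high events: {memcg_high_drop:.1f}%")
```

With only three runs per configuration and the run-to-run variance visible in the tables, these averages are indicative rather than conclusive.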
On Sun, Aug 18, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> base-commit: 8c0b4f7b65fd1ca7af01267f491e815a40d77444
> --
> 2.27.0
>

BTW, where does this commit come from? I assume this is post-mTHP
swapout - does it have mTHP swapin? Chris Li's patch series to improve
swap slot allocation?

Can't seem to find it when I fetch mm-unstable for some reason hmmmmm.
On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
>
> Agree with this as well. In our experiments with other workloads, we
> typically see much higher ratios.
>
> >
> > Probably does not explain everything, but worth double checking -
> > could you check with zstd to see if the ratio improves.
>
> Sure. I gathered ratio and compressed memory footprint data today with
> 64K mTHP, the 4G SSD swapfile and different zswap compressors.
>
> This patch-series and no zswap charging, 64K mTHP:
> ---------------------------------------------------------------------------
>                        Total         Total     Average      Average   Comp
>                   compressed   compression  compressed  compression  ratio
>                       length       latency      length      latency
>                        bytes  milliseconds       bytes  nanoseconds
> ---------------------------------------------------------------------------
> SSD (no zswap) 1,362,296,832       887,861
> lz4            2,610,657,430        55,984       2,055       44,065   1.99
> zstd             729,129,528        50,986         565       39,510   7.25
> deflate-iaa    1,286,533,438        44,785       1,415       49,252   2.89
> ---------------------------------------------------------------------------
>
> zstd does very well on ratio, as expected.

Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
average latency?

Why are we running benchmark on lz4 again? Sure there is no free lunch
and no compressor that works well on all kinds of data, but lz4's
performance here is so bad that it's borderline justifiable to
disable/bypass zswap with this kind of compression ratio...

Can I ask you to run benchmarking on zstd from now on?

>
> >
> > >
> > >
> > > Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > ------------------------------------------------------------
> > >
> > > I wanted to take a step back and understand how the mainline v6.11-rc3
> > > handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
> > > swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> > > with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> > > cgroup).
> > >
> > > v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > >
> > > -------------------------------------------------------------------
> > > SSD (CONFIG_ZSWAP is OFF)     ZSWAP                  lz4   lzo-rle
> > > -------------------------------------------------------------------
> > > cgroup memory.events:         cgroup memory.events:
> > >
> > > low                   0       low                      0         0
> > > high              5,068       high               321,923   375,116
> > > max                   0       max                      0         0
> > > oom                   0       oom                      0         0
> > > oom_kill              0       oom_kill                 0         0
> > > oom_group_kill        0       oom_group_kill           0         0
> > > -------------------------------------------------------------------
> > >
> > > SSD (CONFIG_ZSWAP is OFF):
> > > --------------------------
> > > pswpout            415,709
> > > sys time (sec)      301.02
> > > Throughput KB/s    155,970
> > > memcg_high events    5,068
> > > --------------------------
> > >
> > >
> > > ZSWAP                      lz4         lz4         lz4     lzo-rle
> > > ------------------------------------------------------------------
> > > zswpout              1,598,550   1,515,151   1,449,432   1,493,917
> > > sys time (sec)          889.36      481.21      581.22      635.75
> > > Throughput KB/s         35,176      14,765      20,253      21,407
> > > memcg_high events      321,923     412,733     369,976     375,116
> > > ------------------------------------------------------------------
> > >
> > > This shows that there is a performance regression of -60% to -195% with
> > > zswap as compared to SSD with 4K folios. The higher swapout activity with
> > > zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> > >
> > > I verified this to be the case even with the v6.7 kernel, which also
> > > showed a 2.3X throughput improvement when we don't charge zswap:
> > >
> > > ZSWAP lz4                   v6.7   v6.7 with no cgroup zswap charge
> > > --------------------------------------------------------------------
> > > zswpout                1,419,802      1,398,620
> > > sys time (sec)             535.4         613.41
> >
> > systime increases without zswap cgroup charging? That's strange...
>
> Additional data gathered with v6.11-rc3 (listed below) based on your suggestion
> to investigate potential swap.high breaches should hopefully provide some
> explanation.
>
> > > Throughput KB/s            8,671         20,045
> > > memcg_high events        574,046        451,859
> >
> > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> >
> > 1. More zswpout (than observed in SSD)
> > 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
> > 3. 100 times the amount of memcg_high events? This is perhaps the
> > *strangest* to me. You're already removing zswap cgroup charging, then
> > where does this come from? How can we have memory.high violation when
> > zswap does *not* contribute to memory usage?
> >
> > Is this due to swap limit charging? Do you have a cgroup swap limit?
> >
> >         mem_high = page_counter_read(&memcg->memory) >
> >                 READ_ONCE(memcg->memory.high);
> >         swap_high = page_counter_read(&memcg->swap) >
> >                 READ_ONCE(memcg->swap.high);
> >         [...]
> >
> >         if (mem_high || swap_high) {
> >                 /*
> >                  * The allocating tasks in this cgroup will need to do
> >                  * reclaim or be throttled to prevent further growth
> >                  * of the memory or swap footprints.
> >                  *
> >                  * Target some best-effort fairness between the tasks,
> >                  * and distribute reclaim work and delay penalties
> >                  * based on how much each task is actually allocating.
> >                  */
> >                 current->memcg_nr_pages_over_high += batch;
> >                 set_notify_resume(current);
> >                 break;
> >         }
> >
>
> I don't have a swap.high limit set on the cgroup; it is set to "max".
>
> I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
> zswap compressors to verify if swap.high is breached with the 4G SSD swapfile.
>
> SSD (CONFIG_ZSWAP is OFF):
>
>                                        SSD         SSD         SSD
> ------------------------------------------------------------------
> pswpout                            415,709   1,032,170     636,582
> sys time (sec)                      301.02      328.15      306.98
> Throughput KB/s                    155,970      89,621     122,219
> memcg_high events                    5,068      15,072       8,344
> memcg_swap_high events                   0           0           0
> memcg_swap_fail events                   0           0           0
> ------------------------------------------------------------------
>
> ZSWAP                                 zstd        zstd        zstd
> ------------------------------------------------------------------
> zswpout                          1,391,524   1,382,965   1,417,307
> sys time (sec)                      474.68      568.24      489.80
> Throughput KB/s                     26,099      23,404     111,115
> memcg_high events                  335,112     340,335     162,260
> memcg_swap_high events                   0           0           0
> memcg_swap_fail events           1,226,899   5,742,153
>  (mem_cgroup_try_charge_swap)
> memcg_memory_stat_pgactivate     1,259,547
>  (shrink_folio_list)
> ------------------------------------------------------------------
>
> ZSWAP                              lzo-rle     lzo-rle     lzo-rle
> ------------------------------------------------------------------
> zswpout                          1,493,917   1,363,040   1,428,133
> sys time (sec)                      635.75      498.63      484.65
> Throughput KB/s                     21,407      23,827      20,237
> memcg_high events                  375,116     352,814     373,667
> memcg_swap_high events                   0           0           0
> memcg_swap_fail events             715,211
> ------------------------------------------------------------------
>
> ZSWAP                                  lz4         lz4         lz4         lz4
> ------------------------------------------------------------------------------
> zswpout                          1,378,781   1,598,550   1,515,151   1,449,432
> sys time (sec)                      495.45      889.36      481.21      581.22
> Throughput KB/s                     26,248      35,176      14,765      20,253
> memcg_high events                  347,209     321,923     412,733     369,976
> memcg_swap_high events                   0           0           0           0
> memcg_swap_fail events             580,103           0
> ------------------------------------------------------------------------------
>
> ZSWAP                          deflate-iaa deflate-iaa deflate-iaa
> ------------------------------------------------------------------
> zswpout                            380,471   1,440,902   1,397,965
> sys time (sec)                      329.06      570.77      467.41
> Throughput KB/s                    283,867      28,403     190,600
> memcg_high events                    5,551     422,831      28,154
> memcg_swap_high events                   0           0           0
> memcg_swap_fail events                   0   2,686,758     438,562
> ------------------------------------------------------------------

Why are there 3 columns for each of the compressors? Is this different
runs of the same workload?

And why do some columns have missing cells?

>
> There are no swap.high memcg events recorded in any of the SSD/zswap
> experiments. However, I do see significant number of memcg_swap_fail
> events in some of the zswap runs, for all 3 compressors. This is not
> consistent, because there are some runs with 0 memcg_swap_fail for all
> compressors.
>
> There is a possible correlation between memcg_swap_fail events
> (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> events. The root-cause appears to be that there are no available swap
> slots, memcg_swap_fail is incremented, add_to_swap() fails in
> shrink_folio_list(), followed by "activate_locked:" for the folio.
> The folio re-activation is recorded in cgroup memory.stat pgactivate
> events. The failure to swap out folios due to lack of swap slots could
> contribute towards memory.high breaches.

Yeah FWIW, that was gonna be my first suggestion. This swapfile size
is wayyyy too small...

But that said, the link is not clear to me at all. The only thing I
can think of is lz4's performance sucks so bad that it's not saving
enough memory, leading to regression. And since it's still taking up
swap slot, we cannot use swap either?

>
> However, this is probably not the only cause for either the high # of
> memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> data where the memory.high is significant even in cases where there are no
> memcg_swap_fails.
>
> Some observations/questions based on the above 4K folios swapout data:
>
> 1) There are more memcg_high events as the swapout latency reduces
> (i.e. faster swap-write path). This is even without charging zswap
> utilization to the cgroup.

This is still inexplicable to me.
If we are not charging zswap usage, we shouldn't even be triggering the
reclaim_high() path, no?

I'm curious - can you use bpftrace to track where/when reclaim_high
is being called?

>
> 2) There appears to be a direct correlation between higher # of
> memcg_swap_fail events, and an increase in memcg_high breaches and
> reduction in usemem throughput. This combined with the observation in
> (1) suggests that with a faster compressor, we need more swap slots,
> that increases the probability of running out of swap slots with the 4G
> SSD backing device.
>
> 3) Could the data shared earlier on reduction in memcg_high breaches with
> 64K mTHP swapout provide some more clues, if we agree with (1) and (2):
>
> "Interestingly, the # of memcg_high events reduces significantly with 64K
> mTHP as compared to the above 4K memcg_high events data, when tested
> with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."
>
> 4) In the case of each zswap compressor, there are some runs that go
> through with 0 memcg_swap_fail events. These runs generally have
> fewer memcg_high breaches and better sys time/throughput.
>
> 5) For a given swap setup, there is some amount of variance in
> sys time for this workload.
>
> 6) All this suggests that the primary root cause is the concurrency setup,
> where there could be randomness between runs as to the # of processes
> that observe the memory.high breach due to other factors such as
> availability of swap slots for alloc.
>
> To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> running out of swap slots, and anomalous behavior with over-reclaim when 70
> concurrent processes are working with the 60G memory limit while trying to
> allocate 1G each; with randomness in processes reacting to the breach.
>
> The cgroup zswap charging exacerbates this situation, but is not a problem
> in and of itself.
>
> Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> doesn't seem to indicate any specific problems to be solved, other than the
> temporary cgroup zswap double-charging.
>
> Would it be fair to evaluate this patch-series based on a more realistic
> swapfile configuration based on 176G ZRAM, for which I had shared the data
> in v2? There weren't any problems with swap slots availability or any
> anomalies that I can think of with this setup, other than the fact that the
> "Before" and "After" sys times could not be directly compared for 2 key
> reasons:
>
> - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> - ZSWAP compressed data is charged to the cgroup.

Yeah that's a bit unfair still. Wild idea, but what about we compare
SSD without zswap (or SSD with zswap, but without this patch series so
that mTHP are not zswapped) vs. zswap-on-zram (i.e. with a backing
swapfile on zram block device).

It is stupid, I know. But let's take advantage of the fact that zram
is not charged to cgroup, pretending that its memory footprint is
empty?

I don't know how zram works though, so my apologies if it's a stupid
suggestion :)
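The "Comp ratio" column in the compressor table quoted above appears to be the page size divided by the average compressed length. The short check below assumes 4 KiB pages (4096 bytes), which the email does not spell out; under that assumption the reported ratios are reproduced to two decimals. The dictionary names are hypothetical and exist only for this illustration.

```python
PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB base pages, compressed one page at a time

# Average compressed length in bytes and reported ratio, copied from the table.
avg_compressed_bytes = {"lz4": 2055, "zstd": 565, "deflate-iaa": 1415}
reported_ratio = {"lz4": 1.99, "zstd": 7.25, "deflate-iaa": 2.89}

# Ratio = uncompressed page size / average compressed size.
ratios = {name: PAGE_SIZE / size for name, size in avg_compressed_bytes.items()}

for name, ratio in ratios.items():
    print(f"{name:12s} computed {ratio:.2f}  reported {reported_ratio[name]:.2f}")
```

Read this way, lz4 stores roughly half a page per swapped-out page on this workload, while zstd stores about one seventh of a page, which is the gap the reply above reacts to.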
On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> Yeah that's a bit unfair still. Wild idea, but what about we compare
> SSD without zswap (or SSD with zswap, but without this patch series so
> that mTHP are not zswapped) vs. zswap-on-zram (i.e. with a backing
> swapfile on zram block device).
>
> It is stupid, I know. But let's take advantage of the fact that zram
> is not charged to cgroup, pretending that its memory footprint is
> empty?
>
> I don't know how zram works though, so my apologies if it's a stupid
> suggestion :)

Oh nvm, looks like that's what you're already doing.

That said, the lz4 column is soooo bad still, whereas the deflate-iaa
clearly shows improvement! This means it could be
compressor-dependent.

Can you try it with zstd?
Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 27, 2024 7:55 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Sun, Aug 18, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > base-commit: 8c0b4f7b65fd1ca7af01267f491e815a40d77444
> > --
> > 2.27.0
> >
>
> BTW, where does this commit come from? I assume this is post-mTHP
> swapout - does it have mTHP swapin? Chris Li's patch series to improve
> swap slot allocation?
>
> Can't seem to find it when I fetch mm-unstable for some reason hmmmmm.

This was the latest mm-unstable as of 8/18/2024:

commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444
Author: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Date:   Thu May 11 13:22:30 2023 +0800

    mm: optimization on page allocation when CMA enabled

Let me rebase to the latest mm-unstable and send out an updated patchset.

mm-unstable as of 8/27/2024:

- Has some of Chris Li's patches to improve swap slot allocation:
  https://patchwork.kernel.org/project/linux-mm/patch/20240730-swap-allocator-v5-3-cb9c148b9297@kernel.org/
  https://patchwork.kernel.org/project/linux-mm/patch/20240730-swap-allocator-v5-2-cb9c148b9297@kernel.org/
  https://patchwork.kernel.org/project/linux-mm/patch/20240730-swap-allocator-v5-1-cb9c148b9297@kernel.org/
- Does not yet have mTHP swapin as far as I can tell.

Thanks,
Kanchana
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 27, 2024 8:24 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > Internally, we often see 1-3 or 1-4 saving ratio (or even more).
> >
> > Agree with this as well. In our experiments with other workloads, we
> > typically see much higher ratios.
> >
> > >
> > > Probably does not explain everything, but worth double checking -
> > > could you check with zstd to see if the ratio improves.
> >
> > Sure. I gathered ratio and compressed memory footprint data today with
> > 64K mTHP, the 4G SSD swapfile and different zswap compressors.
> >
> > This patch-series and no zswap charging, 64K mTHP:
> > ---------------------------------------------------------------------------
> >                        Total         Total     Average      Average   Comp
> >                   compressed   compression  compressed  compression  ratio
> >                       length       latency      length      latency
> >                        bytes  milliseconds       bytes  nanoseconds
> > ---------------------------------------------------------------------------
> > SSD (no zswap) 1,362,296,832       887,861
> > lz4            2,610,657,430        55,984       2,055       44,065   1.99
> > zstd             729,129,528        50,986         565       39,510   7.25
> > deflate-iaa    1,286,533,438        44,785       1,415       49,252   2.89
> > ---------------------------------------------------------------------------
> >
> > zstd does very well on ratio, as expected.
>
> Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower*
> average latency?
>
> Why are we running benchmark on lz4 again? Sure there is no free lunch
> and no compressor that works well on all kinds of data, but lz4's
> performance here is so bad that it's borderline justifiable to
> disable/bypass zswap with this kind of compression ratio...
>
> Can I ask you to run benchmarking on zstd from now on?

Sure, will do.

> >
> > >
> > > >
> > > >
> > > > Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP:
> > > > ------------------------------------------------------------
> > > >
> > > > I wanted to take a step back and understand how the mainline v6.11-rc3
> > > > handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) and when
> > > > swapped out to ZSWAP. Interestingly, higher swapout activity is observed
> > > > with 4K folios and v6.11-rc3 (with the debug change to not charge zswap to
> > > > cgroup).
> > > >
> > > > v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP:
> > > >
> > > > -------------------------------------------------------------------
> > > > SSD (CONFIG_ZSWAP is OFF)     ZSWAP                  lz4   lzo-rle
> > > > -------------------------------------------------------------------
> > > > cgroup memory.events:         cgroup memory.events:
> > > >
> > > > low                   0       low                      0         0
> > > > high              5,068       high               321,923   375,116
> > > > max                   0       max                      0         0
> > > > oom                   0       oom                      0         0
> > > > oom_kill              0       oom_kill                 0         0
> > > > oom_group_kill        0       oom_group_kill           0         0
> > > > -------------------------------------------------------------------
> > > >
> > > > SSD (CONFIG_ZSWAP is OFF):
> > > > --------------------------
> > > > pswpout            415,709
> > > > sys time (sec)      301.02
> > > > Throughput KB/s    155,970
> > > > memcg_high events    5,068
> > > > --------------------------
> > > >
> > > >
> > > > ZSWAP                      lz4         lz4         lz4     lzo-rle
> > > > ------------------------------------------------------------------
> > > > zswpout              1,598,550   1,515,151   1,449,432   1,493,917
> > > > sys time (sec)          889.36      481.21      581.22      635.75
> > > > Throughput KB/s         35,176      14,765      20,253      21,407
> > > > memcg_high events      321,923     412,733     369,976     375,116
> > > > ------------------------------------------------------------------
> > > >
> > > > This shows that there is a performance regression of -60% to -195% with
> > > > zswap as compared to SSD with 4K folios. The higher swapout activity with
> > > > zswap is seen here too (i.e., this doesn't appear to be mTHP-specific).
> > > >
> > > > I verified this to be the case even with the v6.7 kernel, which also
> > > > showed a 2.3X throughput improvement when we don't charge zswap:
> > > >
> > > > ZSWAP lz4                   v6.7   v6.7 with no cgroup zswap charge
> > > > --------------------------------------------------------------------
> > > > zswpout                1,419,802      1,398,620
> > > > sys time (sec)             535.4         613.41
> > >
> > > systime increases without zswap cgroup charging? That's strange...
> >
> > Additional data gathered with v6.11-rc3 (listed below) based on your suggestion
> > to investigate potential swap.high breaches should hopefully provide some
> > explanation.
> >
> > > > Throughput KB/s            8,671         20,045
> > > > memcg_high events        574,046        451,859
> > >
> > > So, on 4k folio setup, even without cgroup charge, we are still seeing:
> > >
> > > 1. More zswpout (than observed in SSD)
> > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup charging.
> > > 3. 100 times the amount of memcg_high events? This is perhaps the
> > > *strangest* to me. You're already removing zswap cgroup charging, then
> > > where does this come from? How can we have memory.high violation when
> > > zswap does *not* contribute to memory usage?
> > >
> > > Is this due to swap limit charging? Do you have a cgroup swap limit?
> > >
> > >         mem_high = page_counter_read(&memcg->memory) >
> > >                 READ_ONCE(memcg->memory.high);
> > >         swap_high = page_counter_read(&memcg->swap) >
> > >                 READ_ONCE(memcg->swap.high);
> > >         [...]
> > >
> > >         if (mem_high || swap_high) {
> > >                 /*
> > >                  * The allocating tasks in this cgroup will need to do
> > >                  * reclaim or be throttled to prevent further growth
> > >                  * of the memory or swap footprints.
> > >                  *
> > >                  * Target some best-effort fairness between the tasks,
> > >                  * and distribute reclaim work and delay penalties
> > >                  * based on how much each task is actually allocating.
> > >                  */
> > >                 current->memcg_nr_pages_over_high += batch;
> > >                 set_notify_resume(current);
> > >                 break;
> > >         }
> > >
> >
> > I don't have a swap.high limit set on the cgroup; it is set to "max".
> >
> > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and different
> > zswap compressors to verify if swap.high is breached with the 4G SSD swapfile.
> >
> > SSD (CONFIG_ZSWAP is OFF):
> >
> >                                        SSD         SSD         SSD
> > ------------------------------------------------------------------
> > pswpout                            415,709   1,032,170     636,582
> > sys time (sec)                      301.02      328.15      306.98
> > Throughput KB/s                    155,970      89,621     122,219
> > memcg_high events                    5,068      15,072       8,344
> > memcg_swap_high events                   0           0           0
> > memcg_swap_fail events                   0           0           0
> > ------------------------------------------------------------------
> >
> > ZSWAP                                 zstd        zstd        zstd
> > ------------------------------------------------------------------
> > zswpout                          1,391,524   1,382,965   1,417,307
> > sys time (sec)                      474.68      568.24      489.80
> > Throughput KB/s                     26,099      23,404     111,115
> > memcg_high events                  335,112     340,335     162,260
> > memcg_swap_high events                   0           0           0
> > memcg_swap_fail events           1,226,899   5,742,153
> >  (mem_cgroup_try_charge_swap)
> > memcg_memory_stat_pgactivate     1,259,547
> >  (shrink_folio_list)
> > ------------------------------------------------------------------
> >
> > ZSWAP                              lzo-rle     lzo-rle     lzo-rle
> > ------------------------------------------------------------------
> > zswpout                          1,493,917   1,363,040   1,428,133
> > sys time (sec)                      635.75      498.63      484.65
> > Throughput KB/s                     21,407      23,827      20,237
> > memcg_high events                  375,116     352,814     373,667
> > memcg_swap_high events                   0           0           0
> > memcg_swap_fail events             715,211
> > ------------------------------------------------------------------
> >
> > ZSWAP                                  lz4         lz4         lz4         lz4
> > ------------------------------------------------------------------------------
> > zswpout                          1,378,781   1,598,550   1,515,151   1,449,432
> > sys time (sec)                      495.45      889.36      481.21      581.22
> > Throughput KB/s                     26,248      35,176      14,765      20,253
> > memcg_high events                  347,209     321,923     412,733     369,976
> > memcg_swap_high events                   0           0           0           0
> > memcg_swap_fail events             580,103           0
> > ------------------------------------------------------------------------------
> >
> > ZSWAP                          deflate-iaa deflate-iaa deflate-iaa
> > ------------------------------------------------------------------
> > zswpout                            380,471   1,440,902   1,397,965
> > sys time (sec)                      329.06      570.77      467.41
> > Throughput KB/s                    283,867      28,403     190,600
> > memcg_high events                    5,551     422,831      28,154
> > memcg_swap_high events                   0           0           0
> > memcg_swap_fail events                   0   2,686,758     438,562
> > ------------------------------------------------------------------
>
> Why are there 3 columns for each of the compressors? Is this different
> runs of the same workload?
>
> And why do some columns have missing cells?

Yes, these are different runs of the same workload. Since there is some
amount of variance seen in the data, I figured it is best to publish the
metrics from the individual runs rather than averaging.

Some of these runs were gathered earlier with the same code base,
however, I wasn't monitoring/logging the memcg_swap_high/memcg_swap_fail
events at that time. For those runs, just these two counters have missing
column entries; the rest of the data is still valid.

> >
> > There are no swap.high memcg events recorded in any of the SSD/zswap
> > experiments. However, I do see significant number of memcg_swap_fail
> > events in some of the zswap runs, for all 3 compressors. This is not
> > consistent, because there are some runs with 0 memcg_swap_fail for all
> > compressors.
> >
> > There is a possible correlation between memcg_swap_fail events
> > (/sys/fs/cgroup/test/memory.swap.events) and the high # of memcg_high
> > events. The root-cause appears to be that there are no available swap
> > slots, memcg_swap_fail is incremented, add_to_swap() fails in
> > shrink_folio_list(), followed by "activate_locked:" for the folio.
> > The folio re-activation is recorded in cgroup memory.stat pgactivate
> > events. The failure to swap out folios due to lack of swap slots could
> > contribute towards memory.high breaches.
>
> Yeah FWIW, that was gonna be my first suggestion. This swapfile size
> is wayyyy too small...
>
> But that said, the link is not clear to me at all. The only thing I
> can think of is lz4's performance sucks so bad that it's not saving
> enough memory, leading to regression. And since it's still taking up
> swap slot, we cannot use swap either?

The occurrence of memcg_swap_fail events establishes that swap slots are
not available with 4G of swap space. This causes those 4K folios to
remain in memory, which can worsen an existing problem with memory.high
breaches.

However, it is worth noting that this is not the only contributor to
memcg_high events that still occur without zswap charging. The data shows
321,923 occurrences of memcg_high in Col 2 of the lz4 table, that also has
0 occurrences of memcg_swap_fail reported in the cgroup stats.

> >
> > However, this is probably not the only cause for either the high # of
> > memory.high breaches or the over-reclaim with zswap, as seen in the lz4
> > data where the memory.high is significant even in cases where there are no
> > memcg_swap_fails.
> >
> > Some observations/questions based on the above 4K folios swapout data:
> >
> > 1) There are more memcg_high events as the swapout latency reduces
> > (i.e. faster swap-write path). This is even without charging zswap
> > utilization to the cgroup.
>
> This is still inexplicable to me. If we are not charging zswap usage,
> we shouldn't even be triggering the reclaim_high() path, no?
>
> I'm curious - can you use bpftrace to track where/when reclaim_high
> is being called?

I had confirmed earlier with counters that all calls to reclaim_high()
were from include/linux/resume_user_mode.h::resume_user_mode_work().
I will confirm this with zstd and bpftrace and share.

Thanks,
Kanchana

> >
> > 2) There appears to be a direct correlation between higher # of
> > memcg_swap_fail events, and an increase in memcg_high breaches and
> > reduction in usemem throughput. This combined with the observation in
> > (1) suggests that with a faster compressor, we need more swap slots,
> > that increases the probability of running out of swap slots with the 4G
> > SSD backing device.
> >
> > 3) Could the data shared earlier on reduction in memcg_high breaches with
> > 64K mTHP swapout provide some more clues, if we agree with (1) and (2):
> >
> > "Interestingly, the # of memcg_high events reduces significantly with 64K
> > mTHP as compared to the above 4K memcg_high events data, when tested
> > with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP-mTHP)."
> >
> > 4) In the case of each zswap compressor, there are some runs that go
> > through with 0 memcg_swap_fail events. These runs generally have
> > fewer memcg_high breaches and better sys time/throughput.
> >
> > 5) For a given swap setup, there is some amount of variance in
> > sys time for this workload.
> >
> > 6) All this suggests that the primary root cause is the concurrency setup,
> > where there could be randomness between runs as to the # of processes
> > that observe the memory.high breach due to other factors such as
> > availability of swap slots for alloc.
> >
> > To summarize, I believe the root-cause is the 4G SSD swapfile resulting in
> > running out of swap slots, and anomalous behavior with over-reclaim when 70
> > concurrent processes are working with the 60G memory limit while trying to
> > allocate 1G each; with randomness in processes reacting to the breach.
> >
> > The cgroup zswap charging exacerbates this situation, but is not a problem
> > in and of itself.
> >
> > Nhat, as you pointed out, this is somewhat of an unrealistic scenario that
> > doesn't seem to indicate any specific problems to be solved, other than the
> > temporary cgroup zswap double-charging.
> >
> > Would it be fair to evaluate this patch-series based on a more realistic
> > swapfile configuration based on 176G ZRAM, for which I had shared the data
> > in v2? There weren't any problems with swap slots availability or any
> > anomalies that I can think of with this setup, other than the fact that the
> > "Before" and "After" sys times could not be directly compared for 2 key
> > reasons:
> >
> > - ZRAM compressed data is not charged to the cgroup, similar to SSD.
> > - ZSWAP compressed data is charged to the cgroup.
>
> Yeah that's a bit unfair still. Wild idea, but what about we compare
> SSD without zswap (or SSD with zswap, but without this patch series so
> that mTHP are not zswapped) vs. zswap-on-zram (i.e. with a backing
> swapfile on zram block device).
>
> It is stupid, I know. But let's take advantage of the fact that zram
> is not charged to cgroup, pretending that its memory footprint is
> empty?
>
> I don't know how zram works though, so my apologies if it's a stupid
> suggestion :)
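The "-60% to -195%" regression quoted from Experiment 3 matches the relative increase in sys time of the zswap runs over the single SSD run. The sketch below assumes that this is how the range was derived (the email does not say so explicitly); the run labels and helper name are invented for this illustration.

```python
# sys time in seconds, copied from the Experiment 3 tables (4K folios,
# v6.11-rc3, no zswap charge).
ssd_sys_time = 301.02
zswap_sys_time = {
    "lz4 run 1": 889.36,
    "lz4 run 2": 481.21,
    "lz4 run 3": 581.22,
    "lzo-rle": 635.75,
}


def regression_pct(baseline, measured):
    """Relative increase of `measured` over `baseline`, in percent."""
    return 100.0 * (measured - baseline) / baseline


regressions = {
    run: regression_pct(ssd_sys_time, secs) for run, secs in zswap_sys_time.items()
}
best = min(regressions.values())
worst = max(regressions.values())

for run, pct in regressions.items():
    print(f"{run:10s} sys time regression: -{pct:.0f}%")
print(f"range: -{best:.0f}% to -{worst:.0f}%")
```

Note that the SSD baseline here is a single run; the later three-run SSD table in the thread shows sys time varying between roughly 301 and 328 seconds, so the exact percentages shift a little depending on which SSD run is used.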
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, August 27, 2024 8:30 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > SSD without zswap (or SSD with zswap, but without this patch series so
> > that mTHP are not zswapped) vs. zswap-on-zram (i.e. with a backing
> > swapfile on zram block device).
> >
> > It is stupid, I know. But let's take advantage of the fact that zram
> > is not charged to cgroup, pretending that its memory footprint is
> > empty?
> >
> > I don't know how zram works though, so my apologies if it's a stupid
> > suggestion :)
>
> Oh nvm, looks like that's what you're already doing.
>
> That said, the lz4 column is soooo bad still, whereas the deflate-iaa
> clearly shows improvement! This means it could be
> compressor-dependent.
>
> Can you try it with zstd?

Sure, I will gather data with zstd.

Thanks,
Kanchana
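The swap-slot shortage discussed in this thread can be sized with a rough, worst-case calculation. The sketch below assumes 4 KiB pages, ignores the swap header page, and assumes all 70 usemem processes hold their full 1G at the same time, which is an upper bound rather than what necessarily happens in a given run. It relies on the point made in the thread that a page stored in zswap still occupies a swap slot on the backing device. All names are hypothetical.

```python
GIB = 1024 ** 3
PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB base pages

num_procs = 70              # usemem -n 70
bytes_per_proc = 1 * GIB    # each process allocates and writes 1g
memory_high = 60 * GIB      # cgroup memory.high for the 4K / 64K experiments
swapfile = 4 * GIB          # SSD swapfile backing zswap

# Number of 4K swap slots the swapfile provides (header page ignored).
swap_slots = swapfile // PAGE_SIZE

# Worst case: every process is fully populated at once, so everything above
# memory.high has to be swapped out and needs a slot, compressed or not.
total_demand = num_procs * bytes_per_proc
overflow_pages = max(0, total_demand - memory_high) // PAGE_SIZE
shortfall_pages = max(0, overflow_pages - swap_slots)

print(f"swap slots available:      {swap_slots:,}")
print(f"pages above memory.high:   {overflow_pages:,}")
print(f"worst-case slot shortfall: {shortfall_pages:,}")
```

Under these assumptions the workload can need about 2.5 times as many slots as the 4G swapfile offers, which is consistent with the observation in the thread that 70G of usage exceeds memory.high plus the swapfile, and with the memcg_swap_fail events seen in some runs.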
> -----Original Message----- > From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Sent: Tuesday, August 27, 2024 11:42 AM > To: Nhat Pham <nphamcs@gmail.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>; > Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios > > > > -----Original Message----- > > From: Nhat Pham <nphamcs@gmail.com> > > Sent: Tuesday, August 27, 2024 8:24 AM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com; > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios > > > > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > > Internally, we often see 1-3 or 1-4 saving ratio (or even more). > > > > > > Agree with this as well. In our experiments with other workloads, we > > > typically see much higher ratios. > > > > > > > > > > > Probably does not explain everything, but worth double checking - > > > > could you check with zstd to see if the ratio improves. > > > > > > Sure. I gathered ratio and compressed memory footprint data today with > > > 64K mTHP, the 4G SSD swapfile and different zswap compressors. 
> > > > > > This patch-series and no zswap charging, 64K mTHP: > > > --------------------------------------------------------------------------- > > > Total Total Average Average Comp > > > compressed compression compressed compression ratio > > > length latency length latency > > > bytes milliseconds bytes nanoseconds > > > --------------------------------------------------------------------------- > > > SSD (no zswap) 1,362,296,832 887,861 > > > lz4 2,610,657,430 55,984 2,055 44,065 1.99 > > > zstd 729,129,528 50,986 565 39,510 7.25 > > > deflate-iaa 1,286,533,438 44,785 1,415 49,252 2.89 > > > --------------------------------------------------------------------------- > > > > > > zstd does very well on ratio, as expected. > > > > Wait. So zstd is displaying 7-to-1 compression ratio? And has *lower* > > average latency? > > > > Why are we running benchmark on lz4 again? Sure there is no free lunch > > and no compressor that works well on all kind of data, but lz4's > > performance here is so bad that it's borderline justifiable to > > disable/bypass zswap with this kind of compresison ratio... > > > > Can I ask you to run benchmarking on zstd from now on? > > Sure, will do. > > > > > > > > > > > > > > > > > > > > > > > > > Experiment 3 - 4K folios swap characteristics SSD vs. ZSWAP: > > > > > ------------------------------------------------------------ > > > > > > > > > > I wanted to take a step back and understand how the mainline v6.11- > > rc3 > > > > > handles 4K folios when swapped out to SSD (CONFIG_ZSWAP is off) > and > > > > when > > > > > swapped out to ZSWAP. Interestingly, higher swapout activity is > > observed > > > > > with 4K folios and v6.11-rc3 (with the debug change to not charge > > zswap to > > > > > cgroup). 
> > > > > > > > > > v6.11-rc3 with no zswap charge, only 4K folios, no (m)THP: > > > > > > > > > > ------------------------------------------------------------- > > > > > SSD (CONFIG_ZSWAP is OFF) ZSWAP lz4 lzo-rle > > > > > ------------------------------------------------------------- > > > > > cgroup memory.events: cgroup memory.events: > > > > > > > > > > low 0 low 0 0 > > > > > high 5,068 high 321,923 375,116 > > > > > max 0 max 0 0 > > > > > oom 0 oom 0 0 > > > > > oom_kill 0 oom_kill 0 0 > > > > > oom_group_kill 0 oom_group_kill 0 0 > > > > > ------------------------------------------------------------- > > > > > > > > > > SSD (CONFIG_ZSWAP is OFF): > > > > > -------------------------- > > > > > pswpout 415,709 > > > > > sys time (sec) 301.02 > > > > > Throughput KB/s 155,970 > > > > > memcg_high events 5,068 > > > > > -------------------------- > > > > > > > > > > > > > > > ZSWAP lz4 lz4 lz4 lzo-rle > > > > > -------------------------------------------------------------- > > > > > zswpout 1,598,550 1,515,151 1,449,432 1,493,917 > > > > > sys time (sec) 889.36 481.21 581.22 635.75 > > > > > Throughput KB/s 35,176 14,765 20,253 21,407 > > > > > memcg_high events 321,923 412,733 369,976 375,116 > > > > > -------------------------------------------------------------- > > > > > > > > > > This shows that there is a performance regression of -60% to -195% > > with > > > > > zswap as compared to SSD with 4K folios. The higher swapout activity > > with > > > > > zswap is seen here too (i.e., this doesn't appear to be mTHP-specific). 
> > > > > > > > > > I verified this to be the case even with the v6.7 kernel, which also > > > > > showed a 2.3X throughput improvement when we don't charge > zswap: > > > > > > > > > > ZSWAP lz4 v6.7 v6.7 with no cgroup zswap charge > > > > > -------------------------------------------------------------------- > > > > > zswpout 1,419,802 1,398,620 > > > > > sys time (sec) 535.4 613.41 > > > > > > > > systime increases without zswap cgroup charging? That's strange... > > > > > > Additional data gathered with v6.11-rc3 (listed below) based on your > > suggestion > > > to investigate potential swap.high breaches should hopefully provide > some > > > explanation. > > > > > > > > > > > > Throughput KB/s 8,671 20,045 > > > > > memcg_high events 574,046 451,859 > > > > > > > > So, on 4k folio setup, even without cgroup charge, we are still seeing: > > > > > > > > 1. More zswpout (than observed in SSD) > > > > 2. 40-50% worse latency - in fact it is worse without zswap cgroup > > charging. > > > > 3. 100 times the amount of memcg_high events? This is perhaps the > > > > *strangest* to me. You're already removing zswap cgroup charging, > then > > > > where does this comes from? How can we have memory.high violation > > when > > > > zswap does *not* contribute to memory usage? > > > > > > > > Is this due to swap limit charging? Do you have a cgroup swap limit? > > > > > > > > mem_high = page_counter_read(&memcg->memory) > > > > > READ_ONCE(memcg->memory.high); > > > > swap_high = page_counter_read(&memcg->swap) > > > > > READ_ONCE(memcg->swap.high); > > > > [...] > > > > > > > > if (mem_high || swap_high) { > > > > /* > > > > * The allocating tasks in this cgroup will need to do > > > > * reclaim or be throttled to prevent further growth > > > > * of the memory or swap footprints. > > > > * > > > > * Target some best-effort fairness between the tasks, > > > > * and distribute reclaim work and delay penalties > > > > * based on how much each task is actually allocating. 
> > > > */ > > > > current->memcg_nr_pages_over_high += batch; > > > > set_notify_resume(current); > > > > break; > > > > } > > > > > > > > > > I don't have a swap.high limit set on the cgroup; it is set to "max". > > > > > > I ran experiments with v6.11-rc3, no zswap charging, 4K folios and > different > > > zswap compressors to verify if swap.high is breached with the 4G SSD > > swapfile. > > > > > > SSD (CONFIG_ZSWAP is OFF): > > > > > > SSD SSD SSD > > > ------------------------------------------------------------ > > > pswpout 415,709 1,032,170 636,582 > > > sys time (sec) 301.02 328.15 306.98 > > > Throughput KB/s 155,970 89,621 122,219 > > > memcg_high events 5,068 15,072 8,344 > > > memcg_swap_high events 0 0 0 > > > memcg_swap_fail events 0 0 0 > > > ------------------------------------------------------------ > > > > > > ZSWAP zstd zstd zstd > > > ---------------------------------------------------------------- > > > zswpout 1,391,524 1,382,965 1,417,307 > > > sys time (sec) 474.68 568.24 489.80 > > > Throughput KB/s 26,099 23,404 111,115 > > > memcg_high events 335,112 340,335 162,260 > > > memcg_swap_high events 0 0 0 > > > memcg_swap_fail events 1,226,899 5,742,153 > > > (mem_cgroup_try_charge_swap) > > > memcg_memory_stat_pgactivate 1,259,547 > > > (shrink_folio_list) > > > ---------------------------------------------------------------- > > > > > > ZSWAP lzo-rle lzo-rle lzo-rle > > > ----------------------------------------------------------- > > > zswpout 1,493,917 1,363,040 1,428,133 > > > sys time (sec) 635.75 498.63 484.65 > > > Throughput KB/s 21,407 23,827 20,237 > > > memcg_high events 375,116 352,814 373,667 > > > memcg_swap_high events 0 0 0 > > > memcg_swap_fail events 715,211 > > > ----------------------------------------------------------- > > > > > > ZSWAP lz4 lz4 lz4 lz4 > > > --------------------------------------------------------------------- > > > zswpout 1,378,781 1,598,550 1,515,151 1,449,432 > > > sys time (sec) 495.45 889.36 
481.21 581.22 > > > Throughput KB/s 26,248 35,176 14,765 20,253 > > > memcg_high events 347,209 321,923 412,733 369,976 > > > memcg_swap_high events 0 0 0 0 > > > memcg_swap_fail events 580,103 0 > > > --------------------------------------------------------------------- > > > > > > ZSWAP deflate-iaa deflate-iaa deflate-iaa > > > ---------------------------------------------------------------- > > > zswpout 380,471 1,440,902 1,397,965 > > > sys time (sec) 329.06 570.77 467.41 > > > Throughput KB/s 283,867 28,403 190,600 > > > memcg_high events 5,551 422,831 28,154 > > > memcg_swap_high events 0 0 0 > > > memcg_swap_fail events 0 2,686,758 438,562 > > > ---------------------------------------------------------------- > > > > Why are there 3 columns for each of the compressors? Is this different > > runs of the same workload? > > > > And why do some columns have missing cells? > > Yes, these are different runs of the same workload. Since there is some > amount of variance seen in the data, I figured it is best to publish the > metrics from the individual runs rather than averaging. > > Some of these runs were gathered earlier with the same code base, > however, I wasn't monitoring/logging the > memcg_swap_high/memcg_swap_fail > events at that time. For those runs, just these two counters have missing > column entries; the rest of the data is still valid. > > > > > > > > > There are no swap.high memcg events recorded in any of the SSD/zswap > > > experiments. However, I do see significant number of memcg_swap_fail > > > events in some of the zswap runs, for all 3 compressors. This is not > > > consistent, because there are some runs with 0 memcg_swap_fail for all > > > compressors. > > > > > > There is a possible co-relation between memcg_swap_fail events > > > (/sys/fs/cgroup/test/memory.swap.events) and the high # of > memcg_high > > > events. 
The root-cause appears to be that there are no available swap > > > slots, memcg_swap_fail is incremented, add_to_swap() fails in > > > shrink_folio_list(), followed by "activate_locked:" for the folio. > > > The folio re-activation is recorded in cgroup memory.stat pgactivate > > > events. The failure to swap out folios due to lack of swap slots could > > > contribute towards memory.high breaches. > > > > Yeah FWIW, that was gonna be my first suggestion. This swapfile size > > is wayyyy too small... > > > > But that said, the link is not clear to me at all. The only thing I > > can think of is lz4's performance sucks so bad that it's not saving > > enough memory, leading to regression. And since it's still taking up > > swap slot, we cannot use swap either? > > The occurrence of memcg_swap_fail events establishes that swap slots > are not available with 4G of swap space. This causes those 4K folios to > remain in memory, which can worsen an existing problem with memory.high > breaches. > > However, it is worth noting that this is not the only contributor to > memcg_high events that still occur without zswap charging. The data shows > 321,923 occurrences of memcg_high in Col 2 of the lz4 table, that also has > 0 occurrences of memcg_swap_fail reported in the cgroup stats. > > > > > > > > > However, this is probably not the only cause for either the high # of > > > memory.high breaches or the over-reclaim with zswap, as seen in the lz4 > > > data where the memory.high is significant even in cases where there are > no > > > memcg_swap_fails. > > > > > > Some observations/questions based on the above 4K folios swapout data: > > > > > > 1) There are more memcg_high events as the swapout latency reduces > > > (i.e. faster swap-write path). This is even without charging zswap > > > utilization to the cgroup. > > > > This is still inexplicable to me. If we are not charging zswap usage, > > we shouldn't even be triggering the reclaim_high() path, no? 
> >
> > I'm curious - can you use bpftrace to track where/when reclaim_high
> > is being called?

Hi Nhat,

Since reclaim_high() is called only in a handful of places, I figured I
would just use debugfs u64 counters to record where it gets called from.

These are the places where I increment the debugfs counters:

include/linux/resume_user_mode.h:
---------------------------------
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..382f5469e9a2 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -24,6 +24,7 @@ static inline void set_notify_resume(struct task_struct *task)
 		kick_process(task);
 }
+extern u64 hoh_userland;
 /**
  * resume_user_mode_work - Perform work before returning to user mode
@@ -56,6 +57,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
 	}
 #endif
+	++hoh_userland;
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();

mm/memcontrol.c:
----------------
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f29157288b7d..6738bb670a78 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1910,9 +1910,12 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 	return nr_reclaimed;
 }
+extern u64 rec_high_hwf;
+
 static void high_work_func(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
+	++rec_high_hwf;
 	memcg = container_of(work, struct mem_cgroup, high_work);
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
@@ -2055,6 +2058,8 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
+extern u64 rec_high_hoh;
+
 /*
  * Reclaims memory over the high limit. Called directly from
  * try_charge() (context permitting), as well as from the userland
@@ -2097,6 +2102,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	 * memory.high is currently batched, whereas memory.max and the page
 	 * allocator run every time an allocation is made.
 	 */
+	++rec_high_hoh;
 	nr_reclaimed = reclaim_high(memcg,
 				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
 				    gfp_mask);
@@ -2153,6 +2159,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	css_put(&memcg->css);
 }
+extern u64 hoh_trycharge;
+
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages)
 {
@@ -2344,8 +2352,10 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
 	    !(current->flags & PF_MEMALLOC) &&
-	    gfpflags_allow_blocking(gfp_mask))
+	    gfpflags_allow_blocking(gfp_mask)) {
+		++hoh_trycharge;
 		mem_cgroup_handle_over_high(gfp_mask);
+	}
 	return 0;
 }

I reverted my debug changes for "zswap to not charge cgroup" when I ran
this next set of experiments that record the # of times and locations
where reclaim_high() is called.

zstd is the compressor I have configured for both ZSWAP and ZRAM.

6.11-rc3 mainline, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
----------------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high             112,910

hoh_userland     128,835
hoh_trycharge          0
rec_high_hoh     113,079
rec_high_hwf           0

6.11-rc3 mainline, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
------------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high               4,693

hoh_userland      14,069
hoh_trycharge          0
rec_high_hoh       4,694
rec_high_hwf           0

ZSWAP-mTHP, 176Gi ZRAM backing for ZSWAP, zstd, 64K mTHP:
---------------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high             139,495

hoh_userland     156,628
hoh_trycharge          0
rec_high_hoh     140,039
rec_high_hwf           0

ZSWAP-mTHP, 4G SSD backing for ZSWAP, zstd, 64K mTHP:
-----------------------------------------------------
/sys/fs/cgroup/iax/memory.events:
high              20,427

/sys/fs/cgroup/iax/memory.swap.events:
fail              20,856

hoh_userland      31,346
hoh_trycharge          0
rec_high_hoh      20,513
rec_high_hwf           0

This shows that in all cases, reclaim_high() is called only from the return
path to user mode after
handling a page-fault. Thanks, Kanchana > > I had confirmed earlier with counters that all calls to reclaim_high() > were from include/linux/resume_user_mode.h::resume_user_mode_work(). > I will confirm this with zstd and bpftrace and share. > > Thanks, > Kanchana > > > > > > > > > 2) There appears to be a direct co-relation between higher # of > > > memcg_swap_fail events, and an increase in memcg_high breaches and > > > reduction in usemem throughput. This combined with the observation in > > > (1) suggests that with a faster compressor, we need more swap slots, > > > that increases the probability of running out of swap slots with the 4G > > > SSD backing device. > > > > > > 3) Could the data shared earlier on reduction in memcg_high breaches > with > > > 64K mTHP swapout provide some more clues, if we agree with (1) and > (2): > > > > > > "Interestingly, the # of memcg_high events reduces significantly with > 64K > > > mTHP as compared to the above 4K memcg_high events data, when > > tested > > > with v4 and no zswap charge: 3,069 (SSD-mTHP) and 19,656 (ZSWAP- > > mTHP)." > > > > > > 4) In the case of each zswap compressor, there are some runs that go > > > through with 0 memcg_swap_fail events. These runs generally have > better > > > fewer memcg_high breaches and better sys time/throughput. > > > > > > 5) For a given swap setup, there is some amount of variance in > > > sys time for this workload. > > > > > > 6) All this suggests that the primary root cause is the concurrency setup, > > > where there could be randomness between runs as to the # of processes > > > that observe the memory.high breach due to other factors such as > > > availability of swap slots for alloc. 
> > > > > > To summarize, I believe the root-cause is the 4G SSD swapfile resulting in > > > running out of swap slots, and anomalous behavior with over-reclaim > when > > 70 > > > concurrent processes are working with the 60G memory limit while trying > to > > > allocate 1G each; with randomness in processes reacting to the breach. > > > > > > The cgroup zswap charging exacerbates this situation, but is not a problem > > > in and of itself. > > > > > > Nhat, as you pointed out, this is somewhat of an unrealistic scenario that > > > doesn't seem to indicate any specific problems to be solved, other than > the > > > temporary cgroup zswap double-charging. > > > > > > Would it be fair to evaluate this patch-series based on a more realistic > > > swapfile configuration based on 176G ZRAM, for which I had shared the > > data > > > in v2? There weren't any problems with swap slots availability or any > > > anomalies that I can think of with this setup, other than the fact that the > > > "Before" and "After" sys times could not be directly compared for 2 key > > > reasons: > > > > > > - ZRAM compressed data is not charged to the cgroup, similar to SSD. > > > - ZSWAP compressed data is charged to the cgroup. > > > > Yeah that's a bit unfair still. Wild idea, but what about we compare > > SSD without zswap (or SSD with zswap, but without this patch series so > > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing > > swapfile on zram block device). > > > > It is stupid, I know. But let's take advantage of the fact that zram > > is not charged to cgroup, pretending that its memory foot print is > > empty? > > > > I don't know how zram works though, so my apologies if it's a stupid > > suggestion :)
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, August 27, 2024 11:43 AM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Tuesday, August 27, 2024 8:30 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > On Tue, Aug 27, 2024 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Mon, Aug 26, 2024 at 11:08 PM Sridhar, Kanchana P
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > Yeah that's a bit unfair still. Wild idea, but what about we compare
> > > SSD without zswap (or SSD with zswap, but without this patch series so
> > > that mTHP are not zswapped) v.s zswap-on-zram (i.e with a backing
> > > swapfile on zram block device).
> > >
> > > It is stupid, I know. But let's take advantage of the fact that zram
> > > is not charged to cgroup, pretending that its memory foot print is
> > > empty?
> > >
> > > I don't know how zram works though, so my apologies if it's a stupid
> > > suggestion :)
> >
> > Oh nvm, looks like that's what you're already doing.
> >
> > That said, the lz4 column is soooo bad still, whereas the deflate-iaa
> > clearly shows improvement! This means it could be
> > compressor-dependent.
> >
> > Can you try it with zstd?
>
> Sure, I will gather data with zstd.

I will be sending out a v5 shortly with data gathered with zstd.

Thanks,
Kanchana

>
> Thanks,
> Kanchana
[..]
>
> This shows that in all cases, reclaim_high() is called only from the return
> path to user mode after handling a page-fault.

I am sorry I haven't been keeping up with this thread, I don't have a
lot of capacity right now.

If my understanding is correct, the summary of the problem we are
observing here is that with high concurrency (70 processes), we
observe worse system time, worse throughput, and higher memory_high
events with zswap than SSD swap. This is true (with varying degrees)
for 4K or mTHP, and with or without charging zswap compressed memory.

Did I get that right?

I saw you also mentioned that reclaim latency is directly correlated
to higher memory_high events.

Is it possible that with SSD swap, because we wait for IO during
reclaim, this gives a chance for other processes to allocate and free
the memory they need. While with zswap because everything is
synchronous, all processes are trying to allocate their memory at the
same time resulting in higher reclaim rates?

IOW, maybe with zswap all the processes try to allocate their memory
at the same time, so the total amount of memory needed at any given
instance is much higher than memory.high, so we keep producing
memory_high events and reclaiming. If 70 processes all require 1G at
the same time, then we need 70G of memory at once, we will keep
thrashing pages in/out of zswap.

While with SSD swap, due to the waits imposed by IO, the allocations
are more spread out and more serialized, and the amount of memory
needed at any given instance is lower; resulting in less reclaim
activity and ultimately faster overall execution?

Could you please describe what the processes are doing? Are they
allocating memory and holding on to it, or immediately freeing it?

Do you have visibility into when each process allocates and frees memory?
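[Editorial sketch] The "70 processes all require 1G at the same time" scenario can be put into numbers with a small back-of-the-envelope model. This is only an illustration under stated assumptions, not something measured in this thread: it assumes all 70G of demand is live at once, treats the 60G memory.high value as a hard budget, and assumes compressed zswap data is charged to the cgroup at exactly 1/ratio of its original size. The function names are hypothetical; the ratios used in the usage note (1.99 for lz4, 7.25 for zstd) are the ones reported earlier in the thread.

```c
#include <assert.h>

/*
 * Assumption: swap is NOT charged to the cgroup (SSD or ZRAM backing).
 * Minimum amount that must be swapped out is simply demand minus limit.
 */
static double min_swapped_uncharged(double demand_g, double limit_g)
{
	return demand_g > limit_g ? demand_g - limit_g : 0.0;
}

/*
 * Assumption: compressed data IS charged, at 1/ratio of its original size.
 * With S swapped out: (demand - S) + S / ratio <= limit, which gives
 * S >= ratio * (demand - limit) / (ratio - 1).
 */
static double min_swapped_charged(double demand_g, double limit_g,
				  double ratio)
{
	assert(ratio > 1.0);	/* a ratio <= 1 can never get under the limit */
	if (demand_g <= limit_g)
		return 0.0;
	return ratio * (demand_g - limit_g) / (ratio - 1.0);
}
```

Under these assumptions, 70G of demand against a 60G limit needs at least 10G swapped out when swap is not charged, about 11.6G with a 7.25 ratio, and about 20.1G with a 1.99 ratio, which is one way to see why a poorly compressing setup has to reclaim much more when zswap is charged.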
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 12:44 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> [..]
> >
> > This shows that in all cases, reclaim_high() is called only from the return
> > path to user mode after handling a page-fault.
>
> I am sorry I haven't been keeping up with this thread, I don't have a
> lot of capacity right now.
>
> If my understanding is correct, the summary of the problem we are
> observing here is that with high concurrency (70 processes), we
> observe worse system time, worse throughput, and higher memory_high
> events with zswap than SSD swap. This is true (with varying degrees)
> for 4K or mTHP, and with or without charging zswap compressed memory.
>
> Did I get that right?

Thanks for your review and comments! Yes, this is correct.

>
> I saw you also mentioned that reclaim latency is directly correlated
> to higher memory_high events.

That was my observation based on the swap-constrained experiments with 4G SSD.
With a faster compressor, we allow allocations to proceed quickly, and if the pages
are not being faulted in, we need more swap slots. This increases the probability of
running out of swap slots with the 4G SSD backing device, which, as the data in v4
shows, causes memcg_swap_fail events, that drive folios to be resident in memory
(triggering memcg_high breaches as allocations proceed even without zswap cgroup
charging).

Things change when the experiments are run in a situation where there is abundant
swap space and when the default behavior of zswap compressed data being charged
to the cgroup is enabled, as in the data with 176GiB ZRAM as ZSWAP's backing
swapfile posted in v5. Now, the critical path to workload performance changes to
concurrent reclaims in response to memcg_high events due to allocation and zswap
usage. We see a lesser increase in swapout activity (as compared to the swap-constrained
experiments in v4), and compress latency seems to become the bottleneck. Each
individual process's throughput/sys time degrades mainly as a function of compress
latency. Anyway, these were some of my learnings from these experiments. Please
do let me know if there are other insights/analysis I could be missing.

>
> Is it possible that with SSD swap, because we wait for IO during
> reclaim, this gives a chance for other processes to allocate and free
> the memory they need. While with zswap because everything is
> synchronous, all processes are trying to allocate their memory at the
> same time resulting in higher reclaim rates?
>
> IOW, maybe with zswap all the processes try to allocate their memory
> at the same time, so the total amount of memory needed at any given
> instance is much higher than memory.high, so we keep producing
> memory_high events and reclaiming. If 70 processes all require 1G at
> the same time, then we need 70G of memory at once, we will keep
> thrashing pages in/out of zswap.
>
> While with SSD swap, due to the waits imposed by IO, the allocations
> are more spread out and more serialized, and the amount of memory
> needed at any given instance is lower; resulting in less reclaim
> activity and ultimately faster overall execution?

This is a very interesting hypothesis, that is along the lines of the
"slower compressor" essentially causing allocation stalls (and buffering us from
the swap slots unavailability effect) observation I gathered from the 4G SSD
experiments. I think this is a possibility.

>
> Could you please describe what the processes are doing? Are they
> allocating memory and holding on to it, or immediately freeing it?

I have been using the vm-scalability usemem workload for these experiments.
Thanks Ying for suggesting I use this workload!

I am running usemem with these config options: usemem --init-time -w -O -n 70 1g.
This forks 70 processes, each of which does the following:

1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write permissions.
2) Steps through and accesses each 8 bytes chunk of memory in the mmap-ed region, and:
    2.a) Writes the index of that chunk to the (unsigned long *) memory at that index.
3) Generates statistics on throughput.

There is an "munmap()" after step (2.a) that I have commented out because I wanted to
see how much cold memory resides in the zswap zpool after the workload exits. Interestingly,
this was 0 for 64K mTHP, but of the order of several hundreds of MB for 2M THP.

>
> Do you have visibility into when each process allocates and frees memory?

Yes. Hopefully the above offers some clarifications.

Thanks,
Kanchana
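[Editorial sketch] The per-process steps described above can be approximated in user space as follows. This is a minimal sketch, not the actual vm-scalability usemem source: the helper names fill_anon_region() and run_workers() are hypothetical, throughput statistics (step 3) are omitted, and it assumes an LP64 system where unsigned long is 8 bytes. As in the modified workload described above, the region is never munmap()-ed; it is released when the process exits.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Steps (1) and (2.a): map `bytes` of anonymous read/write memory and
 * write each chunk's index into that chunk. Returns NULL if mmap() fails.
 */
static unsigned long *fill_anon_region(size_t bytes)
{
	size_t nr_chunks = bytes / sizeof(unsigned long);
	unsigned long *buf;

	assert(bytes > 0 && bytes % sizeof(unsigned long) == 0);

	buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	for (size_t i = 0; i < nr_chunks; i++)
		buf[i] = i;	/* first touch faults the page in */

	return buf;
}

/*
 * Fork `nr_procs` children that each fill their own region, then wait
 * for all of them. Returns the number of failed forks or children.
 */
static int run_workers(int nr_procs, size_t bytes)
{
	int spawned = 0, failed = 0, status;

	for (int i = 0; i < nr_procs; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			failed++;
			break;
		}
		if (pid == 0)
			_exit(fill_anon_region(bytes) ? 0 : 1);
		spawned++;
	}

	while (spawned-- > 0) {
		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failed++;
	}

	return failed;
}
```

Calling run_workers(70, 1UL << 30) inside a memory-limited cgroup would roughly mirror the "-n 70 1g" invocation; smaller values are enough to check the logic.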
On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Yosry,
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Wednesday, August 28, 2024 12:44 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> > linux-mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >
> > [..]
> > >
> > > This shows that in all cases, reclaim_high() is called only from the
> > > return path to user mode after handling a page-fault.
> >
> > I am sorry I haven't been keeping up with this thread, I don't have a
> > lot of capacity right now.
> >
> > If my understanding is correct, the summary of the problem we are
> > observing here is that with high concurrency (70 processes), we
> > observe worse system time, worse throughput, and higher memory_high
> > events with zswap than SSD swap. This is true (to varying degrees)
> > for 4K or mTHP, and with or without charging zswap compressed memory.
> >
> > Did I get that right?
>
> Thanks for your review and comments! Yes, this is correct.
>
> >
> > I saw you also mentioned that reclaim latency is directly correlated
> > to higher memory_high events.
>
> That was my observation based on the swap-constrained experiments with
> 4G SSD. With a faster compressor, we allow allocations to proceed
> quickly, and if the pages are not being faulted in, we need more swap
> slots. This increases the probability of running out of swap slots with
> the 4G SSD backing device, which, as the data in v4 shows, causes
> memcg_swap_fail events that drive folios to be resident in memory
> (triggering memcg_high breaches as allocations proceed even without
> zswap cgroup charging).
>
> Things change when the experiments are run in a situation where there
> is abundant swap space and when the default behavior of zswap compressed
> data being charged to the cgroup is enabled, as in the data with 176GiB
> ZRAM as ZSWAP's backing swapfile posted in v5. Now, the critical path to
> workload performance changes to concurrent reclaims in response to
> memcg_high events due to allocation and zswap usage. We see a smaller
> increase in swapout activity (as compared to the swap-constrained
> experiments in v4), and compress latency seems to become the bottleneck.
> Each individual process's throughput/sys time degrades mainly as a
> function of compress latency. Anyway, these were some of my learnings
> from these experiments. Please do let me know if there are other
> insights/analysis I could be missing.
>
> >
> > Is it possible that with SSD swap, because we wait for IO during
> > reclaim, this gives a chance for other processes to allocate and free
> > the memory they need, while with zswap, because everything is
> > synchronous, all processes are trying to allocate their memory at the
> > same time, resulting in higher reclaim rates?
> >
> > IOW, maybe with zswap all the processes try to allocate their memory
> > at the same time, so the total amount of memory needed at any given
> > instant is much higher than memory.high, so we keep producing
> > memory_high events and reclaiming. If 70 processes all require 1G at
> > the same time, then we need 70G of memory at once, and we will keep
> > thrashing pages in/out of zswap.
> >
> > While with SSD swap, due to the waits imposed by IO, the allocations
> > are more spread out and more serialized, and the amount of memory
> > needed at any given instant is lower, resulting in less reclaim
> > activity and ultimately faster overall execution?
>
> This is a very interesting hypothesis. It is along the lines of the
> observation I gathered from the 4G SSD experiments: the "slower
> compressor" essentially causes allocation stalls (and buffers us from
> the effect of swap slots being unavailable). I think this is a
> possibility.
>
> >
> > Could you please describe what the processes are doing? Are they
> > allocating memory and holding on to it, or immediately freeing it?
>
> I have been using the vm-scalability usemem workload for these
> experiments. Thanks Ying for suggesting I use this workload!
>
> I am running usemem with these config options:
>
>     usemem --init-time -w -O -n 70 1g
>
> This forks 70 processes, each of which does the following:
>
> 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write
>    permissions.
> 2) Steps through and accesses each 8-byte chunk of memory in the
>    mmap-ed region, and:
>    2.a) Writes the index of that chunk to the (unsigned long *) memory
>         at that index.
> 3) Generates statistics on throughput.
>
> There is an "munmap()" after step (2.a) that I have commented out
> because I wanted to see how much cold memory resides in the zswap zpool
> after the workload exits. Interestingly, this was 0 for 64K mTHP, but
> on the order of several hundred MB for 2M THP.

Does the process exit immediately after step (3)? The memory will be
unmapped and freed once the process exits anyway, so removing an unmap
that immediately precedes the process exiting should have no effect.

I wonder how this changes if the processes sleep and keep the memory
mapped for a while, to force the situation where all the memory is
needed at the same time on SSD as well as zswap. This could make the
playing field more even and force the same thrashing to happen on SSD
for a fairer comparison.

It's not a fix; if very fast reclaim with zswap ends up causing more
problems, perhaps we need to tweak the throttling of memory.high or
something.
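For readers following the thread, the per-process behavior described in
steps (1) and (2) above, together with the suggested "sleep while keeping
the memory mapped" variation, can be sketched roughly as below. This is
not the vm-scalability usemem source: touch_region(), run_workers() and
the hold_seconds parameter are hypothetical names introduced only for
illustration, and the throughput statistics of step (3) are left out.

```c
#define _DEFAULT_SOURCE          /* MAP_ANONYMOUS on glibc with strict -std */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for one worker: map `bytes` of anonymous memory,
 * write each word's index into that word, optionally keep the mapping
 * alive for `hold_seconds`, then read back the last word.
 * Returns the value read back, or -1 on failure.
 */
static long touch_region(size_t bytes, unsigned int hold_seconds)
{
    size_t nr_words = bytes / sizeof(unsigned long);
    unsigned long *buf;
    long last;

    if (nr_words == 0)
        return -1;

    /* Step 1: anonymous, private, read/write mapping. */
    buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Steps 2 and 2.a: touch every 8-byte chunk by storing its index. */
    for (size_t i = 0; i < nr_words; i++)
        buf[i] = i;

    /* Suggested variation: hold the memory so demand overlaps in time. */
    if (hold_seconds)
        sleep(hold_seconds);

    last = (long)buf[nr_words - 1];

    /*
     * The run discussed in the thread has its munmap() commented out; the
     * sketch unmaps so that the function can be called more than once.
     */
    munmap(buf, bytes);
    return last;
}

/* Fork nr_procs workers and wait for all; returns the number that failed. */
static int run_workers(int nr_procs, size_t bytes, unsigned int hold_seconds)
{
    int started = 0, failed = 0;

    assert(nr_procs > 0);

    for (int i = 0; i < nr_procs; i++) {
        pid_t pid = fork();

        if (pid < 0) {
            failed++;
            continue;
        }
        if (pid == 0)
            _exit(touch_region(bytes, hold_seconds) < 0 ? 1 : 0);
        started++;
    }

    for (int i = 0; i < started; i++) {
        int status;

        if (wait(&status) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            failed++;
    }
    return failed;
}
```

Under these assumptions, run_workers(70, (size_t)1 << 30, 0) would
approximate the "-n 70 1g" run, and a non-zero hold_seconds would
approximate the case where all workers keep their memory mapped at the
same time. The cgroup and memory.high setup is outside the sketch.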
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 3:34 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 11:50 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Wednesday, August 28, 2024 12:44 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> > > linux-mm@kvack.org; hannes@cmpxchg.org; ryan.roberts@arm.com;
> > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> > > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> > >
> > > [..]
> > > >
> > > > This shows that in all cases, reclaim_high() is called only from
> > > > the return path to user mode after handling a page-fault.
> > >
> > > I am sorry I haven't been keeping up with this thread, I don't have a
> > > lot of capacity right now.
> > >
> > > If my understanding is correct, the summary of the problem we are
> > > observing here is that with high concurrency (70 processes), we
> > > observe worse system time, worse throughput, and higher memory_high
> > > events with zswap than SSD swap. This is true (to varying degrees)
> > > for 4K or mTHP, and with or without charging zswap compressed memory.
> > >
> > > Did I get that right?
> >
> > Thanks for your review and comments! Yes, this is correct.
> >
> > >
> > > I saw you also mentioned that reclaim latency is directly correlated
> > > to higher memory_high events.
> >
> > That was my observation based on the swap-constrained experiments with
> > 4G SSD. With a faster compressor, we allow allocations to proceed
> > quickly, and if the pages are not being faulted in, we need more swap
> > slots. This increases the probability of running out of swap slots
> > with the 4G SSD backing device, which, as the data in v4 shows, causes
> > memcg_swap_fail events that drive folios to be resident in memory
> > (triggering memcg_high breaches as allocations proceed even without
> > zswap cgroup charging).
> >
> > Things change when the experiments are run in a situation where there
> > is abundant swap space and when the default behavior of zswap
> > compressed data being charged to the cgroup is enabled, as in the data
> > with 176GiB ZRAM as ZSWAP's backing swapfile posted in v5. Now, the
> > critical path to workload performance changes to concurrent reclaims
> > in response to memcg_high events due to allocation and zswap usage. We
> > see a smaller increase in swapout activity (as compared to the
> > swap-constrained experiments in v4), and compress latency seems to
> > become the bottleneck. Each individual process's throughput/sys time
> > degrades mainly as a function of compress latency. Anyway, these were
> > some of my learnings from these experiments. Please do let me know if
> > there are other insights/analysis I could be missing.
> >
> > >
> > > Is it possible that with SSD swap, because we wait for IO during
> > > reclaim, this gives a chance for other processes to allocate and free
> > > the memory they need, while with zswap, because everything is
> > > synchronous, all processes are trying to allocate their memory at the
> > > same time, resulting in higher reclaim rates?
> > >
> > > IOW, maybe with zswap all the processes try to allocate their memory
> > > at the same time, so the total amount of memory needed at any given
> > > instant is much higher than memory.high, so we keep producing
> > > memory_high events and reclaiming. If 70 processes all require 1G at
> > > the same time, then we need 70G of memory at once, and we will keep
> > > thrashing pages in/out of zswap.
> > >
> > > While with SSD swap, due to the waits imposed by IO, the allocations
> > > are more spread out and more serialized, and the amount of memory
> > > needed at any given instant is lower, resulting in less reclaim
> > > activity and ultimately faster overall execution?
> >
> > This is a very interesting hypothesis. It is along the lines of the
> > observation I gathered from the 4G SSD experiments: the "slower
> > compressor" essentially causes allocation stalls (and buffers us from
> > the effect of swap slots being unavailable). I think this is a
> > possibility.
> >
> > >
> > > Could you please describe what the processes are doing? Are they
> > > allocating memory and holding on to it, or immediately freeing it?
> >
> > I have been using the vm-scalability usemem workload for these
> > experiments. Thanks Ying for suggesting I use this workload!
> >
> > I am running usemem with these config options:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > This forks 70 processes, each of which does the following:
> >
> > 1) Allocates 1G mmap virtual memory with MAP_ANONYMOUS, read/write
> >    permissions.
> > 2) Steps through and accesses each 8-byte chunk of memory in the
> >    mmap-ed region, and:
> >    2.a) Writes the index of that chunk to the (unsigned long *) memory
> >         at that index.
> > 3) Generates statistics on throughput.
> >
> > There is an "munmap()" after step (2.a) that I have commented out
> > because I wanted to see how much cold memory resides in the zswap
> > zpool after the workload exits. Interestingly, this was 0 for 64K
> > mTHP, but on the order of several hundred MB for 2M THP.
>
> Does the process exit immediately after step (3)? The memory will be
> unmapped and freed once the process exits anyway, so removing an unmap
> that immediately precedes the process exiting should have no effect.

Yes, you're right.

>
> I wonder how this changes if the processes sleep and keep the memory
> mapped for a while, to force the situation where all the memory is
> needed at the same time on SSD as well as zswap. This could make the
> playing field more even and force the same thrashing to happen on SSD
> for a fairer comparison.

Good point. I believe I saw an option in usemem that could facilitate
this. I will investigate.

>
> It's not a fix; if very fast reclaim with zswap ends up causing more
> problems, perhaps we need to tweak the throttling of memory.high or
> something.

Sure, that is a possibility. That said, proactive reclaim might mitigate
this, in which case very fast reclaim with zswap might help.

Thanks,
Kanchana
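As an illustration of the "proactive reclaim" idea mentioned above: one
existing userspace mechanism is the cgroup v2 memory.reclaim file, where
writing a byte count asks the kernel to reclaim that much from the
cgroup. Whether this is the mechanism meant in the message is an
assumption. The helper name request_proactive_reclaim() and any cgroup
path passed to it are hypothetical, and the sketch only shows the shape
of such a request.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask for `bytes` to be reclaimed from one cgroup by writing the byte
 * count to "<cgroup_dir>/memory.reclaim".
 *
 * cgroup_dir is a caller-supplied path (for example a test cgroup under
 * /sys/fs/cgroup; the exact path is an assumption of this sketch).
 * Returns 0 if the write was accepted, -1 otherwise. The kernel may fail
 * the write when it reclaims less than requested, so -1 does not
 * necessarily mean that nothing was reclaimed.
 */
static int request_proactive_reclaim(const char *cgroup_dir,
                                     unsigned long long bytes)
{
    char path[4096];
    char value[32];
    int path_len, value_len, fd;
    ssize_t written;

    assert(cgroup_dir != NULL);

    path_len = snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup_dir);
    if (path_len < 0 || (size_t)path_len >= sizeof(path))
        return -1;

    value_len = snprintf(value, sizeof(value), "%llu", bytes);
    if (value_len < 0 || (size_t)value_len >= sizeof(value))
        return -1;

    /* No O_CREAT: the file must already be provided by the cgroup. */
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    written = write(fd, value, (size_t)value_len);
    close(fd);

    return written == (ssize_t)value_len ? 0 : -1;
}
```

A monitoring loop could call this periodically with a small byte count
to keep usage below memory.high ahead of the allocation bursts, rather
than having every allocating task hit the limit and reclaim at once. How
well that works for this workload would have to be measured.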