diff mbox series

[RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg

Message ID 20241027001444.3233-1-21cnbao@gmail.com (mailing list archive)
State New
Headers show
Series [RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg | expand

Commit Message

Barry Song Oct. 27, 2024, 12:14 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

In a memcg where mTHP is always utilized, even at full capacity, it
might not be the best option. Consider a system that uses only small
folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
of buffer space before it can initiate the next reclamation. However,
large folios can quickly fill this space, rapidly bringing the memcg
back to full capacity, even though some portions of the large folios
may not be immediately needed and used by the process.

Usama and Kanchana identified a regression when building the kernel in
a memcg with memory.max set to a small value while enabling large
folio swap-in support on zswap[1].

The issue arises from an edge case where the memory cgroup remains
nearly full most of the time. Consequently, bringing in mTHP can
quickly cause a memcg overflow, triggering a swap-out. The subsequent
swap-in then recreates the overflow, resulting in a repetitive cycle.

We need a mechanism to stop the cup from overflowing continuously.
One potential solution is to slow the filling process when we identify
that the cup is nearly full.

Usama reported an improvement when we mitigate mTHP swap-in as the
memcg approaches full capacity[2]:

int mem_cgroup_swapin_charge_folio(...)
{
	...
	if (folio_test_large(folio) &&
	    mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
		ret = -ENOMEM;
	else
		ret = charge_memcg(folio, memcg, gfp);
	...
}

AMD 16K+32K THP=always
metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
real           1m23.038s        1m23.050s                                   1m22.704s
user           53m57.210s       53m53.437s                                  53m52.577s
sys            7m24.592s        7m48.843s                                   7m22.519s
zswpin         612070           999244                                      815934
zswpout        2226403          2347979                                     2054980
pgfault        20667366         20481728                                    20478690
pgmajfault     385887           269117                                      309702

AMD 16K+32K+64K THP=always
metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
real           1m22.975s        1m23.266s                                  1m22.549s
user           53m51.302s       53m51.069s                                 53m46.471s
sys            7m40.168s        7m57.104s                                  7m25.012s
zswpin         676492           1258573                                    1225703
zswpout        2449839          2714767                                    2899178
pgfault        17540746         17296555                                   17234663
pgmajfault     429629           307495                                     287859

I wonder if we can extend the mitigation to do_anonymous_page() as
well. Without hardware like AMD and ARM with hardware TLB coalescing
or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
builds in a memcg with memory.max set to 1 GiB.

$ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
$ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
$ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

$ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null

            disable-mTHP-swapin  mm-unstable  with-this-patch
Real:            6m54.595s      7m4.832s       6m45.811s
User:            66m42.795s     66m59.984s     67m21.150s
Sys:             12m7.092s      15m18.153s     12m52.644s
pswpin:          4262327        11723248       5918690
pswpout:         14883774       19574347       14026942
64k-swpout:      624447         889384         480039
32k-swpout:      115473         242288         73874
16k-swpout:      158203         294672         109142
64k-swpin:       0              495869         159061
32k-swpin:       0              219977         56158
16k-swpin:       0              223501         81445

I need Usama's assistance to identify a suitable patch, as I lack
access to hardware such as AMD machines and ARM servers with TLB
optimization.

[1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
[2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/

Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/memcontrol.h |  9 ++++++++
 mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c                | 17 ++++++++++++++
 3 files changed, 71 insertions(+)

Comments

Usama Arif Oct. 28, 2024, 12:07 p.m. UTC | #1
On 27/10/2024 01:14, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> In a memcg where mTHP is always utilized, even at full capacity, it
> might not be the best option. Consider a system that uses only small
> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
> of buffer space before it can initiate the next reclamation. However,
> large folios can quickly fill this space, rapidly bringing the memcg
> back to full capacity, even though some portions of the large folios
> may not be immediately needed and used by the process.
> 
> Usama and Kanchana identified a regression when building the kernel in
> a memcg with memory.max set to a small value while enabling large
> folio swap-in support on zswap[1].
> 
> The issue arises from an edge case where the memory cgroup remains
> nearly full most of the time. Consequently, bringing in mTHP can
> quickly cause a memcg overflow, triggering a swap-out. The subsequent
> swap-in then recreates the overflow, resulting in a repetitive cycle.
> 
> We need a mechanism to stop the cup from overflowing continuously.
> One potential solution is to slow the filling process when we identify
> that the cup is nearly full.
> 
> Usama reported an improvement when we mitigate mTHP swap-in as the
> memcg approaches full capacity[2]:
> 
> int mem_cgroup_swapin_charge_folio(...)
> {
> 	...
> 	if (folio_test_large(folio) &&
> 	    mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> 		ret = -ENOMEM;
> 	else
> 		ret = charge_memcg(folio, memcg, gfp);
> 	...
> }
> 
> AMD 16K+32K THP=always
> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
> real           1m23.038s        1m23.050s                                   1m22.704s
> user           53m57.210s       53m53.437s                                  53m52.577s
> sys            7m24.592s        7m48.843s                                   7m22.519s
> zswpin         612070           999244                                      815934
> zswpout        2226403          2347979                                     2054980
> pgfault        20667366         20481728                                    20478690
> pgmajfault     385887           269117                                      309702
> 
> AMD 16K+32K+64K THP=always
> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> real           1m22.975s        1m23.266s                                  1m22.549s
> user           53m51.302s       53m51.069s                                 53m46.471s
> sys            7m40.168s        7m57.104s                                  7m25.012s
> zswpin         676492           1258573                                    1225703
> zswpout        2449839          2714767                                    2899178
> pgfault        17540746         17296555                                   17234663
> pgmajfault     429629           307495                                     287859
> 
> I wonder if we can extend the mitigation to do_anonymous_page() as
> well. Without hardware like AMD and ARM with hardware TLB coalescing
> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
> builds in a memcg with memory.max set to 1 GiB.
> 
> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> 
> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
> 
>             disable-mTHP-swapin  mm-unstable  with-this-patch
> Real:            6m54.595s      7m4.832s       6m45.811s
> User:            66m42.795s     66m59.984s     67m21.150s
> Sys:             12m7.092s      15m18.153s     12m52.644s
> pswpin:          4262327        11723248       5918690
> pswpout:         14883774       19574347       14026942
> 64k-swpout:      624447         889384         480039
> 32k-swpout:      115473         242288         73874
> 16k-swpout:      158203         294672         109142
> 64k-swpin:       0              495869         159061
> 32k-swpin:       0              219977         56158
> 16k-swpin:       0              223501         81445
> 

hmm, both the user and sys time are worse with the patch compared to
disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
repeat the experiment the real time might be worse as well? 

> I need Usama's assistance to identify a suitable patch, as I lack
> access to hardware such as AMD machines and ARM servers with TLB
> optimization.
> 
> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
> 
> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Chris Li <chrisl@kernel.org>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Kairui Song <kasong@tencent.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: Muchun Song <muchun.song@linux.dev>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/memcontrol.h |  9 ++++++++
>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
>  mm/memory.c                | 17 ++++++++++++++
>  3 files changed, 71 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 524006313b0d..8bcc8f4af39f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>  		long nr_pages);
>  
> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> +				swp_entry_t *entry);
> +
>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>  				  gfp_t gfp, swp_entry_t entry);
>  
> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
>  	return 0;
>  }
>  
> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> +		swp_entry_t *entry)
> +{
> +	return 0;
> +}
> +
>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>  			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 17af08367c68..f3d92b93ea6d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>  	return 0;
>  }
>  
> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
> +{
> +	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> +		if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)

There might be 3 issues with the approach:

Its a very big margin, lets say you have ARM64_64K_PAGES, and you have
256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are
basically saying you need 512M free memory to swapin just 256K?

Its an uneven margin for different folio sizes.
For 16K folio swapin, you are checking if there is margin for 128 folios,
but for 1M folio swapin, you are checking there is margin for just 2 folios.

Maybe it might be better to make this dependent on some factor of folio_nr_pages?

As Johannes pointed out, the charging code already does the margin check.
So for 4K, the check just checks if there is 4K available, but for 16K it checks
if a lot more than 16K is available. Maybe there should be a similar policy for
all? I guess this is similar to my 2nd point, but just considers 4K folios as
well.

Thanks,
Usama


> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +/**
> + * mem_cgroup_swapin_precharge_large_folio: Precharge large folios.
> + *
> + * @mm: mm context of the victim
> + * @entry: swap entry for which the folio will be allocated
> + *
> + * If we are arriving the edge of an almost full memcg, return error so that
> + * swap-in and anon faults can quickly fall back to small folios to avoid swap
> + * thrashing.
> + *
> + * Returns 0 on success, an error code on failure.
> + */
> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, swp_entry_t *entry)
> +{
> +	struct mem_cgroup *memcg = NULL;
> +	unsigned short id;
> +	bool has_margin;
> +
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	rcu_read_lock();
> +	if (entry) {
> +		id = lookup_swap_cgroup_id(*entry);
> +		memcg = mem_cgroup_from_id(id);
> +	}
> +	if (!memcg || !css_tryget_online(&memcg->css))
> +		memcg = get_mem_cgroup_from_mm(mm);
> +	has_margin = mem_cgroup_has_margin(memcg);
> +	rcu_read_unlock();
> +
> +	css_put(&memcg->css);
> +	return has_margin ? 0 : -ENOMEM;
> +}
> +
>  /**
>   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
>   * @folio: folio to charge.
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f614523b9f4..96368ba0e8a6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4168,6 +4168,16 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  
>  	pte_unmap_unlock(pte, ptl);
>  
> +	if (!orders)
> +		goto fallback;
> +
> +	/*
> +	 * Avoid swapping in large folios when memcg is nearly full, as it
> +	 * may quickly trigger additional swap-out and swap-in cycles.
> +	 */
> +	if (mem_cgroup_precharge_large_folio(vma->vm_mm, &entry))
> +		goto fallback;
> +
>  	/* Try allocating the highest of the remaining orders. */
>  	gfp = vma_thp_gfp_mask(vma);
>  	while (orders) {
> @@ -4707,6 +4717,13 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  	if (!orders)
>  		goto fallback;
>  
> +	/*
> +	 * When memcg is nearly full, large folios can rapidly fill
> +	 * the margin and trigger new reclamation
> +	 */
> +	if (mem_cgroup_precharge_large_folio(vma->vm_mm, NULL))
> +		goto fallback;
> +
>  	/* Try allocating the highest of the remaining orders. */
>  	gfp = vma_thp_gfp_mask(vma);
>  	while (orders) {
Barry Song Oct. 28, 2024, 10:03 p.m. UTC | #2
On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 27/10/2024 01:14, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > In a memcg where mTHP is always utilized, even at full capacity, it
> > might not be the best option. Consider a system that uses only small
> > folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
> > of buffer space before it can initiate the next reclamation. However,
> > large folios can quickly fill this space, rapidly bringing the memcg
> > back to full capacity, even though some portions of the large folios
> > may not be immediately needed and used by the process.
> >
> > Usama and Kanchana identified a regression when building the kernel in
> > a memcg with memory.max set to a small value while enabling large
> > folio swap-in support on zswap[1].
> >
> > The issue arises from an edge case where the memory cgroup remains
> > nearly full most of the time. Consequently, bringing in mTHP can
> > quickly cause a memcg overflow, triggering a swap-out. The subsequent
> > swap-in then recreates the overflow, resulting in a repetitive cycle.
> >
> > We need a mechanism to stop the cup from overflowing continuously.
> > One potential solution is to slow the filling process when we identify
> > that the cup is nearly full.
> >
> > Usama reported an improvement when we mitigate mTHP swap-in as the
> > memcg approaches full capacity[2]:
> >
> > int mem_cgroup_swapin_charge_folio(...)
> > {
> >       ...
> >       if (folio_test_large(folio) &&
> >           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> >               ret = -ENOMEM;
> >       else
> >               ret = charge_memcg(folio, memcg, gfp);
> >       ...
> > }
> >
> > AMD 16K+32K THP=always
> > metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
> > real           1m23.038s        1m23.050s                                   1m22.704s
> > user           53m57.210s       53m53.437s                                  53m52.577s
> > sys            7m24.592s        7m48.843s                                   7m22.519s
> > zswpin         612070           999244                                      815934
> > zswpout        2226403          2347979                                     2054980
> > pgfault        20667366         20481728                                    20478690
> > pgmajfault     385887           269117                                      309702
> >
> > AMD 16K+32K+64K THP=always
> > metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> > real           1m22.975s        1m23.266s                                  1m22.549s
> > user           53m51.302s       53m51.069s                                 53m46.471s
> > sys            7m40.168s        7m57.104s                                  7m25.012s
> > zswpin         676492           1258573                                    1225703
> > zswpout        2449839          2714767                                    2899178
> > pgfault        17540746         17296555                                   17234663
> > pgmajfault     429629           307495                                     287859
> >
> > I wonder if we can extend the mitigation to do_anonymous_page() as
> > well. Without hardware like AMD and ARM with hardware TLB coalescing
> > or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
> > 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
> > builds in a memcg with memory.max set to 1 GiB.
> >
> > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> > $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >
> > $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> > CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
> >
> >             disable-mTHP-swapin  mm-unstable  with-this-patch
> > Real:            6m54.595s      7m4.832s       6m45.811s
> > User:            66m42.795s     66m59.984s     67m21.150s
> > Sys:             12m7.092s      15m18.153s     12m52.644s
> > pswpin:          4262327        11723248       5918690
> > pswpout:         14883774       19574347       14026942
> > 64k-swpout:      624447         889384         480039
> > 32k-swpout:      115473         242288         73874
> > 16k-swpout:      158203         294672         109142
> > 64k-swpin:       0              495869         159061
> > 32k-swpin:       0              219977         56158
> > 16k-swpin:       0              223501         81445
> >
>

Hi Usama,

> hmm, both the user and sys time are worse with the patch compared to
> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
> repeat the experiment the real time might be worse as well?

Well, I've improved my script to include a loop:

echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

for ((i=1; i<=100; i++))
do
  echo "Executing round $i"
  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
  echo 3 > /proc/sys/vm/drop_caches
  time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
  cat /proc/vmstat | grep pswp
  echo -n 64k-swpout: ; cat
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
  echo -n 32k-swpout: ; cat
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
  echo -n 16k-swpout: ; cat
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
  echo -n 64k-swpin: ; cat
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
  echo -n 32k-swpin: ; cat
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
  echo -n 16k-swpin: ; cat
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
done

I've noticed that the user/sys/real time on my i9 machine fluctuates
constantly, could be things
like:
real    6m52.087s
user    67m12.463s
sys     13m8.281s
...

real    7m42.937s
user    66m55.250s
sys     12m56.330s
...

real    6m49.374s
user    66m37.040s
sys     12m44.542s
...

real    6m54.205s
user    65m49.732s
sys     11m33.078s
...

likely due to unstable temperatures and I/O latency. As a result, my
data doesn’t seem
reference-worthy.

As a phone engineer, we never use phones to run kernel builds. I'm also
quite certain that phones won't provide stable and reliable data for this
type of workload. Without access to a Linux server to conduct the test,
I really need your help.

I used to work on optimizing the ARM server scheduler and memory
management, and I really miss that machine I had until three years ago :-)

>
> > I need Usama's assistance to identify a suitable patch, as I lack
> > access to hardware such as AMD machines and ARM servers with TLB
> > optimization.
> >
> > [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
> > [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
> >
> > Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > Cc: Usama Arif <usamaarif642@gmail.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: Chris Li <chrisl@kernel.org>
> > Cc: Yosry Ahmed <yosryahmed@google.com>
> > Cc: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Kairui Song <kasong@tencent.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Shakeel Butt <shakeel.butt@linux.dev>
> > Cc: Muchun Song <muchun.song@linux.dev>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/memcontrol.h |  9 ++++++++
> >  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
> >  mm/memory.c                | 17 ++++++++++++++
> >  3 files changed, 71 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 524006313b0d..8bcc8f4af39f 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> >  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >               long nr_pages);
> >
> > +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> > +                             swp_entry_t *entry);
> > +
> >  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >                                 gfp_t gfp, swp_entry_t entry);
> >
> > @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> >       return 0;
> >  }
> >
> > +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> > +             swp_entry_t *entry)
> > +{
> > +     return 0;
> > +}
> > +
> >  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> >                       struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> >  {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 17af08367c68..f3d92b93ea6d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >       return 0;
> >  }
> >
> > +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
> > +{
> > +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> > +             if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
>
> There might be 3 issues with the approach:
>
> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have
> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are
> basically saying you need 512M free memory to swapin just 256K?

Right, sorry for the noisy code. I was just thinking about 4KB pages
and wondering
if we could simplify the code.

>
> Its an uneven margin for different folio sizes.
> For 16K folio swapin, you are checking if there is margin for 128 folios,
> but for 1M folio swapin, you are checking there is margin for just 2 folios.
>
> Maybe it might be better to make this dependent on some factor of folio_nr_pages?

Agreed. This is similar to what we discussed regarding your zswap mTHP
swap-in series:

 int mem_cgroup_swapin_charge_folio(...)
 {
       ...
       if (folio_test_large(folio) &&
           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
folio_nr_pages(folio)))
               ret = -ENOMEM;
       else
               ret = charge_memcg(folio, memcg, gfp);
       ...
 }

As someone focused on phones, my challenge is the absence of stable platforms to
benchmark this type of workload. If possible, Usama, I would greatly
appreciate it if
you could take the lead on the patch.

>
> As Johannes pointed out, the charging code already does the margin check.
> So for 4K, the check just checks if there is 4K available, but for 16K it checks
> if a lot more than 16K is available. Maybe there should be a similar policy for
> all? I guess this is similar to my 2nd point, but just considers 4K folios as
> well.

I don't think the charging code performs a margin check. It simply
tries to charge
the specified nr_pages (whether 1 or more). If nr_pages are available,
the charge
proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to
reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages.

If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
immediately filling the memcg.

Shortly after, smaller folios—typically with blockable GFP—will quickly trigger
additional reclamation. While nr_pages - 1 subpages of the large folio may not
be immediately needed, they still occupy enough space to fill the memcg to
capacity.

My second point about the mitigation is as follows: For a system (or
memcg) under severe memory pressure, especially one without hardware TLB
optimization, is enabling mTHP always the right choice? Since mTHP operates at
a larger granularity, some internal fragmentation is unavoidable, regardless
of optimization. Could the mitigation code help in automatically tuning
this fragmentation?

>
> Thanks,
> Usama
>
>
> > +                     return false;
> > +     }
> > +
> > +     return true;
> > +}
> > +
> > +/**
> > + * mem_cgroup_swapin_precharge_large_folio: Precharge large folios.
> > + *
> > + * @mm: mm context of the victim
> > + * @entry: swap entry for which the folio will be allocated
> > + *
> > + * If we are arriving the edge of an almost full memcg, return error so that
> > + * swap-in and anon faults can quickly fall back to small folios to avoid swap
> > + * thrashing.
> > + *
> > + * Returns 0 on success, an error code on failure.
> > + */
> > +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, swp_entry_t *entry)
> > +{
> > +     struct mem_cgroup *memcg = NULL;
> > +     unsigned short id;
> > +     bool has_margin;
> > +
> > +     if (mem_cgroup_disabled())
> > +             return 0;
> > +
> > +     rcu_read_lock();
> > +     if (entry) {
> > +             id = lookup_swap_cgroup_id(*entry);
> > +             memcg = mem_cgroup_from_id(id);
> > +     }
> > +     if (!memcg || !css_tryget_online(&memcg->css))
> > +             memcg = get_mem_cgroup_from_mm(mm);
> > +     has_margin = mem_cgroup_has_margin(memcg);
> > +     rcu_read_unlock();
> > +
> > +     css_put(&memcg->css);
> > +     return has_margin ? 0 : -ENOMEM;
> > +}
> > +
> >  /**
> >   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
> >   * @folio: folio to charge.
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0f614523b9f4..96368ba0e8a6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4168,6 +4168,16 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >
> >       pte_unmap_unlock(pte, ptl);
> >
> > +     if (!orders)
> > +             goto fallback;
> > +
> > +     /*
> > +      * Avoid swapping in large folios when memcg is nearly full, as it
> > +      * may quickly trigger additional swap-out and swap-in cycles.
> > +      */
> > +     if (mem_cgroup_precharge_large_folio(vma->vm_mm, &entry))
> > +             goto fallback;
> > +
> >       /* Try allocating the highest of the remaining orders. */
> >       gfp = vma_thp_gfp_mask(vma);
> >       while (orders) {
> > @@ -4707,6 +4717,13 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >       if (!orders)
> >               goto fallback;
> >
> > +     /*
> > +      * When memcg is nearly full, large folios can rapidly fill
> > +      * the margin and trigger new reclamation
> > +      */
> > +     if (mem_cgroup_precharge_large_folio(vma->vm_mm, NULL))
> > +             goto fallback;
> > +
> >       /* Try allocating the highest of the remaining orders. */
> >       gfp = vma_thp_gfp_mask(vma);
> >       while (orders) {
>

Thanks
Barry
Usama Arif Oct. 30, 2024, 2:51 p.m. UTC | #3
On 28/10/2024 22:03, Barry Song wrote:
> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 27/10/2024 01:14, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> In a memcg where mTHP is always utilized, even at full capacity, it
>>> might not be the best option. Consider a system that uses only small
>>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
>>> of buffer space before it can initiate the next reclamation. However,
>>> large folios can quickly fill this space, rapidly bringing the memcg
>>> back to full capacity, even though some portions of the large folios
>>> may not be immediately needed and used by the process.
>>>
>>> Usama and Kanchana identified a regression when building the kernel in
>>> a memcg with memory.max set to a small value while enabling large
>>> folio swap-in support on zswap[1].
>>>
>>> The issue arises from an edge case where the memory cgroup remains
>>> nearly full most of the time. Consequently, bringing in mTHP can
>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
>>>
>>> We need a mechanism to stop the cup from overflowing continuously.
>>> One potential solution is to slow the filling process when we identify
>>> that the cup is nearly full.
>>>
>>> Usama reported an improvement when we mitigate mTHP swap-in as the
>>> memcg approaches full capacity[2]:
>>>
>>> int mem_cgroup_swapin_charge_folio(...)
>>> {
>>>       ...
>>>       if (folio_test_large(folio) &&
>>>           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
>>>               ret = -ENOMEM;
>>>       else
>>>               ret = charge_memcg(folio, memcg, gfp);
>>>       ...
>>> }
>>>
>>> AMD 16K+32K THP=always
>>> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
>>> real           1m23.038s        1m23.050s                                   1m22.704s
>>> user           53m57.210s       53m53.437s                                  53m52.577s
>>> sys            7m24.592s        7m48.843s                                   7m22.519s
>>> zswpin         612070           999244                                      815934
>>> zswpout        2226403          2347979                                     2054980
>>> pgfault        20667366         20481728                                    20478690
>>> pgmajfault     385887           269117                                      309702
>>>
>>> AMD 16K+32K+64K THP=always
>>> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
>>> real           1m22.975s        1m23.266s                                  1m22.549s
>>> user           53m51.302s       53m51.069s                                 53m46.471s
>>> sys            7m40.168s        7m57.104s                                  7m25.012s
>>> zswpin         676492           1258573                                    1225703
>>> zswpout        2449839          2714767                                    2899178
>>> pgfault        17540746         17296555                                   17234663
>>> pgmajfault     429629           307495                                     287859
>>>
>>> I wonder if we can extend the mitigation to do_anonymous_page() as
>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
>>> builds in a memcg with memory.max set to 1 GiB.
>>>
>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>
>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
>>>
>>>             disable-mTHP-swapin  mm-unstable  with-this-patch
>>> Real:            6m54.595s      7m4.832s       6m45.811s
>>> User:            66m42.795s     66m59.984s     67m21.150s
>>> Sys:             12m7.092s      15m18.153s     12m52.644s
>>> pswpin:          4262327        11723248       5918690
>>> pswpout:         14883774       19574347       14026942
>>> 64k-swpout:      624447         889384         480039
>>> 32k-swpout:      115473         242288         73874
>>> 16k-swpout:      158203         294672         109142
>>> 64k-swpin:       0              495869         159061
>>> 32k-swpin:       0              219977         56158
>>> 16k-swpin:       0              223501         81445
>>>
>>
> 
> Hi Usama,
> 
>> hmm, both the user and sys time are worse with the patch compared to
>> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
>> repeat the experiment the real time might be worse as well?
> 
> Well, I've improved my script to include a loop:
> 
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> 
> for ((i=1; i<=100; i++))
> do
>   echo "Executing round $i"
>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>   echo 3 > /proc/sys/vm/drop_caches
>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
>   cat /proc/vmstat | grep pswp
>   echo -n 64k-swpout: ; cat
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
>   echo -n 32k-swpout: ; cat
> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
>   echo -n 16k-swpout: ; cat
> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
>   echo -n 64k-swpin: ; cat
> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
>   echo -n 32k-swpin: ; cat
> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
>   echo -n 16k-swpin: ; cat
> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
> done
> 
> I've noticed that the user/sys/real time on my i9 machine fluctuates
> constantly, could be things
> like:
> real    6m52.087s
> user    67m12.463s
> sys     13m8.281s
> ...
> 
> real    7m42.937s
> user    66m55.250s
> sys     12m56.330s
> ...
> 
> real    6m49.374s
> user    66m37.040s
> sys     12m44.542s
> ...
> 
> real    6m54.205s
> user    65m49.732s
> sys     11m33.078s
> ...
> 
> likely due to unstable temperatures and I/O latency. As a result, my
> data doesn’t seem
> reference-worthy.
> 

So I had suggested retrying the experiment to see how reproducible it is,
but had not done that myself!
Thanks for sharing this. I tried many times on the AMD server and I see
varying numbers as well.

AMD 16K THP always, cgroup = 4G, large folio zswapin patches
real    1m28.351s
user    54m14.476s
sys     8m46.596s
zswpin 811693
zswpout 2137310
pgfault 27344671
pgmajfault 290510
..
real    1m24.557s
user    53m56.815s
sys     8m10.200s
zswpin 571532
zswpout 1645063
pgfault 26989075
pgmajfault 205177
..
real    1m26.083s                                                                                                                                                                                                                                                                                                  
user    54m5.303s                                                                                                                                                                                                                                                                                                  
sys     9m55.247s                                                                                                                                                                                                                                                                                                  
zswpin 1176292                                                                                                                                                                                                                                                                                                     
zswpout 2910825                                                                                                                                                                                                                                                                                                    
pgfault 27286835                                                                                                                                                                                                                                                                                                   
pgmajfault 419746   
 

The sys time can especially vary by large numbers. I think you see the same.


> As a phone engineer, we never use phones to run kernel builds. I'm also
> quite certain that phones won't provide stable and reliable data for this
> type of workload. Without access to a Linux server to conduct the test,
> I really need your help.
> 
> I used to work on optimizing the ARM server scheduler and memory
> management, and I really miss that machine I had until three years ago :-)
> 
>>
>>> I need Usama's assistance to identify a suitable patch, as I lack
>>> access to hardware such as AMD machines and ARM servers with TLB
>>> optimization.
>>>
>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
>>>
>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Chris Li <chrisl@kernel.org>
>>> Cc: Yosry Ahmed <yosryahmed@google.com>
>>> Cc: "Huang, Ying" <ying.huang@intel.com>
>>> Cc: Kairui Song <kasong@tencent.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>> Cc: Shakeel Butt <shakeel.butt@linux.dev>
>>> Cc: Muchun Song <muchun.song@linux.dev>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>  include/linux/memcontrol.h |  9 ++++++++
>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
>>>  mm/memory.c                | 17 ++++++++++++++
>>>  3 files changed, 71 insertions(+)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 524006313b0d..8bcc8f4af39f 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>               long nr_pages);
>>>
>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>> +                             swp_entry_t *entry);
>>> +
>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>>>                                 gfp_t gfp, swp_entry_t entry);
>>>
>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
>>>       return 0;
>>>  }
>>>
>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>> +             swp_entry_t *entry)
>>> +{
>>> +     return 0;
>>> +}
>>> +
>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>>>                       struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>>>  {
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 17af08367c68..f3d92b93ea6d 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>       return 0;
>>>  }
>>>
>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
>>> +{
>>> +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>> +             if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
>>
>> There might be 3 issues with the approach:
>>
>> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have
>> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are
>> basically saying you need 512M free memory to swapin just 256K?
> 
> Right, sorry for the noisy code. I was just thinking about 4KB pages
> and wondering
> if we could simplify the code.
> 
>>
>> Its an uneven margin for different folio sizes.
>> For 16K folio swapin, you are checking if there is margin for 128 folios,
>> but for 1M folio swapin, you are checking there is margin for just 2 folios.
>>
>> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
> 
> Agreed. This is similar to what we discussed regarding your zswap mTHP
> swap-in series:
> 
>  int mem_cgroup_swapin_charge_folio(...)
>  {
>        ...
>        if (folio_test_large(folio) &&
>            mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
> folio_nr_pages(folio)))
>                ret = -ENOMEM;
>        else
>                ret = charge_memcg(folio, memcg, gfp);
>        ...
>  }
> 
> As someone focused on phones, my challenge is the absence of stable platforms to
> benchmark this type of workload. If possible, Usama, I would greatly
> appreciate it if
> you could take the lead on the patch.
> 
>>
>> As Johannes pointed out, the charging code already does the margin check.
>> So for 4K, the check just checks if there is 4K available, but for 16K it checks
>> if a lot more than 16K is available. Maybe there should be a similar policy for
>> all? I guess this is similar to my 2nd point, but just considers 4K folios as
>> well.
> 
> I don't think the charging code performs a margin check. It simply
> tries to charge
> the specified nr_pages (whether 1 or more). If nr_pages are available,
> the charge
> proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to
> reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages.
> 

So if you have defrag not set to always, it will not trigger reclamation.
I think that is a bigger usecase, i.e. defrag=madvise,defer,etc is probably
used much more then always.

In the current code in that case try_charge_memcg will return -ENOMEM all
the way to mem_cgroup_swapin_charge_folio and alloc_swap_folio will then
try the next order. So eventhough it might not be calling the mem_cgroup_margin
function, it is kind of is doing the same?

> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
> large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
> immediately filling the memcg.
> 
> Shortly after, smaller folios—typically with blockable GFP—will quickly trigger
> additional reclamation. While nr_pages - 1 subpages of the large folio may not
> be immediately needed, they still occupy enough space to fill the memcg to
> capacity.
> 
> My second point about the mitigation is as follows: For a system (or
> memcg) under severe memory pressure, especially one without hardware TLB
> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> a larger granularity, some internal fragmentation is unavoidable, regardless
> of optimization. Could the mitigation code help in automatically tuning
> this fragmentation?
> 

I agree with the point that enabling mTHP always is not the right thing to do
on all platforms. I also think it might be the case that enabling mTHP
might be a good thing for some workloads, but enabling mTHP swapin along with
it might not.

As you said when you have apps switching between foreground and background
in android, it probably makes sense to have large folio swapping, as you
want to bringin all the pages from background app as quickly as possible.
And also all the TLB optimizations and smaller lru overhead you get after
you have brought in all the pages.
Linux kernel build test doesnt really get to benefit from the TLB optimization
and smaller lru overhead, as probably the pages are very short lived. So I
think it doesnt show the benefit of large folio swapin properly and
large folio swapin should probably be disabled for this kind of workload,
eventhough mTHP should be enabled.

I am not sure that the approach we are trying in this patch is the right way:
- This patch makes it a memcg issue, but you could have memcg disabled and
then the mitigation being tried here wont apply.
- Instead of this being a large folio swapin issue, is it more of a readahead
issue? If we zswap (without the large folio swapin series) and change the window
to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
when cgroup memory is limited as readahead would probably cause swap thrashing as
well.
- Instead of looking at cgroup margin, maybe we should try and look at
the rate of change of workingset_restore_anon? This might be a lot more complicated
to do, but probably is the right metric to determine swap thrashing. It also means
that this could be used in both the synchronous swapcache skipping path and
swapin_readahead path.
(Thanks Johannes for suggesting this)

With the large folio swapin, I do see the large improvement when considering only
swapin performance and latency in the same way as you saw in zram.
Maybe the right short term approach is to have
/sys/kernel/mm/transparent_hugepage/swapin
and have that disabled by default to avoid regression.
If the workload owner sees a benefit, they can enable it.
I can add this when sending the next version of large folio zswapin if that makes
sense?
Longer term I can try and have a look at if we can do something with
workingset_restore_anon to improve things.

Thanks,
Usama
Yosry Ahmed Oct. 30, 2024, 7:51 p.m. UTC | #4
[..]
> > My second point about the mitigation is as follows: For a system (or
> > memcg) under severe memory pressure, especially one without hardware TLB
> > optimization, is enabling mTHP always the right choice? Since mTHP operates at
> > a larger granularity, some internal fragmentation is unavoidable, regardless
> > of optimization. Could the mitigation code help in automatically tuning
> > this fragmentation?
> >
>
> I agree with the point that enabling mTHP always is not the right thing to do
> on all platforms. I also think it might be the case that enabling mTHP
> might be a good thing for some workloads, but enabling mTHP swapin along with
> it might not.
>
> As you said when you have apps switching between foreground and background
> in android, it probably makes sense to have large folio swapping, as you
> want to bringin all the pages from background app as quickly as possible.
> And also all the TLB optimizations and smaller lru overhead you get after
> you have brought in all the pages.
> Linux kernel build test doesnt really get to benefit from the TLB optimization
> and smaller lru overhead, as probably the pages are very short lived. So I
> think it doesnt show the benefit of large folio swapin properly and
> large folio swapin should probably be disabled for this kind of workload,
> eventhough mTHP should be enabled.
>
> I am not sure that the approach we are trying in this patch is the right way:
> - This patch makes it a memcg issue, but you could have memcg disabled and
> then the mitigation being tried here wont apply.

Is the problem reproducible without memcg? I imagine only if the
entire system is under memory pressure. I guess we would want the same
"mitigation" either way.

> - Instead of this being a large folio swapin issue, is it more of a readahead
> issue? If we zswap (without the large folio swapin series) and change the window
> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> when cgroup memory is limited as readahead would probably cause swap thrashing as
> well.

I think large folio swapin would make the problem worse anyway. I am
also not sure if the readahead window adjusts on memory pressure or
not.

> - Instead of looking at cgroup margin, maybe we should try and look at
> the rate of change of workingset_restore_anon? This might be a lot more complicated
> to do, but probably is the right metric to determine swap thrashing. It also means
> that this could be used in both the synchronous swapcache skipping path and
> swapin_readahead path.
> (Thanks Johannes for suggesting this)
>
> With the large folio swapin, I do see the large improvement when considering only
> swapin performance and latency in the same way as you saw in zram.
> Maybe the right short term approach is to have
> /sys/kernel/mm/transparent_hugepage/swapin
> and have that disabled by default to avoid regression.
> If the workload owner sees a benefit, they can enable it.
> I can add this when sending the next version of large folio zswapin if that makes
> sense?

I would honestly prefer we avoid this if possible. It's always easy to
just put features behind knobs, and then users have the toil of
figuring out if/when they can use it, or just give up. We should find
a way to avoid the thrashing due to hitting the memcg limit (or being
under global memory pressure), it seems like something the kernel
should be able to do on its own.

> Longer term I can try and have a look at if we can do something with
> workingset_restore_anon to improve things.

I am not a big fan of this, mainly because reading a stat from the
kernel puts us in a situation where we have to choose between:
- Doing a memcg stats flush in the kernel, which is something we are
trying to move away from due to various problems we have been running
into.
- Using potentially stale stats (up to 2s), which may be fine but is
suboptimal at best. We may have blips of thrashing due to stale stats
not showing the refaults.
Usama Arif Oct. 30, 2024, 8:25 p.m. UTC | #5
On 30/10/2024 19:51, Yosry Ahmed wrote:
> [..]
>>> My second point about the mitigation is as follows: For a system (or
>>> memcg) under severe memory pressure, especially one without hardware TLB
>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>> of optimization. Could the mitigation code help in automatically tuning
>>> this fragmentation?
>>>
>>
>> I agree with the point that enabling mTHP always is not the right thing to do
>> on all platforms. I also think it might be the case that enabling mTHP
>> might be a good thing for some workloads, but enabling mTHP swapin along with
>> it might not.
>>
>> As you said when you have apps switching between foreground and background
>> in android, it probably makes sense to have large folio swapping, as you
>> want to bringin all the pages from background app as quickly as possible.
>> And also all the TLB optimizations and smaller lru overhead you get after
>> you have brought in all the pages.
>> Linux kernel build test doesnt really get to benefit from the TLB optimization
>> and smaller lru overhead, as probably the pages are very short lived. So I
>> think it doesnt show the benefit of large folio swapin properly and
>> large folio swapin should probably be disabled for this kind of workload,
>> eventhough mTHP should be enabled.
>>
>> I am not sure that the approach we are trying in this patch is the right way:
>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> then the mitigation being tried here wont apply.
> 
> Is the problem reproducible without memcg? I imagine only if the
> entire system is under memory pressure. I guess we would want the same
> "mitigation" either way.
> 
What would be a good open source benchmark/workload to test without limiting memory
in memcg?
For the kernel build test, I can only get zswap activity to happen if I build
in cgroup and limit memory.max.

I can just run zswap large folio zswapin in production and see, but that will take me a few
days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
then maybe its not really an issue? I believe Barry doesnt see an issue in android
phones (but please correct me if I am wrong), and if there isnt an issue in Meta 
production as well, its a good data point for servers as well. And maybe
kernel build in 4G memcg is not a good test.

>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> issue? If we zswap (without the large folio swapin series) and change the window
>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>> well.
> 
> I think large folio swapin would make the problem worse anyway. I am
> also not sure if the readahead window adjusts on memory pressure or
> not.
> 
readahead window doesnt look at memory pressure. So maybe the same thing is being
seen here as there would be in swapin_readahead? Maybe if we check kernel build test
performance in 4G memcg with below diff, it might get better?  

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4669f29cf555..9e196e1e6885 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
        pgoff_t ilx;
        bool page_allocated;
 
-       win = swap_vma_ra_win(vmf, &start, &end);
+       win = 1;
        if (win == 1)
                goto skip;

>> - Instead of looking at cgroup margin, maybe we should try and look at
>> the rate of change of workingset_restore_anon? This might be a lot more complicated
>> to do, but probably is the right metric to determine swap thrashing. It also means
>> that this could be used in both the synchronous swapcache skipping path and
>> swapin_readahead path.
>> (Thanks Johannes for suggesting this)
>>
>> With the large folio swapin, I do see the large improvement when considering only
>> swapin performance and latency in the same way as you saw in zram.
>> Maybe the right short term approach is to have
>> /sys/kernel/mm/transparent_hugepage/swapin
>> and have that disabled by default to avoid regression.
>> If the workload owner sees a benefit, they can enable it.
>> I can add this when sending the next version of large folio zswapin if that makes
>> sense?
> 
> I would honestly prefer we avoid this if possible. It's always easy to
> just put features behind knobs, and then users have the toil of
> figuring out if/when they can use it, or just give up. We should find
> a way to avoid the thrashing due to hitting the memcg limit (or being
> under global memory pressure), it seems like something the kernel
> should be able to do on its own.
> 
>> Longer term I can try and have a look at if we can do something with
>> workingset_restore_anon to improve things.
> 
> I am not a big fan of this, mainly because reading a stat from the
> kernel puts us in a situation where we have to choose between:
> - Doing a memcg stats flush in the kernel, which is something we are
> trying to move away from due to various problems we have been running
> into.
> - Using potentially stale stats (up to 2s), which may be fine but is
> suboptimal at best. We may have blips of thrashing due to stale stats
> not showing the refaults.
Barry Song Oct. 30, 2024, 8:27 p.m. UTC | #6
On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 28/10/2024 22:03, Barry Song wrote:
> > On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 27/10/2024 01:14, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> In a memcg where mTHP is always utilized, even at full capacity, it
> >>> might not be the best option. Consider a system that uses only small
> >>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
> >>> of buffer space before it can initiate the next reclamation. However,
> >>> large folios can quickly fill this space, rapidly bringing the memcg
> >>> back to full capacity, even though some portions of the large folios
> >>> may not be immediately needed and used by the process.
> >>>
> >>> Usama and Kanchana identified a regression when building the kernel in
> >>> a memcg with memory.max set to a small value while enabling large
> >>> folio swap-in support on zswap[1].
> >>>
> >>> The issue arises from an edge case where the memory cgroup remains
> >>> nearly full most of the time. Consequently, bringing in mTHP can
> >>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
> >>> swap-in then recreates the overflow, resulting in a repetitive cycle.
> >>>
> >>> We need a mechanism to stop the cup from overflowing continuously.
> >>> One potential solution is to slow the filling process when we identify
> >>> that the cup is nearly full.
> >>>
> >>> Usama reported an improvement when we mitigate mTHP swap-in as the
> >>> memcg approaches full capacity[2]:
> >>>
> >>> int mem_cgroup_swapin_charge_folio(...)
> >>> {
> >>>       ...
> >>>       if (folio_test_large(folio) &&
> >>>           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> >>>               ret = -ENOMEM;
> >>>       else
> >>>               ret = charge_memcg(folio, memcg, gfp);
> >>>       ...
> >>> }
> >>>
> >>> AMD 16K+32K THP=always
> >>> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
> >>> real           1m23.038s        1m23.050s                                   1m22.704s
> >>> user           53m57.210s       53m53.437s                                  53m52.577s
> >>> sys            7m24.592s        7m48.843s                                   7m22.519s
> >>> zswpin         612070           999244                                      815934
> >>> zswpout        2226403          2347979                                     2054980
> >>> pgfault        20667366         20481728                                    20478690
> >>> pgmajfault     385887           269117                                      309702
> >>>
> >>> AMD 16K+32K+64K THP=always
> >>> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >>> real           1m22.975s        1m23.266s                                  1m22.549s
> >>> user           53m51.302s       53m51.069s                                 53m46.471s
> >>> sys            7m40.168s        7m57.104s                                  7m25.012s
> >>> zswpin         676492           1258573                                    1225703
> >>> zswpout        2449839          2714767                                    2899178
> >>> pgfault        17540746         17296555                                   17234663
> >>> pgmajfault     429629           307495                                     287859
> >>>
> >>> I wonder if we can extend the mitigation to do_anonymous_page() as
> >>> well. Without hardware like AMD and ARM with hardware TLB coalescing
> >>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
> >>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
> >>> builds in a memcg with memory.max set to 1 GiB.
> >>>
> >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>
> >>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
> >>>
> >>>             disable-mTHP-swapin  mm-unstable  with-this-patch
> >>> Real:            6m54.595s      7m4.832s       6m45.811s
> >>> User:            66m42.795s     66m59.984s     67m21.150s
> >>> Sys:             12m7.092s      15m18.153s     12m52.644s
> >>> pswpin:          4262327        11723248       5918690
> >>> pswpout:         14883774       19574347       14026942
> >>> 64k-swpout:      624447         889384         480039
> >>> 32k-swpout:      115473         242288         73874
> >>> 16k-swpout:      158203         294672         109142
> >>> 64k-swpin:       0              495869         159061
> >>> 32k-swpin:       0              219977         56158
> >>> 16k-swpin:       0              223501         81445
> >>>
> >>
> >
> > Hi Usama,
> >
> >> hmm, both the user and sys time are worse with the patch compared to
> >> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
> >> repeat the experiment the real time might be worse as well?
> >
> > Well, I've improved my script to include a loop:
> >
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >
> > for ((i=1; i<=100; i++))
> > do
> >   echo "Executing round $i"
> >   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >   echo 3 > /proc/sys/vm/drop_caches
> >   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
> >   cat /proc/vmstat | grep pswp
> >   echo -n 64k-swpout: ; cat
> > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
> >   echo -n 32k-swpout: ; cat
> > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
> >   echo -n 16k-swpout: ; cat
> > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
> >   echo -n 64k-swpin: ; cat
> > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
> >   echo -n 32k-swpin: ; cat
> > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
> >   echo -n 16k-swpin: ; cat
> > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
> > done
> >
> > I've noticed that the user/sys/real time on my i9 machine fluctuates
> > constantly, could be things
> > like:
> > real    6m52.087s
> > user    67m12.463s
> > sys     13m8.281s
> > ...
> >
> > real    7m42.937s
> > user    66m55.250s
> > sys     12m56.330s
> > ...
> >
> > real    6m49.374s
> > user    66m37.040s
> > sys     12m44.542s
> > ...
> >
> > real    6m54.205s
> > user    65m49.732s
> > sys     11m33.078s
> > ...
> >
> > likely due to unstable temperatures and I/O latency. As a result, my
> > data doesn’t seem
> > reference-worthy.
> >
>
> So I had suggested retrying the experiment to see how reproducible it is,
> but had not done that myself!
> Thanks for sharing this. I tried many times on the AMD server and I see
> varying numbers as well.
>
> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
> real    1m28.351s
> user    54m14.476s
> sys     8m46.596s
> zswpin 811693
> zswpout 2137310
> pgfault 27344671
> pgmajfault 290510
> ..
> real    1m24.557s
> user    53m56.815s
> sys     8m10.200s
> zswpin 571532
> zswpout 1645063
> pgfault 26989075
> pgmajfault 205177
> ..
> real    1m26.083s
> user    54m5.303s
> sys     9m55.247s
> zswpin 1176292
> zswpout 2910825
> pgfault 27286835
> pgmajfault 419746
>
>
> The sys time can especially vary by large numbers. I think you see the same.
>
>
> > As a phone engineer, we never use phones to run kernel builds. I'm also
> > quite certain that phones won't provide stable and reliable data for this
> > type of workload. Without access to a Linux server to conduct the test,
> > I really need your help.
> >
> > I used to work on optimizing the ARM server scheduler and memory
> > management, and I really miss that machine I had until three years ago :-)
> >
> >>
> >>> I need Usama's assistance to identify a suitable patch, as I lack
> >>> access to hardware such as AMD machines and ARM servers with TLB
> >>> optimization.
> >>>
> >>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
> >>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
> >>>
> >>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >>> Cc: Usama Arif <usamaarif642@gmail.com>
> >>> Cc: David Hildenbrand <david@redhat.com>
> >>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>> Cc: Chris Li <chrisl@kernel.org>
> >>> Cc: Yosry Ahmed <yosryahmed@google.com>
> >>> Cc: "Huang, Ying" <ying.huang@intel.com>
> >>> Cc: Kairui Song <kasong@tencent.com>
> >>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >>> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >>> Cc: Michal Hocko <mhocko@kernel.org>
> >>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> >>> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> >>> Cc: Muchun Song <muchun.song@linux.dev>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>>  include/linux/memcontrol.h |  9 ++++++++
> >>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
> >>>  mm/memory.c                | 17 ++++++++++++++
> >>>  3 files changed, 71 insertions(+)
> >>>
> >>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >>> index 524006313b0d..8bcc8f4af39f 100644
> >>> --- a/include/linux/memcontrol.h
> >>> +++ b/include/linux/memcontrol.h
> >>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> >>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>               long nr_pages);
> >>>
> >>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>> +                             swp_entry_t *entry);
> >>> +
> >>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>>                                 gfp_t gfp, swp_entry_t entry);
> >>>
> >>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> >>>       return 0;
> >>>  }
> >>>
> >>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>> +             swp_entry_t *entry)
> >>> +{
> >>> +     return 0;
> >>> +}
> >>> +
> >>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> >>>                       struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> >>>  {
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 17af08367c68..f3d92b93ea6d 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>       return 0;
> >>>  }
> >>>
> >>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
> >>> +{
> >>> +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> >>> +             if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
> >>
> >> There might be 3 issues with the approach:
> >>
> >> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have
> >> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are
> >> basically saying you need 512M free memory to swapin just 256K?
> >
> > Right, sorry for the noisy code. I was just thinking about 4KB pages
> > and wondering
> > if we could simplify the code.
> >
> >>
> >> Its an uneven margin for different folio sizes.
> >> For 16K folio swapin, you are checking if there is margin for 128 folios,
> >> but for 1M folio swapin, you are checking there is margin for just 2 folios.
> >>
> >> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
> >
> > Agreed. This is similar to what we discussed regarding your zswap mTHP
> > swap-in series:
> >
> >  int mem_cgroup_swapin_charge_folio(...)
> >  {
> >        ...
> >        if (folio_test_large(folio) &&
> >            mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
> > folio_nr_pages(folio)))
> >                ret = -ENOMEM;
> >        else
> >                ret = charge_memcg(folio, memcg, gfp);
> >        ...
> >  }
> >
> > As someone focused on phones, my challenge is the absence of stable platforms to
> > benchmark this type of workload. If possible, Usama, I would greatly
> > appreciate it if
> > you could take the lead on the patch.
> >
> >>
> >> As Johannes pointed out, the charging code already does the margin check.
> >> So for 4K, the check just checks if there is 4K available, but for 16K it checks
> >> if a lot more than 16K is available. Maybe there should be a similar policy for
> >> all? I guess this is similar to my 2nd point, but just considers 4K folios as
> >> well.
> >
> > I don't think the charging code performs a margin check. It simply
> > tries to charge
> > the specified nr_pages (whether 1 or more). If nr_pages are available,
> > the charge
> > proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to
> > reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages.
> >
>
> So if you have defrag not set to always, it will not trigger reclamation.
> I think that is a bigger usecase, i.e. defrag=madvise,defer,etc is probably
> used much more then always.
>
> In the current code in that case try_charge_memcg will return -ENOMEM all
> the way to mem_cgroup_swapin_charge_folio and alloc_swap_folio will then
> try the next order. So eventhough it might not be calling the mem_cgroup_margin
> function, it is kind of is doing the same?
>
> > If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
> > large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
> > immediately filling the memcg.
> >
> > Shortly after, smaller folios—typically with blockable GFP—will quickly trigger
> > additional reclamation. While nr_pages - 1 subpages of the large folio may not
> > be immediately needed, they still occupy enough space to fill the memcg to
> > capacity.
> >
> > My second point about the mitigation is as follows: For a system (or
> > memcg) under severe memory pressure, especially one without hardware TLB
> > optimization, is enabling mTHP always the right choice? Since mTHP operates at
> > a larger granularity, some internal fragmentation is unavoidable, regardless
> > of optimization. Could the mitigation code help in automatically tuning
> > this fragmentation?
> >
>
> I agree with the point that enabling mTHP always is not the right thing to do
> on all platforms. I also think it might be the case that enabling mTHP
> might be a good thing for some workloads, but enabling mTHP swapin along with
> it might not.
>
> As you said when you have apps switching between foreground and background
> in android, it probably makes sense to have large folio swapping, as you
> want to bringin all the pages from background app as quickly as possible.
> And also all the TLB optimizations and smaller lru overhead you get after
> you have brought in all the pages.
> Linux kernel build test doesnt really get to benefit from the TLB optimization
> and smaller lru overhead, as probably the pages are very short lived. So I
> think it doesnt show the benefit of large folio swapin properly and
> large folio swapin should probably be disabled for this kind of workload,
> eventhough mTHP should be enabled.

I'm not entirely sure if this applies to platforms without TLB
optimization, especially
in the absence of swap. In a memory-limited cgroup without swap, would
mTHP still
cause significant thrashing of file-backed folios? When a large swap
file is present,
the inability to swap in mTHP seems to act as a workaround for fragmentation,
allowing fragmented pages of the original mTHP from do_anonymous_page() to
remain in swap.

>
> I am not sure that the approach we are trying in this patch is the right way:
> - This patch makes it a memcg issue, but you could have memcg disabled and
> then the mitigation being tried here wont apply.
> - Instead of this being a large folio swapin issue, is it more of a readahead
> issue? If we zswap (without the large folio swapin series) and change the window
> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> when cgroup memory is limited as readahead would probably cause swap thrashing as
> well.
> - Instead of looking at cgroup margin, maybe we should try and look at
> the rate of change of workingset_restore_anon? This might be a lot more complicated
> to do, but probably is the right metric to determine swap thrashing. It also means
> that this could be used in both the synchronous swapcache skipping path and
> swapin_readahead path.
> (Thanks Johannes for suggesting this)
>
> With the large folio swapin, I do see the large improvement when considering only
> swapin performance and latency in the same way as you saw in zram.
> Maybe the right short term approach is to have
> /sys/kernel/mm/transparent_hugepage/swapin
> and have that disabled by default to avoid regression.

A crucial component is still missing—managing the compression and decompression
of multiple pages as a larger block. This could significantly reduce
system time and
potentially resolve the kernel build issue within a small memory
cgroup, even with
swap thrashing.

I’ll send an update ASAP so you can rebase for zswap.

> If the workload owner sees a benefit, they can enable it.
> I can add this when sending the next version of large folio zswapin if that makes
> sense?
> Longer term I can try and have a look at if we can do something with
> workingset_restore_anon to improve things.
>
> Thanks,
> Usama

Thanks
Barry
Usama Arif Oct. 30, 2024, 8:41 p.m. UTC | #7
On 30/10/2024 20:27, Barry Song wrote:
> On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 28/10/2024 22:03, Barry Song wrote:
>>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 27/10/2024 01:14, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> In a memcg where mTHP is always utilized, even at full capacity, it
>>>>> might not be the best option. Consider a system that uses only small
>>>>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
>>>>> of buffer space before it can initiate the next reclamation. However,
>>>>> large folios can quickly fill this space, rapidly bringing the memcg
>>>>> back to full capacity, even though some portions of the large folios
>>>>> may not be immediately needed and used by the process.
>>>>>
>>>>> Usama and Kanchana identified a regression when building the kernel in
>>>>> a memcg with memory.max set to a small value while enabling large
>>>>> folio swap-in support on zswap[1].
>>>>>
>>>>> The issue arises from an edge case where the memory cgroup remains
>>>>> nearly full most of the time. Consequently, bringing in mTHP can
>>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
>>>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
>>>>>
>>>>> We need a mechanism to stop the cup from overflowing continuously.
>>>>> One potential solution is to slow the filling process when we identify
>>>>> that the cup is nearly full.
>>>>>
>>>>> Usama reported an improvement when we mitigate mTHP swap-in as the
>>>>> memcg approaches full capacity[2]:
>>>>>
>>>>> int mem_cgroup_swapin_charge_folio(...)
>>>>> {
>>>>>       ...
>>>>>       if (folio_test_large(folio) &&
>>>>>           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
>>>>>               ret = -ENOMEM;
>>>>>       else
>>>>>               ret = charge_memcg(folio, memcg, gfp);
>>>>>       ...
>>>>> }
>>>>>
>>>>> AMD 16K+32K THP=always
>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
>>>>> real           1m23.038s        1m23.050s                                   1m22.704s
>>>>> user           53m57.210s       53m53.437s                                  53m52.577s
>>>>> sys            7m24.592s        7m48.843s                                   7m22.519s
>>>>> zswpin         612070           999244                                      815934
>>>>> zswpout        2226403          2347979                                     2054980
>>>>> pgfault        20667366         20481728                                    20478690
>>>>> pgmajfault     385887           269117                                      309702
>>>>>
>>>>> AMD 16K+32K+64K THP=always
>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
>>>>> real           1m22.975s        1m23.266s                                  1m22.549s
>>>>> user           53m51.302s       53m51.069s                                 53m46.471s
>>>>> sys            7m40.168s        7m57.104s                                  7m25.012s
>>>>> zswpin         676492           1258573                                    1225703
>>>>> zswpout        2449839          2714767                                    2899178
>>>>> pgfault        17540746         17296555                                   17234663
>>>>> pgmajfault     429629           307495                                     287859
>>>>>
>>>>> I wonder if we can extend the mitigation to do_anonymous_page() as
>>>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
>>>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
>>>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
>>>>> builds in a memcg with memory.max set to 1 GiB.
>>>>>
>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>>>
>>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
>>>>>
>>>>>             disable-mTHP-swapin  mm-unstable  with-this-patch
>>>>> Real:            6m54.595s      7m4.832s       6m45.811s
>>>>> User:            66m42.795s     66m59.984s     67m21.150s
>>>>> Sys:             12m7.092s      15m18.153s     12m52.644s
>>>>> pswpin:          4262327        11723248       5918690
>>>>> pswpout:         14883774       19574347       14026942
>>>>> 64k-swpout:      624447         889384         480039
>>>>> 32k-swpout:      115473         242288         73874
>>>>> 16k-swpout:      158203         294672         109142
>>>>> 64k-swpin:       0              495869         159061
>>>>> 32k-swpin:       0              219977         56158
>>>>> 16k-swpin:       0              223501         81445
>>>>>
>>>>
>>>
>>> Hi Usama,
>>>
>>>> hmm, both the user and sys time are worse with the patch compared to
>>>> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
>>>> repeat the experiment the real time might be worse as well?
>>>
>>> Well, I've improved my script to include a loop:
>>>
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>
>>> for ((i=1; i<=100; i++))
>>> do
>>>   echo "Executing round $i"
>>>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>>>   echo 3 > /proc/sys/vm/drop_caches
>>>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
>>>   cat /proc/vmstat | grep pswp
>>>   echo -n 64k-swpout: ; cat
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
>>>   echo -n 32k-swpout: ; cat
>>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
>>>   echo -n 16k-swpout: ; cat
>>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
>>>   echo -n 64k-swpin: ; cat
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
>>>   echo -n 32k-swpin: ; cat
>>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
>>>   echo -n 16k-swpin: ; cat
>>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
>>> done
>>>
>>> I've noticed that the user/sys/real time on my i9 machine fluctuates
>>> constantly, could be things
>>> like:
>>> real    6m52.087s
>>> user    67m12.463s
>>> sys     13m8.281s
>>> ...
>>>
>>> real    7m42.937s
>>> user    66m55.250s
>>> sys     12m56.330s
>>> ...
>>>
>>> real    6m49.374s
>>> user    66m37.040s
>>> sys     12m44.542s
>>> ...
>>>
>>> real    6m54.205s
>>> user    65m49.732s
>>> sys     11m33.078s
>>> ...
>>>
>>> likely due to unstable temperatures and I/O latency. As a result, my
>>> data doesn’t seem
>>> reference-worthy.
>>>
>>
>> So I had suggested retrying the experiment to see how reproducible it is,
>> but had not done that myself!
>> Thanks for sharing this. I tried many times on the AMD server and I see
>> varying numbers as well.
>>
>> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
>> real    1m28.351s
>> user    54m14.476s
>> sys     8m46.596s
>> zswpin 811693
>> zswpout 2137310
>> pgfault 27344671
>> pgmajfault 290510
>> ..
>> real    1m24.557s
>> user    53m56.815s
>> sys     8m10.200s
>> zswpin 571532
>> zswpout 1645063
>> pgfault 26989075
>> pgmajfault 205177
>> ..
>> real    1m26.083s
>> user    54m5.303s
>> sys     9m55.247s
>> zswpin 1176292
>> zswpout 2910825
>> pgfault 27286835
>> pgmajfault 419746
>>
>>
>> The sys time can especially vary by large numbers. I think you see the same.
>>
>>
>>> As a phone engineer, we never use phones to run kernel builds. I'm also
>>> quite certain that phones won't provide stable and reliable data for this
>>> type of workload. Without access to a Linux server to conduct the test,
>>> I really need your help.
>>>
>>> I used to work on optimizing the ARM server scheduler and memory
>>> management, and I really miss that machine I had until three years ago :-)
>>>
>>>>
>>>>> I need Usama's assistance to identify a suitable patch, as I lack
>>>>> access to hardware such as AMD machines and ARM servers with TLB
>>>>> optimization.
>>>>>
>>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
>>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
>>>>>
>>>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>>>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> Cc: Chris Li <chrisl@kernel.org>
>>>>> Cc: Yosry Ahmed <yosryahmed@google.com>
>>>>> Cc: "Huang, Ying" <ying.huang@intel.com>
>>>>> Cc: Kairui Song <kasong@tencent.com>
>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>>>> Cc: Shakeel Butt <shakeel.butt@linux.dev>
>>>>> Cc: Muchun Song <muchun.song@linux.dev>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> ---
>>>>>  include/linux/memcontrol.h |  9 ++++++++
>>>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
>>>>>  mm/memory.c                | 17 ++++++++++++++
>>>>>  3 files changed, 71 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>>>> index 524006313b0d..8bcc8f4af39f 100644
>>>>> --- a/include/linux/memcontrol.h
>>>>> +++ b/include/linux/memcontrol.h
>>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>>>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>>>               long nr_pages);
>>>>>
>>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>>>> +                             swp_entry_t *entry);
>>>>> +
>>>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>>>>>                                 gfp_t gfp, swp_entry_t entry);
>>>>>
>>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
>>>>>       return 0;
>>>>>  }
>>>>>
>>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>>>> +             swp_entry_t *entry)
>>>>> +{
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>>>>>                       struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>>>>>  {
>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>> index 17af08367c68..f3d92b93ea6d 100644
>>>>> --- a/mm/memcontrol.c
>>>>> +++ b/mm/memcontrol.c
>>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>>>       return 0;
>>>>>  }
>>>>>
>>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
>>>>> +{
>>>>> +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>>>> +             if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
>>>>
>>>> There might be 3 issues with the approach:
>>>>
>>>> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have
>>>> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are
>>>> basically saying you need 512M free memory to swapin just 256K?
>>>
>>> Right, sorry for the noisy code. I was just thinking about 4KB pages
>>> and wondering
>>> if we could simplify the code.
>>>
>>>>
>>>> Its an uneven margin for different folio sizes.
>>>> For 16K folio swapin, you are checking if there is margin for 128 folios,
>>>> but for 1M folio swapin, you are checking there is margin for just 2 folios.
>>>>
>>>> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
>>>
>>> Agreed. This is similar to what we discussed regarding your zswap mTHP
>>> swap-in series:
>>>
>>>  int mem_cgroup_swapin_charge_folio(...)
>>>  {
>>>        ...
>>>        if (folio_test_large(folio) &&
>>>            mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
>>> folio_nr_pages(folio)))
>>>                ret = -ENOMEM;
>>>        else
>>>                ret = charge_memcg(folio, memcg, gfp);
>>>        ...
>>>  }
>>>
>>> As someone focused on phones, my challenge is the absence of stable platforms to
>>> benchmark this type of workload. If possible, Usama, I would greatly
>>> appreciate it if
>>> you could take the lead on the patch.
>>>
>>>>
>>>> As Johannes pointed out, the charging code already does the margin check.
>>>> So for 4K, the check just checks if there is 4K available, but for 16K it checks
>>>> if a lot more than 16K is available. Maybe there should be a similar policy for
>>>> all? I guess this is similar to my 2nd point, but just considers 4K folios as
>>>> well.
>>>
>>> I don't think the charging code performs a margin check. It simply
>>> tries to charge
>>> the specified nr_pages (whether 1 or more). If nr_pages are available,
>>> the charge
>>> proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to
>>> reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages.
>>>
>>
>> So if you have defrag not set to always, it will not trigger reclamation.
>> I think that is a bigger usecase, i.e. defrag=madvise,defer,etc is probably
>> used much more then always.
>>
>> In the current code in that case try_charge_memcg will return -ENOMEM all
>> the way to mem_cgroup_swapin_charge_folio and alloc_swap_folio will then
>> try the next order. So eventhough it might not be calling the mem_cgroup_margin
>> function, it is kind of is doing the same?
>>
>>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
>>> large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
>>> immediately filling the memcg.
>>>
>>> Shortly after, smaller folios—typically with blockable GFP—will quickly trigger
>>> additional reclamation. While nr_pages - 1 subpages of the large folio may not
>>> be immediately needed, they still occupy enough space to fill the memcg to
>>> capacity.
>>>
>>> My second point about the mitigation is as follows: For a system (or
>>> memcg) under severe memory pressure, especially one without hardware TLB
>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>> of optimization. Could the mitigation code help in automatically tuning
>>> this fragmentation?
>>>
>>
>> I agree with the point that enabling mTHP always is not the right thing to do
>> on all platforms. I also think it might be the case that enabling mTHP
>> might be a good thing for some workloads, but enabling mTHP swapin along with
>> it might not.
>>
>> As you said when you have apps switching between foreground and background
>> in android, it probably makes sense to have large folio swapping, as you
>> want to bringin all the pages from background app as quickly as possible.
>> And also all the TLB optimizations and smaller lru overhead you get after
>> you have brought in all the pages.
>> Linux kernel build test doesnt really get to benefit from the TLB optimization
>> and smaller lru overhead, as probably the pages are very short lived. So I
>> think it doesnt show the benefit of large folio swapin properly and
>> large folio swapin should probably be disabled for this kind of workload,
>> eventhough mTHP should be enabled.
> 
> I'm not entirely sure if this applies to platforms without TLB
> optimization, especially
> in the absence of swap. In a memory-limited cgroup without swap, would
> mTHP still
> cause significant thrashing of file-backed folios? When a large swap
> file is present,
> the inability to swap in mTHP seems to act as a workaround for fragmentation,
> allowing fragmented pages of the original mTHP from do_anonymous_page() to
> remain in swap.
> 
>>
>> I am not sure that the approach we are trying in this patch is the right way:
>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> then the mitigation being tried here wont apply.
>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> issue? If we zswap (without the large folio swapin series) and change the window
>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>> well.
>> - Instead of looking at cgroup margin, maybe we should try and look at
>> the rate of change of workingset_restore_anon? This might be a lot more complicated
>> to do, but probably is the right metric to determine swap thrashing. It also means
>> that this could be used in both the synchronous swapcache skipping path and
>> swapin_readahead path.
>> (Thanks Johannes for suggesting this)
>>
>> With the large folio swapin, I do see the large improvement when considering only
>> swapin performance and latency in the same way as you saw in zram.
>> Maybe the right short term approach is to have
>> /sys/kernel/mm/transparent_hugepage/swapin
>> and have that disabled by default to avoid regression.
> 
> A crucial component is still missing—managing the compression and decompression
> of multiple pages as a larger block. This could significantly reduce
> system time and
> potentially resolve the kernel build issue within a small memory
> cgroup, even with
> swap thrashing.
> 
> I’ll send an update ASAP so you can rebase for zswap.

Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
Thats wont benefit zswap, right?
I actually had a few questions about it. Mainly that the benefit comes if the
pagefault happens on page 0 of the large folio. But if the page fault happens
on any other page, lets say page 1 of a 64K folio. then it will decompress the
entire 64K chunk and just copy page 1? (memcpy in zram_bvec_read_multi_pages_partial).
Could that cause a regression as you have to decompress a large chunk for just
getting 1 4K page?
If we assume uniform distribution of page faults, maybe it could make things worse?

I probably should ask all of this in that thread.

> 
>> If the workload owner sees a benefit, they can enable it.
>> I can add this when sending the next version of large folio zswapin if that makes
>> sense?
>> Longer term I can try and have a look at if we can do something with
>> workingset_restore_anon to improve things.
>>
>> Thanks,
>> Usama
> 
> Thanks
> Barry
Barry Song Oct. 30, 2024, 8:48 p.m. UTC | #8
On Thu, Oct 31, 2024 at 9:41 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 30/10/2024 20:27, Barry Song wrote:
> > On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 28/10/2024 22:03, Barry Song wrote:
> >>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 27/10/2024 01:14, Barry Song wrote:
> >>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> In a memcg where mTHP is always utilized, even at full capacity, it
> >>>>> might not be the best option. Consider a system that uses only small
> >>>>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
> >>>>> of buffer space before it can initiate the next reclamation. However,
> >>>>> large folios can quickly fill this space, rapidly bringing the memcg
> >>>>> back to full capacity, even though some portions of the large folios
> >>>>> may not be immediately needed and used by the process.
> >>>>>
> >>>>> Usama and Kanchana identified a regression when building the kernel in
> >>>>> a memcg with memory.max set to a small value while enabling large
> >>>>> folio swap-in support on zswap[1].
> >>>>>
> >>>>> The issue arises from an edge case where the memory cgroup remains
> >>>>> nearly full most of the time. Consequently, bringing in mTHP can
> >>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
> >>>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
> >>>>>
> >>>>> We need a mechanism to stop the cup from overflowing continuously.
> >>>>> One potential solution is to slow the filling process when we identify
> >>>>> that the cup is nearly full.
> >>>>>
> >>>>> Usama reported an improvement when we mitigate mTHP swap-in as the
> >>>>> memcg approaches full capacity[2]:
> >>>>>
> >>>>> int mem_cgroup_swapin_charge_folio(...)
> >>>>> {
> >>>>>       ...
> >>>>>       if (folio_test_large(folio) &&
> >>>>>           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> >>>>>               ret = -ENOMEM;
> >>>>>       else
> >>>>>               ret = charge_memcg(folio, memcg, gfp);
> >>>>>       ...
> >>>>> }
> >>>>>
> >>>>> AMD 16K+32K THP=always
> >>>>> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
> >>>>> real           1m23.038s        1m23.050s                                   1m22.704s
> >>>>> user           53m57.210s       53m53.437s                                  53m52.577s
> >>>>> sys            7m24.592s        7m48.843s                                   7m22.519s
> >>>>> zswpin         612070           999244                                      815934
> >>>>> zswpout        2226403          2347979                                     2054980
> >>>>> pgfault        20667366         20481728                                    20478690
> >>>>> pgmajfault     385887           269117                                      309702
> >>>>>
> >>>>> AMD 16K+32K+64K THP=always
> >>>>> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >>>>> real           1m22.975s        1m23.266s                                  1m22.549s
> >>>>> user           53m51.302s       53m51.069s                                 53m46.471s
> >>>>> sys            7m40.168s        7m57.104s                                  7m25.012s
> >>>>> zswpin         676492           1258573                                    1225703
> >>>>> zswpout        2449839          2714767                                    2899178
> >>>>> pgfault        17540746         17296555                                   17234663
> >>>>> pgmajfault     429629           307495                                     287859
> >>>>>
> >>>>> I wonder if we can extend the mitigation to do_anonymous_page() as
> >>>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
> >>>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
> >>>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
> >>>>> builds in a memcg with memory.max set to 1 GiB.
> >>>>>
> >>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>>>
> >>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
> >>>>>
> >>>>>             disable-mTHP-swapin  mm-unstable  with-this-patch
> >>>>> Real:            6m54.595s      7m4.832s       6m45.811s
> >>>>> User:            66m42.795s     66m59.984s     67m21.150s
> >>>>> Sys:             12m7.092s      15m18.153s     12m52.644s
> >>>>> pswpin:          4262327        11723248       5918690
> >>>>> pswpout:         14883774       19574347       14026942
> >>>>> 64k-swpout:      624447         889384         480039
> >>>>> 32k-swpout:      115473         242288         73874
> >>>>> 16k-swpout:      158203         294672         109142
> >>>>> 64k-swpin:       0              495869         159061
> >>>>> 32k-swpin:       0              219977         56158
> >>>>> 16k-swpin:       0              223501         81445
> >>>>>
> >>>>
> >>>
> >>> Hi Usama,
> >>>
> >>>> hmm, both the user and sys time are worse with the patch compared to
> >>>> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
> >>>> repeat the experiment the real time might be worse as well?
> >>>
> >>> Well, I've improved my script to include a loop:
> >>>
> >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>
> >>> for ((i=1; i<=100; i++))
> >>> do
> >>>   echo "Executing round $i"
> >>>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >>>   echo 3 > /proc/sys/vm/drop_caches
> >>>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
> >>>   cat /proc/vmstat | grep pswp
> >>>   echo -n 64k-swpout: ; cat
> >>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
> >>>   echo -n 32k-swpout: ; cat
> >>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
> >>>   echo -n 16k-swpout: ; cat
> >>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
> >>>   echo -n 64k-swpin: ; cat
> >>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
> >>>   echo -n 32k-swpin: ; cat
> >>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
> >>>   echo -n 16k-swpin: ; cat
> >>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
> >>> done
> >>>
> >>> I've noticed that the user/sys/real time on my i9 machine fluctuates
> >>> constantly, could be things
> >>> like:
> >>> real    6m52.087s
> >>> user    67m12.463s
> >>> sys     13m8.281s
> >>> ...
> >>>
> >>> real    7m42.937s
> >>> user    66m55.250s
> >>> sys     12m56.330s
> >>> ...
> >>>
> >>> real    6m49.374s
> >>> user    66m37.040s
> >>> sys     12m44.542s
> >>> ...
> >>>
> >>> real    6m54.205s
> >>> user    65m49.732s
> >>> sys     11m33.078s
> >>> ...
> >>>
> >>> likely due to unstable temperatures and I/O latency. As a result, my
> >>> data doesn’t seem
> >>> reference-worthy.
> >>>
> >>
> >> So I had suggested retrying the experiment to see how reproducible it is,
> >> but had not done that myself!
> >> Thanks for sharing this. I tried many times on the AMD server and I see
> >> varying numbers as well.
> >>
> >> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
> >> real    1m28.351s
> >> user    54m14.476s
> >> sys     8m46.596s
> >> zswpin 811693
> >> zswpout 2137310
> >> pgfault 27344671
> >> pgmajfault 290510
> >> ..
> >> real    1m24.557s
> >> user    53m56.815s
> >> sys     8m10.200s
> >> zswpin 571532
> >> zswpout 1645063
> >> pgfault 26989075
> >> pgmajfault 205177
> >> ..
> >> real    1m26.083s
> >> user    54m5.303s
> >> sys     9m55.247s
> >> zswpin 1176292
> >> zswpout 2910825
> >> pgfault 27286835
> >> pgmajfault 419746
> >>
> >>
> >> The sys time can especially vary by large numbers. I think you see the same.
> >>
> >>
> >>> As a phone engineer, we never use phones to run kernel builds. I'm also
> >>> quite certain that phones won't provide stable and reliable data for this
> >>> type of workload. Without access to a Linux server to conduct the test,
> >>> I really need your help.
> >>>
> >>> I used to work on optimizing the ARM server scheduler and memory
> >>> management, and I really miss that machine I had until three years ago :-)
> >>>
> >>>>
> >>>>> I need Usama's assistance to identify a suitable patch, as I lack
> >>>>> access to hardware such as AMD machines and ARM servers with TLB
> >>>>> optimization.
> >>>>>
> >>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
> >>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
> >>>>>
> >>>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >>>>> Cc: Usama Arif <usamaarif642@gmail.com>
> >>>>> Cc: David Hildenbrand <david@redhat.com>
> >>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>>> Cc: Chris Li <chrisl@kernel.org>
> >>>>> Cc: Yosry Ahmed <yosryahmed@google.com>
> >>>>> Cc: "Huang, Ying" <ying.huang@intel.com>
> >>>>> Cc: Kairui Song <kasong@tencent.com>
> >>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >>>>> Cc: Michal Hocko <mhocko@kernel.org>
> >>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> >>>>> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> >>>>> Cc: Muchun Song <muchun.song@linux.dev>
> >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>> ---
> >>>>>  include/linux/memcontrol.h |  9 ++++++++
> >>>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
> >>>>>  mm/memory.c                | 17 ++++++++++++++
> >>>>>  3 files changed, 71 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >>>>> index 524006313b0d..8bcc8f4af39f 100644
> >>>>> --- a/include/linux/memcontrol.h
> >>>>> +++ b/include/linux/memcontrol.h
> >>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> >>>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>>>               long nr_pages);
> >>>>>
> >>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>>>> +                             swp_entry_t *entry);
> >>>>> +
> >>>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>>>>                                 gfp_t gfp, swp_entry_t entry);
> >>>>>
> >>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> >>>>>       return 0;
> >>>>>  }
> >>>>>
> >>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>>>> +             swp_entry_t *entry)
> >>>>> +{
> >>>>> +     return 0;
> >>>>> +}
> >>>>> +
> >>>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> >>>>>                       struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> >>>>>  {
> >>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>> index 17af08367c68..f3d92b93ea6d 100644
> >>>>> --- a/mm/memcontrol.c
> >>>>> +++ b/mm/memcontrol.c
> >>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>>>       return 0;
> >>>>>  }
> >>>>>
> >>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
> >>>>> +{
> >>>>> +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> >>>>> +             if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
> >>>>
> >>>> There might be 3 issues with the approach:
> >>>>
> >>>> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have
> >>>> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are
> >>>> basically saying you need 512M free memory to swapin just 256K?
> >>>
> >>> Right, sorry for the noisy code. I was just thinking about 4KB pages
> >>> and wondering
> >>> if we could simplify the code.
> >>>
> >>>>
> >>>> Its an uneven margin for different folio sizes.
> >>>> For 16K folio swapin, you are checking if there is margin for 128 folios,
> >>>> but for 1M folio swapin, you are checking there is margin for just 2 folios.
> >>>>
> >>>> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
> >>>
> >>> Agreed. This is similar to what we discussed regarding your zswap mTHP
> >>> swap-in series:
> >>>
> >>>  int mem_cgroup_swapin_charge_folio(...)
> >>>  {
> >>>        ...
> >>>        if (folio_test_large(folio) &&
> >>>            mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
> >>> folio_nr_pages(folio)))
> >>>                ret = -ENOMEM;
> >>>        else
> >>>                ret = charge_memcg(folio, memcg, gfp);
> >>>        ...
> >>>  }
> >>>
> >>> As someone focused on phones, my challenge is the absence of stable platforms to
> >>> benchmark this type of workload. If possible, Usama, I would greatly
> >>> appreciate it if
> >>> you could take the lead on the patch.
> >>>
> >>>>
> >>>> As Johannes pointed out, the charging code already does the margin check.
> >>>> So for 4K, the check just checks if there is 4K available, but for 16K it checks
> >>>> if a lot more than 16K is available. Maybe there should be a similar policy for
> >>>> all? I guess this is similar to my 2nd point, but just considers 4K folios as
> >>>> well.
> >>>
> >>> I don't think the charging code performs a margin check. It simply
> >>> tries to charge
> >>> the specified nr_pages (whether 1 or more). If nr_pages are available,
> >>> the charge
> >>> proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to
> >>> reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages.
> >>>
> >>
> >> So if you have defrag not set to always, it will not trigger reclamation.
> >> I think that is a bigger usecase, i.e. defrag=madvise,defer,etc is probably
> >> used much more then always.
> >>
> >> In the current code in that case try_charge_memcg will return -ENOMEM all
> >> the way to mem_cgroup_swapin_charge_folio and alloc_swap_folio will then
> >> try the next order. So eventhough it might not be calling the mem_cgroup_margin
> >> function, it is kind of is doing the same?
> >>
> >>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
> >>> large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
> >>> immediately filling the memcg.
> >>>
> >>> Shortly after, smaller folios—typically with blockable GFP—will quickly trigger
> >>> additional reclamation. While nr_pages - 1 subpages of the large folio may not
> >>> be immediately needed, they still occupy enough space to fill the memcg to
> >>> capacity.
> >>>
> >>> My second point about the mitigation is as follows: For a system (or
> >>> memcg) under severe memory pressure, especially one without hardware TLB
> >>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> >>> a larger granularity, some internal fragmentation is unavoidable, regardless
> >>> of optimization. Could the mitigation code help in automatically tuning
> >>> this fragmentation?
> >>>
> >>
> >> I agree with the point that enabling mTHP always is not the right thing to do
> >> on all platforms. I also think it might be the case that enabling mTHP
> >> might be a good thing for some workloads, but enabling mTHP swapin along with
> >> it might not.
> >>
> >> As you said when you have apps switching between foreground and background
> >> in android, it probably makes sense to have large folio swapping, as you
> >> want to bringin all the pages from background app as quickly as possible.
> >> And also all the TLB optimizations and smaller lru overhead you get after
> >> you have brought in all the pages.
> >> Linux kernel build test doesnt really get to benefit from the TLB optimization
> >> and smaller lru overhead, as probably the pages are very short lived. So I
> >> think it doesnt show the benefit of large folio swapin properly and
> >> large folio swapin should probably be disabled for this kind of workload,
> >> eventhough mTHP should be enabled.
> >
> > I'm not entirely sure if this applies to platforms without TLB
> > optimization, especially
> > in the absence of swap. In a memory-limited cgroup without swap, would
> > mTHP still
> > cause significant thrashing of file-backed folios? When a large swap
> > file is present,
> > the inability to swap in mTHP seems to act as a workaround for fragmentation,
> > allowing fragmented pages of the original mTHP from do_anonymous_page() to
> > remain in swap.
> >
> >>
> >> I am not sure that the approach we are trying in this patch is the right way:
> >> - This patch makes it a memcg issue, but you could have memcg disabled and
> >> then the mitigation being tried here wont apply.
> >> - Instead of this being a large folio swapin issue, is it more of a readahead
> >> issue? If we zswap (without the large folio swapin series) and change the window
> >> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> >> when cgroup memory is limited as readahead would probably cause swap thrashing as
> >> well.
> >> - Instead of looking at cgroup margin, maybe we should try and look at
> >> the rate of change of workingset_restore_anon? This might be a lot more complicated
> >> to do, but probably is the right metric to determine swap thrashing. It also means
> >> that this could be used in both the synchronous swapcache skipping path and
> >> swapin_readahead path.
> >> (Thanks Johannes for suggesting this)
> >>
> >> With the large folio swapin, I do see the large improvement when considering only
> >> swapin performance and latency in the same way as you saw in zram.
> >> Maybe the right short term approach is to have
> >> /sys/kernel/mm/transparent_hugepage/swapin
> >> and have that disabled by default to avoid regression.
> >
> > A crucial component is still missing—managing the compression and decompression
> > of multiple pages as a larger block. This could significantly reduce
> > system time and
> > potentially resolve the kernel build issue within a small memory
> > cgroup, even with
> > swap thrashing.
> >
> > I’ll send an update ASAP so you can rebase for zswap.
>
> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
> Thats wont benefit zswap, right?

That's right. I assume we can also make it work with zswap?

> I actually had a few questions about it. Mainly that the benefit comes if the
> pagefault happens on page 0 of the large folio. But if the page fault happens
> on any other page, lets say page 1 of a 64K folio. then it will decompress the
> entire 64K chunk and just copy page 1? (memcpy in zram_bvec_read_multi_pages_partial).
> Could that cause a regression as you have to decompress a large chunk for just
> getting 1 4K page?
> If we assume uniform distribution of page faults, maybe it could make things worse?
>
> I probably should ask all of this in that thread.

With mTHP swap-in, a page fault on any page behaves the same as a fault on
page 0. Without mTHP swap-in, there’s also no difference between
faults on page 0
and other pages.

A fault on any page means that the entire block is decompressed. The
only difference
is that we don’t partially copy one page when mTHP swap-in is present.

>
> >
> >> If the workload owner sees a benefit, they can enable it.
> >> I can add this when sending the next version of large folio zswapin if that makes
> >> sense?
> >> Longer term I can try and have a look at if we can do something with
> >> workingset_restore_anon to improve things.
> >>
> >> Thanks,
> >> Usama

Thanks
Barry
Usama Arif Oct. 30, 2024, 9 p.m. UTC | #9
On 30/10/2024 20:48, Barry Song wrote:
> On Thu, Oct 31, 2024 at 9:41 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 30/10/2024 20:27, Barry Song wrote:
>>> On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 28/10/2024 22:03, Barry Song wrote:
>>>>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 27/10/2024 01:14, Barry Song wrote:
>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>
>>>>>>> In a memcg where mTHP is always utilized, even at full capacity, it
>>>>>>> might not be the best option. Consider a system that uses only small
>>>>>>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
>>>>>>> of buffer space before it can initiate the next reclamation. However,
>>>>>>> large folios can quickly fill this space, rapidly bringing the memcg
>>>>>>> back to full capacity, even though some portions of the large folios
>>>>>>> may not be immediately needed and used by the process.
>>>>>>>
>>>>>>> Usama and Kanchana identified a regression when building the kernel in
>>>>>>> a memcg with memory.max set to a small value while enabling large
>>>>>>> folio swap-in support on zswap[1].
>>>>>>>
>>>>>>> The issue arises from an edge case where the memory cgroup remains
>>>>>>> nearly full most of the time. Consequently, bringing in mTHP can
>>>>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
>>>>>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
>>>>>>>
>>>>>>> We need a mechanism to stop the cup from overflowing continuously.
>>>>>>> One potential solution is to slow the filling process when we identify
>>>>>>> that the cup is nearly full.
>>>>>>>
>>>>>>> Usama reported an improvement when we mitigate mTHP swap-in as the
>>>>>>> memcg approaches full capacity[2]:
>>>>>>>
>>>>>>> int mem_cgroup_swapin_charge_folio(...)
>>>>>>> {
>>>>>>>       ...
>>>>>>>       if (folio_test_large(folio) &&
>>>>>>>           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
>>>>>>>               ret = -ENOMEM;
>>>>>>>       else
>>>>>>>               ret = charge_memcg(folio, memcg, gfp);
>>>>>>>       ...
>>>>>>> }
>>>>>>>
>>>>>>> AMD 16K+32K THP=always
>>>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
>>>>>>> real           1m23.038s        1m23.050s                                   1m22.704s
>>>>>>> user           53m57.210s       53m53.437s                                  53m52.577s
>>>>>>> sys            7m24.592s        7m48.843s                                   7m22.519s
>>>>>>> zswpin         612070           999244                                      815934
>>>>>>> zswpout        2226403          2347979                                     2054980
>>>>>>> pgfault        20667366         20481728                                    20478690
>>>>>>> pgmajfault     385887           269117                                      309702
>>>>>>>
>>>>>>> AMD 16K+32K+64K THP=always
>>>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
>>>>>>> real           1m22.975s        1m23.266s                                  1m22.549s
>>>>>>> user           53m51.302s       53m51.069s                                 53m46.471s
>>>>>>> sys            7m40.168s        7m57.104s                                  7m25.012s
>>>>>>> zswpin         676492           1258573                                    1225703
>>>>>>> zswpout        2449839          2714767                                    2899178
>>>>>>> pgfault        17540746         17296555                                   17234663
>>>>>>> pgmajfault     429629           307495                                     287859
>>>>>>>
>>>>>>> I wonder if we can extend the mitigation to do_anonymous_page() as
>>>>>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
>>>>>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
>>>>>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
>>>>>>> builds in a memcg with memory.max set to 1 GiB.
>>>>>>>
>>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>>>>>
>>>>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>>>>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
>>>>>>>
>>>>>>>             disable-mTHP-swapin  mm-unstable  with-this-patch
>>>>>>> Real:            6m54.595s      7m4.832s       6m45.811s
>>>>>>> User:            66m42.795s     66m59.984s     67m21.150s
>>>>>>> Sys:             12m7.092s      15m18.153s     12m52.644s
>>>>>>> pswpin:          4262327        11723248       5918690
>>>>>>> pswpout:         14883774       19574347       14026942
>>>>>>> 64k-swpout:      624447         889384         480039
>>>>>>> 32k-swpout:      115473         242288         73874
>>>>>>> 16k-swpout:      158203         294672         109142
>>>>>>> 64k-swpin:       0              495869         159061
>>>>>>> 32k-swpin:       0              219977         56158
>>>>>>> 16k-swpin:       0              223501         81445
>>>>>>>
>>>>>>
>>>>>
>>>>> Hi Usama,
>>>>>
>>>>>> hmm, both the user and sys time are worse with the patch compared to
>>>>>> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
>>>>>> repeat the experiment the real time might be worse as well?
>>>>>
>>>>> Well, I've improved my script to include a loop:
>>>>>
>>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>>>
>>>>> for ((i=1; i<=100; i++))
>>>>> do
>>>>>   echo "Executing round $i"
>>>>>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>>>>>   echo 3 > /proc/sys/vm/drop_caches
>>>>>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>>>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
>>>>>   cat /proc/vmstat | grep pswp
>>>>>   echo -n 64k-swpout: ; cat
>>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
>>>>>   echo -n 32k-swpout: ; cat
>>>>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
>>>>>   echo -n 16k-swpout: ; cat
>>>>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
>>>>>   echo -n 64k-swpin: ; cat
>>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
>>>>>   echo -n 32k-swpin: ; cat
>>>>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
>>>>>   echo -n 16k-swpin: ; cat
>>>>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
>>>>> done
>>>>>
>>>>> I've noticed that the user/sys/real time on my i9 machine fluctuates
>>>>> constantly, could be things
>>>>> like:
>>>>> real    6m52.087s
>>>>> user    67m12.463s
>>>>> sys     13m8.281s
>>>>> ...
>>>>>
>>>>> real    7m42.937s
>>>>> user    66m55.250s
>>>>> sys     12m56.330s
>>>>> ...
>>>>>
>>>>> real    6m49.374s
>>>>> user    66m37.040s
>>>>> sys     12m44.542s
>>>>> ...
>>>>>
>>>>> real    6m54.205s
>>>>> user    65m49.732s
>>>>> sys     11m33.078s
>>>>> ...
>>>>>
>>>>> likely due to unstable temperatures and I/O latency. As a result, my
>>>>> data doesn’t seem
>>>>> reference-worthy.
>>>>>
>>>>
>>>> So I had suggested retrying the experiment to see how reproducible it is,
>>>> but had not done that myself!
>>>> Thanks for sharing this. I tried many times on the AMD server and I see
>>>> varying numbers as well.
>>>>
>>>> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
>>>> real    1m28.351s
>>>> user    54m14.476s
>>>> sys     8m46.596s
>>>> zswpin 811693
>>>> zswpout 2137310
>>>> pgfault 27344671
>>>> pgmajfault 290510
>>>> ..
>>>> real    1m24.557s
>>>> user    53m56.815s
>>>> sys     8m10.200s
>>>> zswpin 571532
>>>> zswpout 1645063
>>>> pgfault 26989075
>>>> pgmajfault 205177
>>>> ..
>>>> real    1m26.083s
>>>> user    54m5.303s
>>>> sys     9m55.247s
>>>> zswpin 1176292
>>>> zswpout 2910825
>>>> pgfault 27286835
>>>> pgmajfault 419746
>>>>
>>>>
>>>> The sys time can especially vary by large numbers. I think you see the same.
>>>>
>>>>
>>>>> As a phone engineer, we never use phones to run kernel builds. I'm also
>>>>> quite certain that phones won't provide stable and reliable data for this
>>>>> type of workload. Without access to a Linux server to conduct the test,
>>>>> I really need your help.
>>>>>
>>>>> I used to work on optimizing the ARM server scheduler and memory
>>>>> management, and I really miss that machine I had until three years ago :-)
>>>>>
>>>>>>
>>>>>>> I need Usama's assistance to identify a suitable patch, as I lack
>>>>>>> access to hardware such as AMD machines and ARM servers with TLB
>>>>>>> optimization.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
>>>>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
>>>>>>>
>>>>>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>>>>>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> Cc: Chris Li <chrisl@kernel.org>
>>>>>>> Cc: Yosry Ahmed <yosryahmed@google.com>
>>>>>>> Cc: "Huang, Ying" <ying.huang@intel.com>
>>>>>>> Cc: Kairui Song <kasong@tencent.com>
>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>>>>>> Cc: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>> Cc: Muchun Song <muchun.song@linux.dev>
>>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>> ---
>>>>>>>  include/linux/memcontrol.h |  9 ++++++++
>>>>>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
>>>>>>>  mm/memory.c                | 17 ++++++++++++++
>>>>>>>  3 files changed, 71 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>>>>>> index 524006313b0d..8bcc8f4af39f 100644
>>>>>>> --- a/include/linux/memcontrol.h
>>>>>>> +++ b/include/linux/memcontrol.h
>>>>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>>>>>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>>>>>               long nr_pages);
>>>>>>>
>>>>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>>>>>> +                             swp_entry_t *entry);
>>>>>>> +
>>>>>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>>>>>>>                                 gfp_t gfp, swp_entry_t entry);
>>>>>>>
>>>>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
>>>>>>>       return 0;
>>>>>>>  }
>>>>>>>
>>>>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
>>>>>>> +             swp_entry_t *entry)
>>>>>>> +{
>>>>>>> +     return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
>>>>>>>                       struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
>>>>>>>  {
>>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>>> index 17af08367c68..f3d92b93ea6d 100644
>>>>>>> --- a/mm/memcontrol.c
>>>>>>> +++ b/mm/memcontrol.c
>>>>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>>>>>>>       return 0;
>>>>>>>  }
>>>>>>>
>>>>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
>>>>>>> +{
>>>>>>> +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>>>>>> +             if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
>>>>>>
>>>>>> There might be 3 issues with the approach:
>>>>>>
>>>>>> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have
>>>>>> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are
>>>>>> basically saying you need 512M free memory to swapin just 256K?
>>>>>
>>>>> Right, sorry for the noisy code. I was just thinking about 4KB pages
>>>>> and wondering
>>>>> if we could simplify the code.
>>>>>
>>>>>>
>>>>>> Its an uneven margin for different folio sizes.
>>>>>> For 16K folio swapin, you are checking if there is margin for 128 folios,
>>>>>> but for 1M folio swapin, you are checking there is margin for just 2 folios.
>>>>>>
>>>>>> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
>>>>>
>>>>> Agreed. This is similar to what we discussed regarding your zswap mTHP
>>>>> swap-in series:
>>>>>
>>>>>  int mem_cgroup_swapin_charge_folio(...)
>>>>>  {
>>>>>        ...
>>>>>        if (folio_test_large(folio) &&
>>>>>            mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
>>>>> folio_nr_pages(folio)))
>>>>>                ret = -ENOMEM;
>>>>>        else
>>>>>                ret = charge_memcg(folio, memcg, gfp);
>>>>>        ...
>>>>>  }
>>>>>
>>>>> As someone focused on phones, my challenge is the absence of stable platforms to
>>>>> benchmark this type of workload. If possible, Usama, I would greatly
>>>>> appreciate it if
>>>>> you could take the lead on the patch.
>>>>>
>>>>>>
>>>>>> As Johannes pointed out, the charging code already does the margin check.
>>>>>> So for 4K, the check just checks if there is 4K available, but for 16K it checks
>>>>>> if a lot more than 16K is available. Maybe there should be a similar policy for
>>>>>> all? I guess this is similar to my 2nd point, but just considers 4K folios as
>>>>>> well.
>>>>>
>>>>> I don't think the charging code performs a margin check. It simply
>>>>> tries to charge
>>>>> the specified nr_pages (whether 1 or more). If nr_pages are available,
>>>>> the charge
>>>>> proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to
>>>>> reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages.
>>>>>
>>>>
>>>> So if you have defrag not set to always, it will not trigger reclamation.
>>>> I think that is a bigger usecase, i.e. defrag=madvise,defer,etc is probably
>>>> used much more then always.
>>>>
>>>> In the current code in that case try_charge_memcg will return -ENOMEM all
>>>> the way to mem_cgroup_swapin_charge_folio and alloc_swap_folio will then
>>>> try the next order. So eventhough it might not be calling the mem_cgroup_margin
>>>> function, it is kind of is doing the same?
>>>>
>>>>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
>>>>> large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
>>>>> immediately filling the memcg.
>>>>>
>>>>> Shortly after, smaller folios—typically with blockable GFP—will quickly trigger
>>>>> additional reclamation. While nr_pages - 1 subpages of the large folio may not
>>>>> be immediately needed, they still occupy enough space to fill the memcg to
>>>>> capacity.
>>>>>
>>>>> My second point about the mitigation is as follows: For a system (or
>>>>> memcg) under severe memory pressure, especially one without hardware TLB
>>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>>>> of optimization. Could the mitigation code help in automatically tuning
>>>>> this fragmentation?
>>>>>
>>>>
>>>> I agree with the point that enabling mTHP always is not the right thing to do
>>>> on all platforms. I also think it might be the case that enabling mTHP
>>>> might be a good thing for some workloads, but enabling mTHP swapin along with
>>>> it might not.
>>>>
>>>> As you said when you have apps switching between foreground and background
>>>> in android, it probably makes sense to have large folio swapping, as you
>>>> want to bringin all the pages from background app as quickly as possible.
>>>> And also all the TLB optimizations and smaller lru overhead you get after
>>>> you have brought in all the pages.
>>>> Linux kernel build test doesnt really get to benefit from the TLB optimization
>>>> and smaller lru overhead, as probably the pages are very short lived. So I
>>>> think it doesnt show the benefit of large folio swapin properly and
>>>> large folio swapin should probably be disabled for this kind of workload,
>>>> eventhough mTHP should be enabled.
>>>
>>> I'm not entirely sure if this applies to platforms without TLB
>>> optimization, especially
>>> in the absence of swap. In a memory-limited cgroup without swap, would
>>> mTHP still
>>> cause significant thrashing of file-backed folios? When a large swap
>>> file is present,
>>> the inability to swap in mTHP seems to act as a workaround for fragmentation,
>>> allowing fragmented pages of the original mTHP from do_anonymous_page() to
>>> remain in swap.
>>>
>>>>
>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>>>> then the mitigation being tried here wont apply.
>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>> issue? If we zswap (without the large folio swapin series) and change the window
>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>>>> well.
>>>> - Instead of looking at cgroup margin, maybe we should try and look at
>>>> the rate of change of workingset_restore_anon? This might be a lot more complicated
>>>> to do, but probably is the right metric to determine swap thrashing. It also means
>>>> that this could be used in both the synchronous swapcache skipping path and
>>>> swapin_readahead path.
>>>> (Thanks Johannes for suggesting this)
>>>>
>>>> With the large folio swapin, I do see the large improvement when considering only
>>>> swapin performance and latency in the same way as you saw in zram.
>>>> Maybe the right short term approach is to have
>>>> /sys/kernel/mm/transparent_hugepage/swapin
>>>> and have that disabled by default to avoid regression.
>>>
>>> A crucial component is still missing—managing the compression and decompression
>>> of multiple pages as a larger block. This could significantly reduce
>>> system time and
>>> potentially resolve the kernel build issue within a small memory
>>> cgroup, even with
>>> swap thrashing.
>>>
>>> I’ll send an update ASAP so you can rebase for zswap.
>>
>> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
>> Thats wont benefit zswap, right?
> 
> That's right. I assume we can also make it work with zswap?

Hopefully yes. Thats mainly why I was looking at that series, to try and find
a way to do something similar for zswap.
> 
>> I actually had a few questions about it. Mainly that the benefit comes if the
>> pagefault happens on page 0 of the large folio. But if the page fault happens
>> on any other page, lets say page 1 of a 64K folio. then it will decompress the
>> entire 64K chunk and just copy page 1? (memcpy in zram_bvec_read_multi_pages_partial).
>> Could that cause a regression as you have to decompress a large chunk for just
>> getting 1 4K page?
>> If we assume uniform distribution of page faults, maybe it could make things worse?
>>
>> I probably should ask all of this in that thread.
> 
> With mTHP swap-in, a page fault on any page behaves the same as a fault on
> page 0. Without mTHP swap-in, there’s also no difference between
> faults on page 0
> and other pages.

Ah ok, its because of the ALIGN_DOWN in
https://elixir.bootlin.com/linux/v6.12-rc5/source/mm/memory.c#L4158,
right?
> 
> A fault on any page means that the entire block is decompressed. The
> only difference
> is that we don’t partially copy one page when mTHP swap-in is present.
> 
Ah so zram_bvec_read_multi_pages_partial would be called only
if someone swaps out mTHP, disables it and then tries to do swapin?

Thanks 

>>
>>>
>>>> If the workload owner sees a benefit, they can enable it.
>>>> I can add this when sending the next version of large folio zswapin if that makes
>>>> sense?
>>>> Longer term I can try and have a look at if we can do something with
>>>> workingset_restore_anon to improve things.
>>>>
>>>> Thanks,
>>>> Usama
> 
> Thanks
> Barry
Yosry Ahmed Oct. 30, 2024, 9:01 p.m. UTC | #10
On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 30/10/2024 19:51, Yosry Ahmed wrote:
> > [..]
> >>> My second point about the mitigation is as follows: For a system (or
> >>> memcg) under severe memory pressure, especially one without hardware TLB
> >>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> >>> a larger granularity, some internal fragmentation is unavoidable, regardless
> >>> of optimization. Could the mitigation code help in automatically tuning
> >>> this fragmentation?
> >>>
> >>
> >> I agree with the point that enabling mTHP always is not the right thing to do
> >> on all platforms. I also think it might be the case that enabling mTHP
> >> might be a good thing for some workloads, but enabling mTHP swapin along with
> >> it might not.
> >>
> >> As you said when you have apps switching between foreground and background
> >> in android, it probably makes sense to have large folio swapping, as you
> >> want to bringin all the pages from background app as quickly as possible.
> >> And also all the TLB optimizations and smaller lru overhead you get after
> >> you have brought in all the pages.
> >> Linux kernel build test doesnt really get to benefit from the TLB optimization
> >> and smaller lru overhead, as probably the pages are very short lived. So I
> >> think it doesnt show the benefit of large folio swapin properly and
> >> large folio swapin should probably be disabled for this kind of workload,
> >> eventhough mTHP should be enabled.
> >>
> >> I am not sure that the approach we are trying in this patch is the right way:
> >> - This patch makes it a memcg issue, but you could have memcg disabled and
> >> then the mitigation being tried here wont apply.
> >
> > Is the problem reproducible without memcg? I imagine only if the
> > entire system is under memory pressure. I guess we would want the same
> > "mitigation" either way.
> >
> What would be a good open source benchmark/workload to test without limiting memory
> in memcg?
> For the kernel build test, I can only get zswap activity to happen if I build
> in cgroup and limit memory.max.

You mean a benchmark that puts the entire system under memory
pressure? I am not sure, it ultimately depends on the size of memory
you have, among other factors.

What if you run the kernel build test in a VM? Then you can limit is
size like a memcg, although you'd probably need to leave more room
because the entire guest OS will also subject to the same limit.

>
> I can just run zswap large folio zswapin in production and see, but that will take me a few
> days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
> then maybe its not really an issue? I believe Barry doesnt see an issue in android
> phones (but please correct me if I am wrong), and if there isnt an issue in Meta
> production as well, its a good data point for servers as well. And maybe
> kernel build in 4G memcg is not a good test.

If there is a regression in the kernel build, this means some
workloads may be affected, even if Meta's prod isn't. I understand
that the benchmark is not very representative of real world workloads,
but in this instance I think the thrashing problem surfaced by the
benchmark is real.

>
> >> - Instead of this being a large folio swapin issue, is it more of a readahead
> >> issue? If we zswap (without the large folio swapin series) and change the window
> >> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> >> when cgroup memory is limited as readahead would probably cause swap thrashing as
> >> well.
> >
> > I think large folio swapin would make the problem worse anyway. I am
> > also not sure if the readahead window adjusts on memory pressure or
> > not.
> >
> readahead window doesnt look at memory pressure. So maybe the same thing is being
> seen here as there would be in swapin_readahead?

Maybe readahead is not as aggressive in general as large folio
swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
of the window is the smaller of page_cluster (2 or 3) and
SWAP_RA_ORDER_CEILING (5).

Also readahead will swapin 4k folios AFAICT, so we don't need a
contiguous allocation like large folio swapin. So that could be
another factor why readahead may not reproduce the problem.

> Maybe if we check kernel build test
> performance in 4G memcg with below diff, it might get better?

I think you can use the page_cluster tunable to do this at runtime.

>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4669f29cf555..9e196e1e6885 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>         pgoff_t ilx;
>         bool page_allocated;
>
> -       win = swap_vma_ra_win(vmf, &start, &end);
> +       win = 1;
>         if (win == 1)
>                 goto skip;
>
Barry Song Oct. 30, 2024, 9:08 p.m. UTC | #11
On Thu, Oct 31, 2024 at 10:00 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 30/10/2024 20:48, Barry Song wrote:
> > On Thu, Oct 31, 2024 at 9:41 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 30/10/2024 20:27, Barry Song wrote:
> >>> On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 28/10/2024 22:03, Barry Song wrote:
> >>>>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 27/10/2024 01:14, Barry Song wrote:
> >>>>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>>>
> >>>>>>> In a memcg where mTHP is always utilized, even at full capacity, it
> >>>>>>> might not be the best option. Consider a system that uses only small
> >>>>>>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX
> >>>>>>> of buffer space before it can initiate the next reclamation. However,
> >>>>>>> large folios can quickly fill this space, rapidly bringing the memcg
> >>>>>>> back to full capacity, even though some portions of the large folios
> >>>>>>> may not be immediately needed and used by the process.
> >>>>>>>
> >>>>>>> Usama and Kanchana identified a regression when building the kernel in
> >>>>>>> a memcg with memory.max set to a small value while enabling large
> >>>>>>> folio swap-in support on zswap[1].
> >>>>>>>
> >>>>>>> The issue arises from an edge case where the memory cgroup remains
> >>>>>>> nearly full most of the time. Consequently, bringing in mTHP can
> >>>>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
> >>>>>>> swap-in then recreates the overflow, resulting in a repetitive cycle.
> >>>>>>>
> >>>>>>> We need a mechanism to stop the cup from overflowing continuously.
> >>>>>>> One potential solution is to slow the filling process when we identify
> >>>>>>> that the cup is nearly full.
> >>>>>>>
> >>>>>>> Usama reported an improvement when we mitigate mTHP swap-in as the
> >>>>>>> memcg approaches full capacity[2]:
> >>>>>>>
> >>>>>>> int mem_cgroup_swapin_charge_folio(...)
> >>>>>>> {
> >>>>>>>       ...
> >>>>>>>       if (folio_test_large(folio) &&
> >>>>>>>           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> >>>>>>>               ret = -ENOMEM;
> >>>>>>>       else
> >>>>>>>               ret = charge_memcg(folio, memcg, gfp);
> >>>>>>>       ...
> >>>>>>> }
> >>>>>>>
> >>>>>>> AMD 16K+32K THP=always
> >>>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
> >>>>>>> real           1m23.038s        1m23.050s                                   1m22.704s
> >>>>>>> user           53m57.210s       53m53.437s                                  53m52.577s
> >>>>>>> sys            7m24.592s        7m48.843s                                   7m22.519s
> >>>>>>> zswpin         612070           999244                                      815934
> >>>>>>> zswpout        2226403          2347979                                     2054980
> >>>>>>> pgfault        20667366         20481728                                    20478690
> >>>>>>> pgmajfault     385887           269117                                      309702
> >>>>>>>
> >>>>>>> AMD 16K+32K+64K THP=always
> >>>>>>> metric         mm-unstable      mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> >>>>>>> real           1m22.975s        1m23.266s                                  1m22.549s
> >>>>>>> user           53m51.302s       53m51.069s                                 53m46.471s
> >>>>>>> sys            7m40.168s        7m57.104s                                  7m25.012s
> >>>>>>> zswpin         676492           1258573                                    1225703
> >>>>>>> zswpout        2449839          2714767                                    2899178
> >>>>>>> pgfault        17540746         17296555                                   17234663
> >>>>>>> pgmajfault     429629           307495                                     287859
> >>>>>>>
> >>>>>>> I wonder if we can extend the mitigation to do_anonymous_page() as
> >>>>>>> well. Without hardware like AMD and ARM with hardware TLB coalescing
> >>>>>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with
> >>>>>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel
> >>>>>>> builds in a memcg with memory.max set to 1 GiB.
> >>>>>>>
> >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>>>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>>>>>
> >>>>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>>>>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null
> >>>>>>>
> >>>>>>>             disable-mTHP-swapin  mm-unstable  with-this-patch
> >>>>>>> Real:            6m54.595s      7m4.832s       6m45.811s
> >>>>>>> User:            66m42.795s     66m59.984s     67m21.150s
> >>>>>>> Sys:             12m7.092s      15m18.153s     12m52.644s
> >>>>>>> pswpin:          4262327        11723248       5918690
> >>>>>>> pswpout:         14883774       19574347       14026942
> >>>>>>> 64k-swpout:      624447         889384         480039
> >>>>>>> 32k-swpout:      115473         242288         73874
> >>>>>>> 16k-swpout:      158203         294672         109142
> >>>>>>> 64k-swpin:       0              495869         159061
> >>>>>>> 32k-swpin:       0              219977         56158
> >>>>>>> 16k-swpin:       0              223501         81445
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> Hi Usama,
> >>>>>
> >>>>>> hmm, both the user and sys time are worse with the patch compared to
> >>>>>> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you
> >>>>>> repeat the experiment the real time might be worse as well?
> >>>>>
> >>>>> Well, I've improved my script to include a loop:
> >>>>>
> >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>>>
> >>>>> for ((i=1; i<=100; i++))
> >>>>> do
> >>>>>   echo "Executing round $i"
> >>>>>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >>>>>   echo 3 > /proc/sys/vm/drop_caches
> >>>>>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>>>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
> >>>>>   cat /proc/vmstat | grep pswp
> >>>>>   echo -n 64k-swpout: ; cat
> >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
> >>>>>   echo -n 32k-swpout: ; cat
> >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
> >>>>>   echo -n 16k-swpout: ; cat
> >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
> >>>>>   echo -n 64k-swpin: ; cat
> >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
> >>>>>   echo -n 32k-swpin: ; cat
> >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
> >>>>>   echo -n 16k-swpin: ; cat
> >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
> >>>>> done
> >>>>>
> >>>>> I've noticed that the user/sys/real time on my i9 machine fluctuates
> >>>>> constantly, could be things
> >>>>> like:
> >>>>> real    6m52.087s
> >>>>> user    67m12.463s
> >>>>> sys     13m8.281s
> >>>>> ...
> >>>>>
> >>>>> real    7m42.937s
> >>>>> user    66m55.250s
> >>>>> sys     12m56.330s
> >>>>> ...
> >>>>>
> >>>>> real    6m49.374s
> >>>>> user    66m37.040s
> >>>>> sys     12m44.542s
> >>>>> ...
> >>>>>
> >>>>> real    6m54.205s
> >>>>> user    65m49.732s
> >>>>> sys     11m33.078s
> >>>>> ...
> >>>>>
> >>>>> likely due to unstable temperatures and I/O latency. As a result, my
> >>>>> data doesn’t seem
> >>>>> reference-worthy.
> >>>>>
> >>>>
> >>>> So I had suggested retrying the experiment to see how reproducible it is,
> >>>> but had not done that myself!
> >>>> Thanks for sharing this. I tried many times on the AMD server and I see
> >>>> varying numbers as well.
> >>>>
> >>>> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
> >>>> real    1m28.351s
> >>>> user    54m14.476s
> >>>> sys     8m46.596s
> >>>> zswpin 811693
> >>>> zswpout 2137310
> >>>> pgfault 27344671
> >>>> pgmajfault 290510
> >>>> ..
> >>>> real    1m24.557s
> >>>> user    53m56.815s
> >>>> sys     8m10.200s
> >>>> zswpin 571532
> >>>> zswpout 1645063
> >>>> pgfault 26989075
> >>>> pgmajfault 205177
> >>>> ..
> >>>> real    1m26.083s
> >>>> user    54m5.303s
> >>>> sys     9m55.247s
> >>>> zswpin 1176292
> >>>> zswpout 2910825
> >>>> pgfault 27286835
> >>>> pgmajfault 419746
> >>>>
> >>>>
> >>>> The sys time can especially vary by large numbers. I think you see the same.
> >>>>
> >>>>
> >>>>> As a phone engineer, we never use phones to run kernel builds. I'm also
> >>>>> quite certain that phones won't provide stable and reliable data for this
> >>>>> type of workload. Without access to a Linux server to conduct the test,
> >>>>> I really need your help.
> >>>>>
> >>>>> I used to work on optimizing the ARM server scheduler and memory
> >>>>> management, and I really miss that machine I had until three years ago :-)
> >>>>>
> >>>>>>
> >>>>>>> I need Usama's assistance to identify a suitable patch, as I lack
> >>>>>>> access to hardware such as AMD machines and ARM servers with TLB
> >>>>>>> optimization.
> >>>>>>>
> >>>>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
> >>>>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
> >>>>>>>
> >>>>>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >>>>>>> Cc: Usama Arif <usamaarif642@gmail.com>
> >>>>>>> Cc: David Hildenbrand <david@redhat.com>
> >>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>>>>> Cc: Chris Li <chrisl@kernel.org>
> >>>>>>> Cc: Yosry Ahmed <yosryahmed@google.com>
> >>>>>>> Cc: "Huang, Ying" <ying.huang@intel.com>
> >>>>>>> Cc: Kairui Song <kasong@tencent.com>
> >>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >>>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >>>>>>> Cc: Michal Hocko <mhocko@kernel.org>
> >>>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> >>>>>>> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> >>>>>>> Cc: Muchun Song <muchun.song@linux.dev>
> >>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>>> ---
> >>>>>>>  include/linux/memcontrol.h |  9 ++++++++
> >>>>>>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
> >>>>>>>  mm/memory.c                | 17 ++++++++++++++
> >>>>>>>  3 files changed, 71 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >>>>>>> index 524006313b0d..8bcc8f4af39f 100644
> >>>>>>> --- a/include/linux/memcontrol.h
> >>>>>>> +++ b/include/linux/memcontrol.h
> >>>>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> >>>>>>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>>>>>               long nr_pages);
> >>>>>>>
> >>>>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>>>>>> +                             swp_entry_t *entry);
> >>>>>>> +
> >>>>>>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>>>>>>                                 gfp_t gfp, swp_entry_t entry);
> >>>>>>>
> >>>>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> >>>>>>>       return 0;
> >>>>>>>  }
> >>>>>>>
> >>>>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>>>>>> +             swp_entry_t *entry)
> >>>>>>> +{
> >>>>>>> +     return 0;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> >>>>>>>                       struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> >>>>>>>  {
> >>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>>>> index 17af08367c68..f3d92b93ea6d 100644
> >>>>>>> --- a/mm/memcontrol.c
> >>>>>>> +++ b/mm/memcontrol.c
> >>>>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>>>>>       return 0;
> >>>>>>>  }
> >>>>>>>
> >>>>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
> >>>>>>> +{
> >>>>>>> +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> >>>>>>> +             if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
> >>>>>>
> >>>>>> There might be 3 issues with the approach:
> >>>>>>
> >>>>>> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have
> >>>>>> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are
> >>>>>> basically saying you need 512M free memory to swapin just 256K?
> >>>>>
> >>>>> Right, sorry for the noisy code. I was just thinking about 4KB pages
> >>>>> and wondering
> >>>>> if we could simplify the code.
> >>>>>
> >>>>>>
> >>>>>> Its an uneven margin for different folio sizes.
> >>>>>> For 16K folio swapin, you are checking if there is margin for 128 folios,
> >>>>>> but for 1M folio swapin, you are checking there is margin for just 2 folios.
> >>>>>>
> >>>>>> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
> >>>>>
> >>>>> Agreed. This is similar to what we discussed regarding your zswap mTHP
> >>>>> swap-in series:
> >>>>>
> >>>>>  int mem_cgroup_swapin_charge_folio(...)
> >>>>>  {
> >>>>>        ...
> >>>>>        if (folio_test_large(folio) &&
> >>>>>            mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
> >>>>> folio_nr_pages(folio)))
> >>>>>                ret = -ENOMEM;
> >>>>>        else
> >>>>>                ret = charge_memcg(folio, memcg, gfp);
> >>>>>        ...
> >>>>>  }
> >>>>>
> >>>>> As someone focused on phones, my challenge is the absence of stable platforms to
> >>>>> benchmark this type of workload. If possible, Usama, I would greatly
> >>>>> appreciate it if
> >>>>> you could take the lead on the patch.
> >>>>>
> >>>>>>
> >>>>>> As Johannes pointed out, the charging code already does the margin check.
> >>>>>> So for 4K, the check just checks if there is 4K available, but for 16K it checks
> >>>>>> if a lot more than 16K is available. Maybe there should be a similar policy for
> >>>>>> all? I guess this is similar to my 2nd point, but just considers 4K folios as
> >>>>>> well.
> >>>>>
> >>>>> I don't think the charging code performs a margin check. It simply
> >>>>> tries to charge
> >>>>> the specified nr_pages (whether 1 or more). If nr_pages are available,
> >>>>> the charge
> >>>>> proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to
> >>>>> reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages.
> >>>>>
> >>>>
> >>>> So if you have defrag not set to always, it will not trigger reclamation.
> >>>> I think that is a bigger usecase, i.e. defrag=madvise,defer,etc is probably
> >>>> used much more then always.
> >>>>
> >>>> In the current code in that case try_charge_memcg will return -ENOMEM all
> >>>> the way to mem_cgroup_swapin_charge_folio and alloc_swap_folio will then
> >>>> try the next order. So eventhough it might not be calling the mem_cgroup_margin
> >>>> function, it is kind of is doing the same?
> >>>>
> >>>>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
> >>>>> large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
> >>>>> immediately filling the memcg.
> >>>>>
> >>>>> Shortly after, smaller folios—typically with blockable GFP—will quickly trigger
> >>>>> additional reclamation. While nr_pages - 1 subpages of the large folio may not
> >>>>> be immediately needed, they still occupy enough space to fill the memcg to
> >>>>> capacity.
> >>>>>
> >>>>> My second point about the mitigation is as follows: For a system (or
> >>>>> memcg) under severe memory pressure, especially one without hardware TLB
> >>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> >>>>> a larger granularity, some internal fragmentation is unavoidable, regardless
> >>>>> of optimization. Could the mitigation code help in automatically tuning
> >>>>> this fragmentation?
> >>>>>
> >>>>
> >>>> I agree with the point that enabling mTHP always is not the right thing to do
> >>>> on all platforms. I also think it might be the case that enabling mTHP
> >>>> might be a good thing for some workloads, but enabling mTHP swapin along with
> >>>> it might not.
> >>>>
> >>>> As you said when you have apps switching between foreground and background
> >>>> in android, it probably makes sense to have large folio swapping, as you
> >>>> want to bringin all the pages from background app as quickly as possible.
> >>>> And also all the TLB optimizations and smaller lru overhead you get after
> >>>> you have brought in all the pages.
> >>>> Linux kernel build test doesnt really get to benefit from the TLB optimization
> >>>> and smaller lru overhead, as probably the pages are very short lived. So I
> >>>> think it doesnt show the benefit of large folio swapin properly and
> >>>> large folio swapin should probably be disabled for this kind of workload,
> >>>> eventhough mTHP should be enabled.
> >>>
> >>> I'm not entirely sure if this applies to platforms without TLB
> >>> optimization, especially
> >>> in the absence of swap. In a memory-limited cgroup without swap, would
> >>> mTHP still
> >>> cause significant thrashing of file-backed folios? When a large swap
> >>> file is present,
> >>> the inability to swap in mTHP seems to act as a workaround for fragmentation,
> >>> allowing fragmented pages of the original mTHP from do_anonymous_page() to
> >>> remain in swap.
> >>>
> >>>>
> >>>> I am not sure that the approach we are trying in this patch is the right way:
> >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> >>>> then the mitigation being tried here wont apply.
> >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> >>>> issue? If we zswap (without the large folio swapin series) and change the window
> >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> >>>> well.
> >>>> - Instead of looking at cgroup margin, maybe we should try and look at
> >>>> the rate of change of workingset_restore_anon? This might be a lot more complicated
> >>>> to do, but probably is the right metric to determine swap thrashing. It also means
> >>>> that this could be used in both the synchronous swapcache skipping path and
> >>>> swapin_readahead path.
> >>>> (Thanks Johannes for suggesting this)
> >>>>
> >>>> With the large folio swapin, I do see the large improvement when considering only
> >>>> swapin performance and latency in the same way as you saw in zram.
> >>>> Maybe the right short term approach is to have
> >>>> /sys/kernel/mm/transparent_hugepage/swapin
> >>>> and have that disabled by default to avoid regression.
> >>>
> >>> A crucial component is still missing—managing the compression and decompression
> >>> of multiple pages as a larger block. This could significantly reduce
> >>> system time and
> >>> potentially resolve the kernel build issue within a small memory
> >>> cgroup, even with
> >>> swap thrashing.
> >>>
> >>> I’ll send an update ASAP so you can rebase for zswap.
> >>
> >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
> >> Thats wont benefit zswap, right?
> >
> > That's right. I assume we can also make it work with zswap?
>
> Hopefully yes. Thats mainly why I was looking at that series, to try and find
> a way to do something similar for zswap.
> >
> >> I actually had a few questions about it. Mainly that the benefit comes if the
> >> pagefault happens on page 0 of the large folio. But if the page fault happens
> >> on any other page, lets say page 1 of a 64K folio. then it will decompress the
> >> entire 64K chunk and just copy page 1? (memcpy in zram_bvec_read_multi_pages_partial).
> >> Could that cause a regression as you have to decompress a large chunk for just
> >> getting 1 4K page?
> >> If we assume uniform distribution of page faults, maybe it could make things worse?
> >>
> >> I probably should ask all of this in that thread.
> >
> > With mTHP swap-in, a page fault on any page behaves the same as a fault on
> > page 0. Without mTHP swap-in, there’s also no difference between
> > faults on page 0
> > and other pages.
>
> Ah ok, its because of the ALIGN_DOWN in
> https://elixir.bootlin.com/linux/v6.12-rc5/source/mm/memory.c#L4158,
> right?

right.

> >
> > A fault on any page means that the entire block is decompressed. The
> > only difference
> > is that we don’t partially copy one page when mTHP swap-in is present.
> >
> Ah so zram_bvec_read_multi_pages_partial would be called only
> if someone swaps out mTHP, disables it and then tries to do swapin?
>

For example, if the block contains 16KB of original data but only 4KB
is swapped in without mTHP swap-in, this means we decompress the
entire 16KB while only copying a portion of it to do_swap_page(). So
likely compression/ decompression of large blocks without mTHP
swapin can make things worse though it brings higher compression
ratio.

> Thanks
>
> >>
> >>>
> >>>> If the workload owner sees a benefit, they can enable it.
> >>>> I can add this when sending the next version of large folio zswapin if that makes
> >>>> sense?
> >>>> Longer term I can try and have a look at if we can do something with
> >>>> workingset_restore_anon to improve things.
> >>>>
> >>>> Thanks,
> >>>> Usama
> >

Thanks
Barry
Yosry Ahmed Oct. 30, 2024, 9:10 p.m. UTC | #12
[..]
> >>> A crucial component is still missing—managing the compression and decompression
> >>> of multiple pages as a larger block. This could significantly reduce
> >>> system time and
> >>> potentially resolve the kernel build issue within a small memory
> >>> cgroup, even with
> >>> swap thrashing.
> >>>
> >>> I’ll send an update ASAP so you can rebase for zswap.
> >>
> >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
> >> Thats wont benefit zswap, right?
> >
> > That's right. I assume we can also make it work with zswap?
>
> Hopefully yes. Thats mainly why I was looking at that series, to try and find
> a way to do something similar for zswap.

I would prefer for these things to be done separately. We still need
to evaluate the compression/decompression of large blocks. I am mainly
concerned about having to decompress a large chunk to fault in one
page.

The obvious problems are fault latency, and wasted work having to
consistently decompress the large chunk to take one page from it. We
also need to decide if we'd rather split it after decompression and
compress the parts that we didn't swap in separately.

This can cause problems beyond the fault latency. Imagine the case
where the system is under memory pressure, so we fallback to order-0
swapin to avoid reclaim. Now we want to decompress a chunk that used
to be 64K.

We need to allocate 64K of contiguous memory for a temporary
allocation to be able to fault a 4K page. Now we either need to:
- Go into reclaim, which we were trying to avoid to begin with.
- Dip into reserves to allocate the 64K as it's a temporary
allocation. This is probably risky because under memory pressure, many
CPUs may be doing this concurrently.
Usama Arif Oct. 30, 2024, 9:13 p.m. UTC | #13
On 30/10/2024 21:01, Yosry Ahmed wrote:
> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 30/10/2024 19:51, Yosry Ahmed wrote:
>>> [..]
>>>>> My second point about the mitigation is as follows: For a system (or
>>>>> memcg) under severe memory pressure, especially one without hardware TLB
>>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
>>>>> a larger granularity, some internal fragmentation is unavoidable, regardless
>>>>> of optimization. Could the mitigation code help in automatically tuning
>>>>> this fragmentation?
>>>>>
>>>>
>>>> I agree with the point that enabling mTHP always is not the right thing to do
>>>> on all platforms. I also think it might be the case that enabling mTHP
>>>> might be a good thing for some workloads, but enabling mTHP swapin along with
>>>> it might not.
>>>>
>>>> As you said when you have apps switching between foreground and background
>>>> in android, it probably makes sense to have large folio swapping, as you
>>>> want to bringin all the pages from background app as quickly as possible.
>>>> And also all the TLB optimizations and smaller lru overhead you get after
>>>> you have brought in all the pages.
>>>> Linux kernel build test doesnt really get to benefit from the TLB optimization
>>>> and smaller lru overhead, as probably the pages are very short lived. So I
>>>> think it doesnt show the benefit of large folio swapin properly and
>>>> large folio swapin should probably be disabled for this kind of workload,
>>>> eventhough mTHP should be enabled.
>>>>
>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>>>> then the mitigation being tried here wont apply.
>>>
>>> Is the problem reproducible without memcg? I imagine only if the
>>> entire system is under memory pressure. I guess we would want the same
>>> "mitigation" either way.
>>>
>> What would be a good open source benchmark/workload to test without limiting memory
>> in memcg?
>> For the kernel build test, I can only get zswap activity to happen if I build
>> in cgroup and limit memory.max.
> 
> You mean a benchmark that puts the entire system under memory
> pressure? I am not sure, it ultimately depends on the size of memory
> you have, among other factors.
> 
> What if you run the kernel build test in a VM? Then you can limit is
> size like a memcg, although you'd probably need to leave more room
> because the entire guest OS will also subject to the same limit.
> 

I had tried this, but the variance in time/zswap numbers was very high.
Much higher than the AMD numbers I posted in reply to Barry. So found
it very difficult to make comparison.

>>
>> I can just run zswap large folio zswapin in production and see, but that will take me a few
>> days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
>> then maybe its not really an issue? I believe Barry doesnt see an issue in android
>> phones (but please correct me if I am wrong), and if there isnt an issue in Meta
>> production as well, its a good data point for servers as well. And maybe
>> kernel build in 4G memcg is not a good test.
> 
> If there is a regression in the kernel build, this means some
> workloads may be affected, even if Meta's prod isn't. I understand
> that the benchmark is not very representative of real world workloads,
> but in this instance I think the thrashing problem surfaced by the
> benchmark is real.
> 
>>
>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>> issue? If we zswap (without the large folio swapin series) and change the window
>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>>>> well.
>>>
>>> I think large folio swapin would make the problem worse anyway. I am
>>> also not sure if the readahead window adjusts on memory pressure or
>>> not.
>>>
>> readahead window doesnt look at memory pressure. So maybe the same thing is being
>> seen here as there would be in swapin_readahead?
> 
> Maybe readahead is not as aggressive in general as large folio
> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> of the window is the smaller of page_cluster (2 or 3) and
> SWAP_RA_ORDER_CEILING (5).
Yes, I was seeing 8 pages swapin (order 3) when testing. So might
be similar to enabling 32K mTHP?

> 
> Also readahead will swapin 4k folios AFAICT, so we don't need a
> contiguous allocation like large folio swapin. So that could be
> another factor why readahead may not reproduce the problem.
> 
>> Maybe if we check kernel build test
>> performance in 4G memcg with below diff, it might get better?
> 
> I think you can use the page_cluster tunable to do this at runtime.
> 
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 4669f29cf555..9e196e1e6885 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>>         pgoff_t ilx;
>>         bool page_allocated;
>>
>> -       win = swap_vma_ra_win(vmf, &start, &end);
>> +       win = 1;
>>         if (win == 1)
>>                 goto skip;
>>
Yosry Ahmed Oct. 30, 2024, 9:18 p.m. UTC | #14
On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 30/10/2024 21:01, Yosry Ahmed wrote:
> > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 30/10/2024 19:51, Yosry Ahmed wrote:
> >>> [..]
> >>>>> My second point about the mitigation is as follows: For a system (or
> >>>>> memcg) under severe memory pressure, especially one without hardware TLB
> >>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> >>>>> a larger granularity, some internal fragmentation is unavoidable, regardless
> >>>>> of optimization. Could the mitigation code help in automatically tuning
> >>>>> this fragmentation?
> >>>>>
> >>>>
> >>>> I agree with the point that enabling mTHP always is not the right thing to do
> >>>> on all platforms. I also think it might be the case that enabling mTHP
> >>>> might be a good thing for some workloads, but enabling mTHP swapin along with
> >>>> it might not.
> >>>>
> >>>> As you said when you have apps switching between foreground and background
> >>>> in android, it probably makes sense to have large folio swapping, as you
> >>>> want to bringin all the pages from background app as quickly as possible.
> >>>> And also all the TLB optimizations and smaller lru overhead you get after
> >>>> you have brought in all the pages.
> >>>> Linux kernel build test doesnt really get to benefit from the TLB optimization
> >>>> and smaller lru overhead, as probably the pages are very short lived. So I
> >>>> think it doesnt show the benefit of large folio swapin properly and
> >>>> large folio swapin should probably be disabled for this kind of workload,
> >>>> eventhough mTHP should be enabled.
> >>>>
> >>>> I am not sure that the approach we are trying in this patch is the right way:
> >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> >>>> then the mitigation being tried here wont apply.
> >>>
> >>> Is the problem reproducible without memcg? I imagine only if the
> >>> entire system is under memory pressure. I guess we would want the same
> >>> "mitigation" either way.
> >>>
> >> What would be a good open source benchmark/workload to test without limiting memory
> >> in memcg?
> >> For the kernel build test, I can only get zswap activity to happen if I build
> >> in cgroup and limit memory.max.
> >
> > You mean a benchmark that puts the entire system under memory
> > pressure? I am not sure, it ultimately depends on the size of memory
> > you have, among other factors.
> >
> > What if you run the kernel build test in a VM? Then you can limit is
> > size like a memcg, although you'd probably need to leave more room
> > because the entire guest OS will also subject to the same limit.
> >
>
> I had tried this, but the variance in time/zswap numbers was very high.
> Much higher than the AMD numbers I posted in reply to Barry. So found
> it very difficult to make comparison.

Hmm yeah maybe more factors come into play with global memory
pressure. I am honestly not sure how to test this scenario, and I
suspect variance will be high anyway.

We can just try to use whatever technique we use for the memcg limit
though, if possible, right?

>
> >>
> >> I can just run zswap large folio zswapin in production and see, but that will take me a few
> >> days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
> >> then maybe its not really an issue? I believe Barry doesnt see an issue in android
> >> phones (but please correct me if I am wrong), and if there isnt an issue in Meta
> >> production as well, its a good data point for servers as well. And maybe
> >> kernel build in 4G memcg is not a good test.
> >
> > If there is a regression in the kernel build, this means some
> > workloads may be affected, even if Meta's prod isn't. I understand
> > that the benchmark is not very representative of real world workloads,
> > but in this instance I think the thrashing problem surfaced by the
> > benchmark is real.
> >
> >>
> >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> >>>> issue? If we zswap (without the large folio swapin series) and change the window
> >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> >>>> well.
> >>>
> >>> I think large folio swapin would make the problem worse anyway. I am
> >>> also not sure if the readahead window adjusts on memory pressure or
> >>> not.
> >>>
> >> readahead window doesnt look at memory pressure. So maybe the same thing is being
> >> seen here as there would be in swapin_readahead?
> >
> > Maybe readahead is not as aggressive in general as large folio
> > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > of the window is the smaller of page_cluster (2 or 3) and
> > SWAP_RA_ORDER_CEILING (5).
> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
> be similar to enabling 32K mTHP?

Not quite.

>
> >
> > Also readahead will swapin 4k folios AFAICT, so we don't need a
> > contiguous allocation like large folio swapin. So that could be
> > another factor why readahead may not reproduce the problem.

Because of this ^.
Barry Song Oct. 30, 2024, 9:21 p.m. UTC | #15
On Thu, Oct 31, 2024 at 10:10 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > >>> A crucial component is still missing—managing the compression and decompression
> > >>> of multiple pages as a larger block. This could significantly reduce
> > >>> system time and
> > >>> potentially resolve the kernel build issue within a small memory
> > >>> cgroup, even with
> > >>> swap thrashing.
> > >>>
> > >>> I’ll send an update ASAP so you can rebase for zswap.
> > >>
> > >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
> > >> Thats wont benefit zswap, right?
> > >
> > > That's right. I assume we can also make it work with zswap?
> >
> > Hopefully yes. Thats mainly why I was looking at that series, to try and find
> > a way to do something similar for zswap.
>
> I would prefer for these things to be done separately. We still need
> to evaluate the compression/decompression of large blocks. I am mainly
> concerned about having to decompress a large chunk to fault in one
> page.
>
> The obvious problems are fault latency, and wasted work having to
> consistently decompress the large chunk to take one page from it. We
> also need to decide if we'd rather split it after decompression and
> compress the parts that we didn't swap in separately.
>
> This can cause problems beyond the fault latency. Imagine the case
> where the system is under memory pressure, so we fallback to order-0
> swapin to avoid reclaim. Now we want to decompress a chunk that used
> to be 64K.

Yes, this could be an issue.

We had actually tried to utilize several buffers for those partial
swap-in cases,
where the decompressed data was held in anticipation of the upcoming
swap-in. This approach could address the majority of partial swap-ins for
fallback scenarios.

>
> We need to allocate 64K of contiguous memory for a temporary
> allocation to be able to fault a 4K page. Now we either need to:
> - Go into reclaim, which we were trying to avoid to begin with.
> - Dip into reserves to allocate the 64K as it's a temporary
> allocation. This is probably risky because under memory pressure, many
> CPUs may be doing this concurrently.

This has been addressed by using contiguous memory that is prepared on
a per-CPU basis., search the below:
"alloc_pages() might fail, so we don't depend on allocation:"
https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/

Thanks
Barry
Yosry Ahmed Oct. 30, 2024, 9:31 p.m. UTC | #16
On Wed, Oct 30, 2024 at 2:21 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Oct 31, 2024 at 10:10 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > [..]
> > > >>> A crucial component is still missing—managing the compression and decompression
> > > >>> of multiple pages as a larger block. This could significantly reduce
> > > >>> system time and
> > > >>> potentially resolve the kernel build issue within a small memory
> > > >>> cgroup, even with
> > > >>> swap thrashing.
> > > >>>
> > > >>> I’ll send an update ASAP so you can rebase for zswap.
> > > >>
> > > >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/?
> > > >> Thats wont benefit zswap, right?
> > > >
> > > > That's right. I assume we can also make it work with zswap?
> > >
> > > Hopefully yes. Thats mainly why I was looking at that series, to try and find
> > > a way to do something similar for zswap.
> >
> > I would prefer for these things to be done separately. We still need
> > to evaluate the compression/decompression of large blocks. I am mainly
> > concerned about having to decompress a large chunk to fault in one
> > page.
> >
> > The obvious problems are fault latency, and wasted work having to
> > consistently decompress the large chunk to take one page from it. We
> > also need to decide if we'd rather split it after decompression and
> > compress the parts that we didn't swap in separately.
> >
> > This can cause problems beyond the fault latency. Imagine the case
> > where the system is under memory pressure, so we fallback to order-0
> > swapin to avoid reclaim. Now we want to decompress a chunk that used
> > to be 64K.
>
> Yes, this could be an issue.
>
> We had actually tried to utilize several buffers for those partial
> swap-in cases,
> where the decompressed data was held in anticipation of the upcoming
> swap-in. This approach could address the majority of partial swap-ins for
> fallback scenarios.
>
> >
> > We need to allocate 64K of contiguous memory for a temporary
> > allocation to be able to fault a 4K page. Now we either need to:
> > - Go into reclaim, which we were trying to avoid to begin with.
> > - Dip into reserves to allocate the 64K as it's a temporary
> > allocation. This is probably risky because under memory pressure, many
> > CPUs may be doing this concurrently.
>
> This has been addressed by using contiguous memory that is prepared on
> a per-CPU basis., search the below:
> "alloc_pages() might fail, so we don't depend on allocation:"
> https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/

Thanks. I think this is reasonable but it makes it difficult to
increase the size of the chunk.

I would still prefer for both series to remain separate. If we want to
wait for the large folio zswap loads until your series goes in to
offset the thrashing that's fine, but I really think we should try to
address the thrashing on its own.
Johannes Weiner Oct. 31, 2024, 3:38 p.m. UTC | #17
On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
> > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
> > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > >>>> then the mitigation being tried here wont apply.
> > >>>
> > >>> Is the problem reproducible without memcg? I imagine only if the
> > >>> entire system is under memory pressure. I guess we would want the same
> > >>> "mitigation" either way.
> > >>>
> > >> What would be a good open source benchmark/workload to test without limiting memory
> > >> in memcg?
> > >> For the kernel build test, I can only get zswap activity to happen if I build
> > >> in cgroup and limit memory.max.
> > >
> > > You mean a benchmark that puts the entire system under memory
> > > pressure? I am not sure, it ultimately depends on the size of memory
> > > you have, among other factors.
> > >
> > > What if you run the kernel build test in a VM? Then you can limit is
> > > size like a memcg, although you'd probably need to leave more room
> > > because the entire guest OS will also subject to the same limit.
> > >
> >
> > I had tried this, but the variance in time/zswap numbers was very high.
> > Much higher than the AMD numbers I posted in reply to Barry. So found
> > it very difficult to make comparison.
> 
> Hmm yeah maybe more factors come into play with global memory
> pressure. I am honestly not sure how to test this scenario, and I
> suspect variance will be high anyway.
> 
> We can just try to use whatever technique we use for the memcg limit
> though, if possible, right?

You can boot a physical machine with mem=1G on the commandline, which
restricts the physical range of memory that will be initialized.
Double check /proc/meminfo after boot, because part of that physical
range might not be usable RAM.

I do this quite often to test physical memory pressure with workloads
that don't scale up easily, like kernel builds.

> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> > >>>> well.

+1

I also think there is too much focus on cgroup alone. The bigger issue
seems to be how much optimistic volume we swap in when we're under
pressure already. This applies to large folios and readahead; global
memory availability and cgroup limits.

It happens to manifest with THP in cgroups because that's what you
guys are testing. But IMO, any solution to this problem should
consider the wider scope.

> > >>> I think large folio swapin would make the problem worse anyway. I am
> > >>> also not sure if the readahead window adjusts on memory pressure or
> > >>> not.
> > >>>
> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being
> > >> seen here as there would be in swapin_readahead?
> > >
> > > Maybe readahead is not as aggressive in general as large folio
> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > > of the window is the smaller of page_cluster (2 or 3) and
> > > SWAP_RA_ORDER_CEILING (5).
> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might
> > be similar to enabling 32K mTHP?
> 
> Not quite.

Actually, I would expect it to be...

> > > Also readahead will swapin 4k folios AFAICT, so we don't need a
> > > contiguous allocation like large folio swapin. So that could be
> > > another factor why readahead may not reproduce the problem.
> 
> Because of this ^.

...this matters for the physical allocation, which might require more
reclaim and compaction to produce the 32k. But an earlier version of
Barry's patch did the cgroup margin fallback after the THP was already
physically allocated, and it still helped.

So the issue in this test scenario seems to be mostly about cgroup
volume. And then 8 4k charges should be equivalent to a singular 32k
charge when it comes to cgroup pressure.
Yosry Ahmed Oct. 31, 2024, 3:59 p.m. UTC | #18
On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
> > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
> > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > >>>> then the mitigation being tried here wont apply.
> > > >>>
> > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > >>> entire system is under memory pressure. I guess we would want the same
> > > >>> "mitigation" either way.
> > > >>>
> > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > >> in memcg?
> > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > >> in cgroup and limit memory.max.
> > > >
> > > > You mean a benchmark that puts the entire system under memory
> > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > you have, among other factors.
> > > >
> > > > What if you run the kernel build test in a VM? Then you can limit is
> > > > size like a memcg, although you'd probably need to leave more room
> > > > because the entire guest OS will also subject to the same limit.
> > > >
> > >
> > > I had tried this, but the variance in time/zswap numbers was very high.
> > > Much higher than the AMD numbers I posted in reply to Barry. So found
> > > it very difficult to make comparison.
> >
> > Hmm yeah maybe more factors come into play with global memory
> > pressure. I am honestly not sure how to test this scenario, and I
> > suspect variance will be high anyway.
> >
> > We can just try to use whatever technique we use for the memcg limit
> > though, if possible, right?
>
> You can boot a physical machine with mem=1G on the commandline, which
> restricts the physical range of memory that will be initialized.
> Double check /proc/meminfo after boot, because part of that physical
> range might not be usable RAM.
>
> I do this quite often to test physical memory pressure with workloads
> that don't scale up easily, like kernel builds.
>
> > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> > > >>>> well.
>
> +1
>
> I also think there is too much focus on cgroup alone. The bigger issue
> seems to be how much optimistic volume we swap in when we're under
> pressure already. This applies to large folios and readahead; global
> memory availability and cgroup limits.

Agreed, although the characteristics of large folios and readahead are
different. But yeah, different flavors of the same problem.

>
> It happens to manifest with THP in cgroups because that's what you
> guys are testing. But IMO, any solution to this problem should
> consider the wider scope.

+1, and I really think this should be addressed separately, not just
rely on large block compression/decompression to offset the cost. It's
probably not just a zswap/zram problem anyway, it just happens to be
what we support large folio swapin for.

>
> > > >>> I think large folio swapin would make the problem worse anyway. I am
> > > >>> also not sure if the readahead window adjusts on memory pressure or
> > > >>> not.
> > > >>>
> > > >> readahead window doesnt look at memory pressure. So maybe the same thing is being
> > > >> seen here as there would be in swapin_readahead?
> > > >
> > > > Maybe readahead is not as aggressive in general as large folio
> > > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > > > of the window is the smaller of page_cluster (2 or 3) and
> > > > SWAP_RA_ORDER_CEILING (5).
> > > Yes, I was seeing 8 pages swapin (order 3) when testing. So might
> > > be similar to enabling 32K mTHP?
> >
> > Not quite.
>
> Actually, I would expect it to be...
>
> > > > Also readahead will swapin 4k folios AFAICT, so we don't need a
> > > > contiguous allocation like large folio swapin. So that could be
> > > > another factor why readahead may not reproduce the problem.
> >
> > Because of this ^.
>
> ...this matters for the physical allocation, which might require more
> reclaim and compaction to produce the 32k. But an earlier version of
> Barry's patch did the cgroup margin fallback after the THP was already
> physically allocated, and it still helped.
>
> So the issue in this test scenario seems to be mostly about cgroup
> volume. And then 8 4k charges should be equivalent to a singular 32k
> charge when it comes to cgroup pressure.

In this test scenario, yes, because it's only exercising cgroup
pressure. But if we want a general solution that also addresses global
pressure, I expect large folios to be worse because of the contiguity
and the size (compared to default readahead window sizes). So I think
we shouldn't only test with readahead, as it won't cover some of the
large folio cases.
Barry Song Oct. 31, 2024, 8:59 p.m. UTC | #19
On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
> > > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
> > > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > > >>>> then the mitigation being tried here wont apply.
> > > > >>>
> > > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > > >>> entire system is under memory pressure. I guess we would want the same
> > > > >>> "mitigation" either way.
> > > > >>>
> > > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > > >> in memcg?
> > > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > > >> in cgroup and limit memory.max.
> > > > >
> > > > > You mean a benchmark that puts the entire system under memory
> > > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > > you have, among other factors.
> > > > >
> > > > > What if you run the kernel build test in a VM? Then you can limit is
> > > > > size like a memcg, although you'd probably need to leave more room
> > > > > because the entire guest OS will also subject to the same limit.
> > > > >
> > > >
> > > > I had tried this, but the variance in time/zswap numbers was very high.
> > > > Much higher than the AMD numbers I posted in reply to Barry. So found
> > > > it very difficult to make comparison.
> > >
> > > Hmm yeah maybe more factors come into play with global memory
> > > pressure. I am honestly not sure how to test this scenario, and I
> > > suspect variance will be high anyway.
> > >
> > > We can just try to use whatever technique we use for the memcg limit
> > > though, if possible, right?
> >
> > You can boot a physical machine with mem=1G on the commandline, which
> > restricts the physical range of memory that will be initialized.
> > Double check /proc/meminfo after boot, because part of that physical
> > range might not be usable RAM.
> >
> > I do this quite often to test physical memory pressure with workloads
> > that don't scale up easily, like kernel builds.
> >
> > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> > > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> > > > >>>> well.
> >
> > +1
> >
> > I also think there is too much focus on cgroup alone. The bigger issue
> > seems to be how much optimistic volume we swap in when we're under
> > pressure already. This applies to large folios and readahead; global
> > memory availability and cgroup limits.
>
> Agreed, although the characteristics of large folios and readahead are
> different. But yeah, different flavors of the same problem.
>
> >
> > It happens to manifest with THP in cgroups because that's what you
> > guys are testing. But IMO, any solution to this problem should
> > consider the wider scope.
>
> +1, and I really think this should be addressed separately, not just
> rely on large block compression/decompression to offset the cost. It's
> probably not just a zswap/zram problem anyway, it just happens to be
> what we support large folio swapin for.

Agreed these are two separate issues and should be both investigated
though 2 can offset the cost of 1.
1. swap thrashing
2. large block compression/decompression

For point 1, we likely want to investigate the following:

1. if we can see the same thrashing if we always perform readahead
(rapidly filling
the memcg to full again after reclamation).

2. Whether there are any issues with balancing file and anon memory
reclamation.

The 'refault feedback loop' in mglru compares refault rates between anon and
file pages to decide which type should be prioritized for reclamation.

type = get_type_to_scan(lruvec, swappiness, &tier);

static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int
*tier_idx)
{
        ...
        read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
        read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
        type = positive_ctrl_err(&sp, &pv);

        read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
        for (tier = 1; tier < MAX_NR_TIERS; tier++) {
                read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
                if (!positive_ctrl_err(&sp, &pv))
                        break;
        }

        *tier_idx = tier - 1;
        return type;
}

In this case, we may want to investigate whether reclamation is primarily
targeting anonymous memory due to potential errors in the statistics path
after mTHP is involved.

3. Determine if this is a memcg-specific issue by setting mem=1GB and
running the same test on the global system.

Yosry, Johannes, Usama,
Is there anything else that might interest us?

I'll get back to you after completing the investigation mentioned above.

>
> >
> > > > >>> I think large folio swapin would make the problem worse anyway. I am
> > > > >>> also not sure if the readahead window adjusts on memory pressure or
> > > > >>> not.
> > > > >>>
> > > > >> readahead window doesnt look at memory pressure. So maybe the same thing is being
> > > > >> seen here as there would be in swapin_readahead?
> > > > >
> > > > > Maybe readahead is not as aggressive in general as large folio
> > > > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > > > > of the window is the smaller of page_cluster (2 or 3) and
> > > > > SWAP_RA_ORDER_CEILING (5).
> > > > Yes, I was seeing 8 pages swapin (order 3) when testing. So might
> > > > be similar to enabling 32K mTHP?
> > >
> > > Not quite.
> >
> > Actually, I would expect it to be...
> >
> > > > > Also readahead will swapin 4k folios AFAICT, so we don't need a
> > > > > contiguous allocation like large folio swapin. So that could be
> > > > > another factor why readahead may not reproduce the problem.
> > >
> > > Because of this ^.
> >
> > ...this matters for the physical allocation, which might require more
> > reclaim and compaction to produce the 32k. But an earlier version of
> > Barry's patch did the cgroup margin fallback after the THP was already
> > physically allocated, and it still helped.
> >
> > So the issue in this test scenario seems to be mostly about cgroup
> > volume. And then 8 4k charges should be equivalent to a singular 32k
> > charge when it comes to cgroup pressure.
>
> In this test scenario, yes, because it's only exercising cgroup
> pressure. But if we want a general solution that also addresses global
> pressure, I expect large folios to be worse because of the contiguity
> and the size (compared to default readahead window sizes). So I think
> we shouldn't only test with readahead, as it won't cover some of the
> large folio cases.

Thanks
barry
Yosry Ahmed Nov. 1, 2024, 4:19 p.m. UTC | #20
On Thu, Oct 31, 2024 at 2:00 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
> > > > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
> > > > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > > > >>>> then the mitigation being tried here wont apply.
> > > > > >>>
> > > > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > > > >>> entire system is under memory pressure. I guess we would want the same
> > > > > >>> "mitigation" either way.
> > > > > >>>
> > > > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > > > >> in memcg?
> > > > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > > > >> in cgroup and limit memory.max.
> > > > > >
> > > > > > You mean a benchmark that puts the entire system under memory
> > > > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > > > you have, among other factors.
> > > > > >
> > > > > > What if you run the kernel build test in a VM? Then you can limit is
> > > > > > size like a memcg, although you'd probably need to leave more room
> > > > > > because the entire guest OS will also subject to the same limit.
> > > > > >
> > > > >
> > > > > I had tried this, but the variance in time/zswap numbers was very high.
> > > > > Much higher than the AMD numbers I posted in reply to Barry. So found
> > > > > it very difficult to make comparison.
> > > >
> > > > Hmm yeah maybe more factors come into play with global memory
> > > > pressure. I am honestly not sure how to test this scenario, and I
> > > > suspect variance will be high anyway.
> > > >
> > > > We can just try to use whatever technique we use for the memcg limit
> > > > though, if possible, right?
> > >
> > > You can boot a physical machine with mem=1G on the commandline, which
> > > restricts the physical range of memory that will be initialized.
> > > Double check /proc/meminfo after boot, because part of that physical
> > > range might not be usable RAM.
> > >
> > > I do this quite often to test physical memory pressure with workloads
> > > that don't scale up easily, like kernel builds.
> > >
> > > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > > > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> > > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> > > > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> > > > > >>>> well.
> > >
> > > +1
> > >
> > > I also think there is too much focus on cgroup alone. The bigger issue
> > > seems to be how much optimistic volume we swap in when we're under
> > > pressure already. This applies to large folios and readahead; global
> > > memory availability and cgroup limits.
> >
> > Agreed, although the characteristics of large folios and readahead are
> > different. But yeah, different flavors of the same problem.
> >
> > >
> > > It happens to manifest with THP in cgroups because that's what you
> > > guys are testing. But IMO, any solution to this problem should
> > > consider the wider scope.
> >
> > +1, and I really think this should be addressed separately, not just
> > rely on large block compression/decompression to offset the cost. It's
> > probably not just a zswap/zram problem anyway, it just happens to be
> > what we support large folio swapin for.
>
> Agreed these are two separate issues and should be both investigated
> though 2 can offset the cost of 1.
> 1. swap thrashing
> 2. large block compression/decompression
>
> For point 1, we likely want to investigate the following:
>
> 1. if we can see the same thrashing if we always perform readahead
> (rapidly filling
> the memcg to full again after reclamation).
>
> 2. Whether there are any issues with balancing file and anon memory
> reclamation.
>
> The 'refault feedback loop' in mglru compares refault rates between anon and
> file pages to decide which type should be prioritized for reclamation.
>
> type = get_type_to_scan(lruvec, swappiness, &tier);
>
> static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int
> *tier_idx)
> {
>         ...
>         read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
>         read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
>         type = positive_ctrl_err(&sp, &pv);
>
>         read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
>         for (tier = 1; tier < MAX_NR_TIERS; tier++) {
>                 read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
>                 if (!positive_ctrl_err(&sp, &pv))
>                         break;
>         }
>
>         *tier_idx = tier - 1;
>         return type;
> }
>
> In this case, we may want to investigate whether reclamation is primarily
> targeting anonymous memory due to potential errors in the statistics path
> after mTHP is involved.
>
> 3. Determine if this is a memcg-specific issue by setting mem=1GB and
> running the same test on the global system.
>
> Yosry, Johannes, Usama,
> Is there anything else that might interest us?
>
> I'll get back to you after completing the investigation mentioned above.

Thanks for looking into this.

Perhaps a naive question, but is this only related to swap faults? Can
the same scenario happen with other types of faults allocating large
folios (e.g. faulting in a file page, or a new anon allocation)?

Do swap faults use a different policy for determining the folio order,
or is it just the swap faults are naturally more correlated to memory
pressure, so that's how the issue was surfaced?
Huang, Ying Nov. 4, 2024, 6:42 a.m. UTC | #21
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> > On 30/10/2024 21:01, Yosry Ahmed wrote:
>> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> > >>>> I am not sure that the approach we are trying in this patch is the right way:
>> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> > >>>> then the mitigation being tried here wont apply.
>> > >>>
>> > >>> Is the problem reproducible without memcg? I imagine only if the
>> > >>> entire system is under memory pressure. I guess we would want the same
>> > >>> "mitigation" either way.
>> > >>>
>> > >> What would be a good open source benchmark/workload to test without limiting memory
>> > >> in memcg?
>> > >> For the kernel build test, I can only get zswap activity to happen if I build
>> > >> in cgroup and limit memory.max.
>> > >
>> > > You mean a benchmark that puts the entire system under memory
>> > > pressure? I am not sure, it ultimately depends on the size of memory
>> > > you have, among other factors.
>> > >
>> > > What if you run the kernel build test in a VM? Then you can limit is
>> > > size like a memcg, although you'd probably need to leave more room
>> > > because the entire guest OS will also subject to the same limit.
>> > >
>> >
>> > I had tried this, but the variance in time/zswap numbers was very high.
>> > Much higher than the AMD numbers I posted in reply to Barry. So found
>> > it very difficult to make comparison.
>> 
>> Hmm yeah maybe more factors come into play with global memory
>> pressure. I am honestly not sure how to test this scenario, and I
>> suspect variance will be high anyway.
>> 
>> We can just try to use whatever technique we use for the memcg limit
>> though, if possible, right?
>
> You can boot a physical machine with mem=1G on the commandline, which
> restricts the physical range of memory that will be initialized.
> Double check /proc/meminfo after boot, because part of that physical
> range might not be usable RAM.
>
> I do this quite often to test physical memory pressure with workloads
> that don't scale up easily, like kernel builds.
>
>> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> > >>>> issue? If we zswap (without the large folio swapin series) and change the window
>> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>> > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>> > >>>> well.
>
> +1
>
> I also think there is too much focus on cgroup alone. The bigger issue
> seems to be how much optimistic volume we swap in when we're under
> pressure already. This applies to large folios and readahead; global
> memory availability and cgroup limits.

The current swap readahead logic is something like,

1. try readahead some pages for sequential access pattern, mark them as
   readahead

2. if these readahead pages get accessed before swapped out again,
   increase 'hits' counter

3. for next swap in, try readahead 'hits' pages and clear 'hits'.

So, if there's heavy memory pressure, the readaheaded pages will not be
accessed before being swapped out again (in 2 above), the readahead
pages will be minimal.

IMHO, mTHP swap-in is kind of swap readahead in effect.  That is, in
addition to the pages accessed are swapped in, the adjacent pages are
swapped in (swap readahead) too.  If these readahead pages are not
accessed before swapped out again, system runs into more severe
thrashing.  This is because we lack the swap readahead window scaling
mechanism as above.  And, this is why I suggested to combine the swap
readahead mechanism and mTHP swap-in by default before.  That is, when
kernel swaps in a page, it checks current swap readahead window, and
decides mTHP order according to window size.  So, if there are heavy
memory pressure, so that the nearby pages will not be accessed before
being swapped out again, the mTHP swap-in order can be adjusted
automatically.

> It happens to manifest with THP in cgroups because that's what you
> guys are testing. But IMO, any solution to this problem should
> consider the wider scope.
>
>> > >>> I think large folio swapin would make the problem worse anyway. I am
>> > >>> also not sure if the readahead window adjusts on memory pressure or
>> > >>> not.
>> > >>>
>> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being
>> > >> seen here as there would be in swapin_readahead?
>> > >
>> > > Maybe readahead is not as aggressive in general as large folio
>> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>> > > of the window is the smaller of page_cluster (2 or 3) and
>> > > SWAP_RA_ORDER_CEILING (5).
>> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>> > be similar to enabling 32K mTHP?
>> 
>> Not quite.
>
> Actually, I would expect it to be...

Me too.

>> > > Also readahead will swapin 4k folios AFAICT, so we don't need a
>> > > contiguous allocation like large folio swapin. So that could be
>> > > another factor why readahead may not reproduce the problem.
>> 
>> Because of this ^.
>
> ...this matters for the physical allocation, which might require more
> reclaim and compaction to produce the 32k. But an earlier version of
> Barry's patch did the cgroup margin fallback after the THP was already
> physically allocated, and it still helped.
>
> So the issue in this test scenario seems to be mostly about cgroup
> volume. And then 8 4k charges should be equivalent to a singular 32k
> charge when it comes to cgroup pressure.

--
Best Regards,
Huang, Ying
Barry Song Nov. 4, 2024, 8:06 a.m. UTC | #22
On Mon, Nov 4, 2024 at 7:46 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Johannes Weiner <hannes@cmpxchg.org> writes:
>
> > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> >> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >> > On 30/10/2024 21:01, Yosry Ahmed wrote:
> >> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >> > >>>> I am not sure that the approach we are trying in this patch is the right way:
> >> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> >> > >>>> then the mitigation being tried here wont apply.
> >> > >>>
> >> > >>> Is the problem reproducible without memcg? I imagine only if the
> >> > >>> entire system is under memory pressure. I guess we would want the same
> >> > >>> "mitigation" either way.
> >> > >>>
> >> > >> What would be a good open source benchmark/workload to test without limiting memory
> >> > >> in memcg?
> >> > >> For the kernel build test, I can only get zswap activity to happen if I build
> >> > >> in cgroup and limit memory.max.
> >> > >
> >> > > You mean a benchmark that puts the entire system under memory
> >> > > pressure? I am not sure, it ultimately depends on the size of memory
> >> > > you have, among other factors.
> >> > >
> >> > > What if you run the kernel build test in a VM? Then you can limit is
> >> > > size like a memcg, although you'd probably need to leave more room
> >> > > because the entire guest OS will also subject to the same limit.
> >> > >
> >> >
> >> > I had tried this, but the variance in time/zswap numbers was very high.
> >> > Much higher than the AMD numbers I posted in reply to Barry. So found
> >> > it very difficult to make comparison.
> >>
> >> Hmm yeah maybe more factors come into play with global memory
> >> pressure. I am honestly not sure how to test this scenario, and I
> >> suspect variance will be high anyway.
> >>
> >> We can just try to use whatever technique we use for the memcg limit
> >> though, if possible, right?
> >
> > You can boot a physical machine with mem=1G on the commandline, which
> > restricts the physical range of memory that will be initialized.
> > Double check /proc/meminfo after boot, because part of that physical
> > range might not be usable RAM.
> >
> > I do this quite often to test physical memory pressure with workloads
> > that don't scale up easily, like kernel builds.
> >
> >> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> >> > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> >> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> >> > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> >> > >>>> well.
> >
> > +1
> >
> > I also think there is too much focus on cgroup alone. The bigger issue
> > seems to be how much optimistic volume we swap in when we're under
> > pressure already. This applies to large folios and readahead; global
> > memory availability and cgroup limits.
>
> The current swap readahead logic is something like,
>
> 1. try readahead some pages for sequential access pattern, mark them as
>    readahead
>
> 2. if these readahead pages get accessed before swapped out again,
>    increase 'hits' counter
>
> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
>
> So, if there's heavy memory pressure, the readaheaded pages will not be
> accessed before being swapped out again (in 2 above), the readahead
> pages will be minimal.
>
> IMHO, mTHP swap-in is kind of swap readahead in effect.  That is, in
> addition to the pages accessed are swapped in, the adjacent pages are
> swapped in (swap readahead) too.  If these readahead pages are not
> accessed before swapped out again, system runs into more severe
> thrashing.  This is because we lack the swap readahead window scaling
> mechanism as above.  And, this is why I suggested to combine the swap
> readahead mechanism and mTHP swap-in by default before.  That is, when
> kernel swaps in a page, it checks current swap readahead window, and
> decides mTHP order according to window size.  So, if there are heavy
> memory pressure, so that the nearby pages will not be accessed before
> being swapped out again, the mTHP swap-in order can be adjusted
> automatically.

This might help reduce memory reclamation thrashing for kernel build
workload running in a memory limited memcg which might not benefit
from mTHP that much. But this mechanism has clear disadvantages:

1. Loss of large folios: For example, if you're using app A and then switch
to app B, a significant portion (around 60%) of A's memory might be swapped
out as it moves to the background. When you switch back to app A, a large
portion of the memory originally in mTHP could be lost while swapping
in small folios.

Essentially, systems with a user interface operate quite differently from kernel
build workloads running under a memory-limited memcg, as they switch
applications between the foreground and background.

2. Fragmentation of swap slots: This fragmentation increases the likelihood of
mTHP swapout failures, as it makes it harder to maintain contiguous memory
blocks in swap.

3. Prevent the implementation of large block compression and decompression
to achieve higher compression ratios and significantly lower CPU consumption,
as small folio swap-ins may still remain the predominant approach.

Memory-limited systems often face challenges with larger page sizes. Even on
systems that support various base page sizes, such as 4KB, 16KB, and 64KB
on ARM64, using 16KB or 64KB as the base page size is not always the best
choice.  With mTHP, we have already enabled per-size settings. For this kernel
build workload operating within a limited memcg, enabling only 16kB is likely
the best option for optimizing performance and minimizing thrashing.
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled

We could focus on mTHP and seek strategies to minimize thrashing when free
memory is severely limited :-)

>
> > It happens to manifest with THP in cgroups because that's what you
> > guys are testing. But IMO, any solution to this problem should
> > consider the wider scope.
> >
> >> > >>> I think large folio swapin would make the problem worse anyway. I am
> >> > >>> also not sure if the readahead window adjusts on memory pressure or
> >> > >>> not.
> >> > >>>
> >> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being
> >> > >> seen here as there would be in swapin_readahead?
> >> > >
> >> > > Maybe readahead is not as aggressive in general as large folio
> >> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> >> > > of the window is the smaller of page_cluster (2 or 3) and
> >> > > SWAP_RA_ORDER_CEILING (5).
> >> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might
> >> > be similar to enabling 32K mTHP?
> >>
> >> Not quite.
> >
> > Actually, I would expect it to be...
>
> Me too.
>
> >> > > Also readahead will swapin 4k folios AFAICT, so we don't need a
> >> > > contiguous allocation like large folio swapin. So that could be
> >> > > another factor why readahead may not reproduce the problem.
> >>
> >> Because of this ^.
> >
> > ...this matters for the physical allocation, which might require more
> > reclaim and compaction to produce the 32k. But an earlier version of
> > Barry's patch did the cgroup margin fallback after the THP was already
> > physically allocated, and it still helped.
> >
> > So the issue in this test scenario seems to be mostly about cgroup
> > volume. And then 8 4k charges should be equivalent to a singular 32k
> > charge when it comes to cgroup pressure.
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Usama Arif Nov. 4, 2024, 12:13 p.m. UTC | #23
On 04/11/2024 06:42, Huang, Ying wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
> 
>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>>>>>>>> then the mitigation being tried here wont apply.
>>>>>>>
>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>>>>>>> entire system is under memory pressure. I guess we would want the same
>>>>>>> "mitigation" either way.
>>>>>>>
>>>>>> What would be a good open source benchmark/workload to test without limiting memory
>>>>>> in memcg?
>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
>>>>>> in cgroup and limit memory.max.
>>>>>
>>>>> You mean a benchmark that puts the entire system under memory
>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>>>>> you have, among other factors.
>>>>>
>>>>> What if you run the kernel build test in a VM? Then you can limit is
>>>>> size like a memcg, although you'd probably need to leave more room
>>>>> because the entire guest OS will also subject to the same limit.
>>>>>
>>>>
>>>> I had tried this, but the variance in time/zswap numbers was very high.
>>>> Much higher than the AMD numbers I posted in reply to Barry. So found
>>>> it very difficult to make comparison.
>>>
>>> Hmm yeah maybe more factors come into play with global memory
>>> pressure. I am honestly not sure how to test this scenario, and I
>>> suspect variance will be high anyway.
>>>
>>> We can just try to use whatever technique we use for the memcg limit
>>> though, if possible, right?
>>
>> You can boot a physical machine with mem=1G on the commandline, which
>> restricts the physical range of memory that will be initialized.
>> Double check /proc/meminfo after boot, because part of that physical
>> range might not be usable RAM.
>>
>> I do this quite often to test physical memory pressure with workloads
>> that don't scale up easily, like kernel builds.
>>
>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>>>>>>>> well.
>>
>> +1
>>
>> I also think there is too much focus on cgroup alone. The bigger issue
>> seems to be how much optimistic volume we swap in when we're under
>> pressure already. This applies to large folios and readahead; global
>> memory availability and cgroup limits.
> 
> The current swap readahead logic is something like,
> 
> 1. try readahead some pages for sequential access pattern, mark them as
>    readahead
> 
> 2. if these readahead pages get accessed before swapped out again,
>    increase 'hits' counter
> 
> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
> 
> So, if there's heavy memory pressure, the readaheaded pages will not be
> accessed before being swapped out again (in 2 above), the readahead
> pages will be minimal.
> 
> IMHO, mTHP swap-in is kind of swap readahead in effect.  That is, in
> addition to the pages accessed are swapped in, the adjacent pages are
> swapped in (swap readahead) too.  If these readahead pages are not
> accessed before swapped out again, system runs into more severe
> thrashing.  This is because we lack the swap readahead window scaling
> mechanism as above.  And, this is why I suggested to combine the swap
> readahead mechanism and mTHP swap-in by default before.  That is, when
> kernel swaps in a page, it checks current swap readahead window, and
> decides mTHP order according to window size.  So, if there are heavy
> memory pressure, so that the nearby pages will not be accessed before
> being swapped out again, the mTHP swap-in order can be adjusted
> automatically.

This is a good idea to do, but I think the issue is that readahead
is a folio flag and not a page flag, so only works when folio size is 1.

In the swapin_readahead swapcache path, the current implementation decides
the ra_window based on hits, which is incremented in swap_cache_get_folio
if it has not been gotten from swapcache before.
The problem would be that we need information on how many distinct pages in
a large folio that has been swapped in have been accessed to decide the
hits/window size, which I don't think is possible. As once the entire large
folio has been swapped in, we won't get a fault.


> 
>> It happens to manifest with THP in cgroups because that's what you
>> guys are testing. But IMO, any solution to this problem should
>> consider the wider scope.
>>
>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>>>>>>> not.
>>>>>>>
>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being
>>>>>> seen here as there would be in swapin_readahead?
>>>>>
>>>>> Maybe readahead is not as aggressive in general as large folio
>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>>>>> of the window is the smaller of page_cluster (2 or 3) and
>>>>> SWAP_RA_ORDER_CEILING (5).
>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>>>> be similar to enabling 32K mTHP?
>>>
>>> Not quite.
>>
>> Actually, I would expect it to be...
> 
> Me too.
> 
>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
>>>>> contiguous allocation like large folio swapin. So that could be
>>>>> another factor why readahead may not reproduce the problem.
>>>
>>> Because of this ^.
>>
>> ...this matters for the physical allocation, which might require more
>> reclaim and compaction to produce the 32k. But an earlier version of
>> Barry's patch did the cgroup margin fallback after the THP was already
>> physically allocated, and it still helped.
>>
>> So the issue in this test scenario seems to be mostly about cgroup
>> volume. And then 8 4k charges should be equivalent to a singular 32k
>> charge when it comes to cgroup pressure.
> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying Nov. 4, 2024, 1:09 p.m. UTC | #24
Barry Song <21cnbao@gmail.com> writes:

> On Mon, Nov 4, 2024 at 7:46 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>>
>> > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>> >> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> >> > On 30/10/2024 21:01, Yosry Ahmed wrote:
>> >> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> >> > >>>> I am not sure that the approach we are trying in this patch is the right way:
>> >> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> >> > >>>> then the mitigation being tried here wont apply.
>> >> > >>>
>> >> > >>> Is the problem reproducible without memcg? I imagine only if the
>> >> > >>> entire system is under memory pressure. I guess we would want the same
>> >> > >>> "mitigation" either way.
>> >> > >>>
>> >> > >> What would be a good open source benchmark/workload to test without limiting memory
>> >> > >> in memcg?
>> >> > >> For the kernel build test, I can only get zswap activity to happen if I build
>> >> > >> in cgroup and limit memory.max.
>> >> > >
>> >> > > You mean a benchmark that puts the entire system under memory
>> >> > > pressure? I am not sure, it ultimately depends on the size of memory
>> >> > > you have, among other factors.
>> >> > >
>> >> > > What if you run the kernel build test in a VM? Then you can limit is
>> >> > > size like a memcg, although you'd probably need to leave more room
>> >> > > because the entire guest OS will also subject to the same limit.
>> >> > >
>> >> >
>> >> > I had tried this, but the variance in time/zswap numbers was very high.
>> >> > Much higher than the AMD numbers I posted in reply to Barry. So found
>> >> > it very difficult to make comparison.
>> >>
>> >> Hmm yeah maybe more factors come into play with global memory
>> >> pressure. I am honestly not sure how to test this scenario, and I
>> >> suspect variance will be high anyway.
>> >>
>> >> We can just try to use whatever technique we use for the memcg limit
>> >> though, if possible, right?
>> >
>> > You can boot a physical machine with mem=1G on the commandline, which
>> > restricts the physical range of memory that will be initialized.
>> > Double check /proc/meminfo after boot, because part of that physical
>> > range might not be usable RAM.
>> >
>> > I do this quite often to test physical memory pressure with workloads
>> > that don't scale up easily, like kernel builds.
>> >
>> >> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> >> > >>>> issue? If we zswap (without the large folio swapin series) and change the window
>> >> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>> >> > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>> >> > >>>> well.
>> >
>> > +1
>> >
>> > I also think there is too much focus on cgroup alone. The bigger issue
>> > seems to be how much optimistic volume we swap in when we're under
>> > pressure already. This applies to large folios and readahead; global
>> > memory availability and cgroup limits.
>>
>> The current swap readahead logic is something like,
>>
>> 1. try readahead some pages for sequential access pattern, mark them as
>>    readahead
>>
>> 2. if these readahead pages get accessed before swapped out again,
>>    increase 'hits' counter
>>
>> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
>>
>> So, if there's heavy memory pressure, the readaheaded pages will not be
>> accessed before being swapped out again (in 2 above), the readahead
>> pages will be minimal.
>>
>> IMHO, mTHP swap-in is kind of swap readahead in effect.  That is, in
>> addition to the pages accessed are swapped in, the adjacent pages are
>> swapped in (swap readahead) too.  If these readahead pages are not
>> accessed before swapped out again, system runs into more severe
>> thrashing.  This is because we lack the swap readahead window scaling
>> mechanism as above.  And, this is why I suggested to combine the swap
>> readahead mechanism and mTHP swap-in by default before.  That is, when
>> kernel swaps in a page, it checks current swap readahead window, and
>> decides mTHP order according to window size.  So, if there are heavy
>> memory pressure, so that the nearby pages will not be accessed before
>> being swapped out again, the mTHP swap-in order can be adjusted
>> automatically.
>
> This might help reduce memory reclamation thrashing for kernel build
> workload running in a memory limited memcg which might not benefit
> from mTHP that much. But this mechanism has clear disadvantages:
>
> 1. Loss of large folios: For example, if you're using app A and then switch
> to app B, a significant portion (around 60%) of A's memory might be swapped
> out as it moves to the background. When you switch back to app A, a large
> portion of the memory originally in mTHP could be lost while swapping
> in small folios.

Why?  Most app A pages could be swapped in as mTHP if all readahead
pages are accessed before being swapped out again.

> Essentially, systems with a user interface operate quite differently from kernel
> build workloads running under a memory-limited memcg, as they switch
> applications between the foreground and background.
>
> 2. Fragmentation of swap slots: This fragmentation increases the likelihood of
> mTHP swapout failures, as it makes it harder to maintain contiguous memory
> blocks in swap.
>
> 3. Prevent the implementation of large block compression and decompression
> to achieve higher compression ratios and significantly lower CPU consumption,
> as small folio swap-ins may still remain the predominant approach.
>
> Memory-limited systems often face challenges with larger page sizes. Even on
> systems that support various base page sizes, such as 4KB, 16KB, and 64KB
> on ARM64, using 16KB or 64KB as the base page size is not always the best
> choice.  With mTHP, we have already enabled per-size settings. For this kernel
> build workload operating within a limited memcg, enabling only 16kB is likely
> the best option for optimizing performance and minimizing thrashing.
> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>
> We could focus on mTHP and seek strategies to minimize thrashing when free
> memory is severely limited :-)

IIUC, mTHP can improve performance but may waste memory too.  The best
mTHP order may be different for different workloads and system states
(e.g. high memory pressure).  So, we need a way to identify the best
mTHP order automatically.

Swap readahead window size auto-scaling can be used to design a method
to identify the best mTHP order for swap-in.  I'm open to other possible
methods too.

I admit a fixed mTHP order can be best for some specific workloads or
system states.  However, it may be hard to find one mTHP order that
works well for all workloads and system states.

>> > It happens to manifest with THP in cgroups because that's what you
>> > guys are testing. But IMO, any solution to this problem should
>> > consider the wider scope.
>> >
>> >> > >>> I think large folio swapin would make the problem worse anyway. I am
>> >> > >>> also not sure if the readahead window adjusts on memory pressure or
>> >> > >>> not.
>> >> > >>>
>> >> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being
>> >> > >> seen here as there would be in swapin_readahead?
>> >> > >
>> >> > > Maybe readahead is not as aggressive in general as large folio
>> >> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>> >> > > of the window is the smaller of page_cluster (2 or 3) and
>> >> > > SWAP_RA_ORDER_CEILING (5).
>> >> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>> >> > be similar to enabling 32K mTHP?
>> >>
>> >> Not quite.
>> >
>> > Actually, I would expect it to be...
>>
>> Me too.
>>
>> >> > > Also readahead will swapin 4k folios AFAICT, so we don't need a
>> >> > > contiguous allocation like large folio swapin. So that could be
>> >> > > another factor why readahead may not reproduce the problem.
>> >>
>> >> Because of this ^.
>> >
>> > ...this matters for the physical allocation, which might require more
>> > reclaim and compaction to produce the 32k. But an earlier version of
>> > Barry's patch did the cgroup margin fallback after the THP was already
>> > physically allocated, and it still helped.
>> >
>> > So the issue in this test scenario seems to be mostly about cgroup
>> > volume. And then 8 4k charges should be equivalent to a singular 32k
>> > charge when it comes to cgroup pressure.
>>

--
Best Regards,
Huang, Ying
Huang, Ying Nov. 5, 2024, 12:57 a.m. UTC | #25
Usama Arif <usamaarif642@gmail.com> writes:

> On 04/11/2024 06:42, Huang, Ying wrote:
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>> 
>>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>>>>>>>>> then the mitigation being tried here wont apply.
>>>>>>>>
>>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>>>>>>>> entire system is under memory pressure. I guess we would want the same
>>>>>>>> "mitigation" either way.
>>>>>>>>
>>>>>>> What would be a good open source benchmark/workload to test without limiting memory
>>>>>>> in memcg?
>>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
>>>>>>> in cgroup and limit memory.max.
>>>>>>
>>>>>> You mean a benchmark that puts the entire system under memory
>>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>>>>>> you have, among other factors.
>>>>>>
>>>>>> What if you run the kernel build test in a VM? Then you can limit is
>>>>>> size like a memcg, although you'd probably need to leave more room
>>>>>> because the entire guest OS will also subject to the same limit.
>>>>>>
>>>>>
>>>>> I had tried this, but the variance in time/zswap numbers was very high.
>>>>> Much higher than the AMD numbers I posted in reply to Barry. So found
>>>>> it very difficult to make comparison.
>>>>
>>>> Hmm yeah maybe more factors come into play with global memory
>>>> pressure. I am honestly not sure how to test this scenario, and I
>>>> suspect variance will be high anyway.
>>>>
>>>> We can just try to use whatever technique we use for the memcg limit
>>>> though, if possible, right?
>>>
>>> You can boot a physical machine with mem=1G on the commandline, which
>>> restricts the physical range of memory that will be initialized.
>>> Double check /proc/meminfo after boot, because part of that physical
>>> range might not be usable RAM.
>>>
>>> I do this quite often to test physical memory pressure with workloads
>>> that don't scale up easily, like kernel builds.
>>>
>>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
>>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>>>>>>>>> well.
>>>
>>> +1
>>>
>>> I also think there is too much focus on cgroup alone. The bigger issue
>>> seems to be how much optimistic volume we swap in when we're under
>>> pressure already. This applies to large folios and readahead; global
>>> memory availability and cgroup limits.
>> 
>> The current swap readahead logic is something like,
>> 
>> 1. try readahead some pages for sequential access pattern, mark them as
>>    readahead
>> 
>> 2. if these readahead pages get accessed before swapped out again,
>>    increase 'hits' counter
>> 
>> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
>> 
>> So, if there's heavy memory pressure, the readaheaded pages will not be
>> accessed before being swapped out again (in 2 above), the readahead
>> pages will be minimal.
>> 
>> IMHO, mTHP swap-in is kind of swap readahead in effect.  That is, in
>> addition to the pages accessed are swapped in, the adjacent pages are
>> swapped in (swap readahead) too.  If these readahead pages are not
>> accessed before swapped out again, system runs into more severe
>> thrashing.  This is because we lack the swap readahead window scaling
>> mechanism as above.  And, this is why I suggested to combine the swap
>> readahead mechanism and mTHP swap-in by default before.  That is, when
>> kernel swaps in a page, it checks current swap readahead window, and
>> decides mTHP order according to window size.  So, if there are heavy
>> memory pressure, so that the nearby pages will not be accessed before
>> being swapped out again, the mTHP swap-in order can be adjusted
>> automatically.
>
> This is a good idea to do, but I think the issue is that readahead
> is a folio flag and not a page flag, so only works when folio size is 1.
>
> In the swapin_readahead swapcache path, the current implementation decides
> the ra_window based on hits, which is incremented in swap_cache_get_folio
> if it has not been gotten from swapcache before.
> The problem would be that we need information on how many distinct pages in
> a large folio that has been swapped in have been accessed to decide the
> hits/window size, which I don't think is possible. As once the entire large
> folio has been swapped in, we won't get a fault.
>

To do that, we need to move readahead flag to per-page from per-folio.
And we need to map only the accessed page of the folio in page fault
handler.  This may impact performance.  So, we may only do that for
sampled folios only, for example, every 100 folios.

>> 
>>> It happens to manifest with THP in cgroups because that's what you
>>> guys are testing. But IMO, any solution to this problem should
>>> consider the wider scope.
>>>
>>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>>>>>>>> not.
>>>>>>>>
>>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being
>>>>>>> seen here as there would be in swapin_readahead?
>>>>>>
>>>>>> Maybe readahead is not as aggressive in general as large folio
>>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>>>>>> of the window is the smaller of page_cluster (2 or 3) and
>>>>>> SWAP_RA_ORDER_CEILING (5).
>>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>>>>> be similar to enabling 32K mTHP?
>>>>
>>>> Not quite.
>>>
>>> Actually, I would expect it to be...
>> 
>> Me too.
>> 
>>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
>>>>>> contiguous allocation like large folio swapin. So that could be
>>>>>> another factor why readahead may not reproduce the problem.
>>>>
>>>> Because of this ^.
>>>
>>> ...this matters for the physical allocation, which might require more
>>> reclaim and compaction to produce the 32k. But an earlier version of
>>> Barry's patch did the cgroup margin fallback after the THP was already
>>> physically allocated, and it still helped.
>>>
>>> So the issue in this test scenario seems to be mostly about cgroup
>>> volume. And then 8 4k charges should be equivalent to a singular 32k
>>> charge when it comes to cgroup pressure.

--
Best Regards,
Huang, Ying
Barry Song Nov. 5, 2024, 1:13 a.m. UTC | #26
On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Usama Arif <usamaarif642@gmail.com> writes:
>
> > On 04/11/2024 06:42, Huang, Ying wrote:
> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >>
> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> >>>>>>>>> then the mitigation being tried here wont apply.
> >>>>>>>>
> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
> >>>>>>>> entire system is under memory pressure. I guess we would want the same
> >>>>>>>> "mitigation" either way.
> >>>>>>>>
> >>>>>>> What would be a good open source benchmark/workload to test without limiting memory
> >>>>>>> in memcg?
> >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
> >>>>>>> in cgroup and limit memory.max.
> >>>>>>
> >>>>>> You mean a benchmark that puts the entire system under memory
> >>>>>> pressure? I am not sure, it ultimately depends on the size of memory
> >>>>>> you have, among other factors.
> >>>>>>
> >>>>>> What if you run the kernel build test in a VM? Then you can limit is
> >>>>>> size like a memcg, although you'd probably need to leave more room
> >>>>>> because the entire guest OS will also subject to the same limit.
> >>>>>>
> >>>>>
> >>>>> I had tried this, but the variance in time/zswap numbers was very high.
> >>>>> Much higher than the AMD numbers I posted in reply to Barry. So found
> >>>>> it very difficult to make comparison.
> >>>>
> >>>> Hmm yeah maybe more factors come into play with global memory
> >>>> pressure. I am honestly not sure how to test this scenario, and I
> >>>> suspect variance will be high anyway.
> >>>>
> >>>> We can just try to use whatever technique we use for the memcg limit
> >>>> though, if possible, right?
> >>>
> >>> You can boot a physical machine with mem=1G on the commandline, which
> >>> restricts the physical range of memory that will be initialized.
> >>> Double check /proc/meminfo after boot, because part of that physical
> >>> range might not be usable RAM.
> >>>
> >>> I do this quite often to test physical memory pressure with workloads
> >>> that don't scale up easily, like kernel builds.
> >>>
> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
> >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> >>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
> >>>>>>>>> well.
> >>>
> >>> +1
> >>>
> >>> I also think there is too much focus on cgroup alone. The bigger issue
> >>> seems to be how much optimistic volume we swap in when we're under
> >>> pressure already. This applies to large folios and readahead; global
> >>> memory availability and cgroup limits.
> >>
> >> The current swap readahead logic is something like,
> >>
> >> 1. try readahead some pages for sequential access pattern, mark them as
> >>    readahead
> >>
> >> 2. if these readahead pages get accessed before swapped out again,
> >>    increase 'hits' counter
> >>
> >> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
> >>
> >> So, if there's heavy memory pressure, the readaheaded pages will not be
> >> accessed before being swapped out again (in 2 above), the readahead
> >> pages will be minimal.
> >>
> >> IMHO, mTHP swap-in is kind of swap readahead in effect.  That is, in
> >> addition to the pages accessed are swapped in, the adjacent pages are
> >> swapped in (swap readahead) too.  If these readahead pages are not
> >> accessed before swapped out again, system runs into more severe
> >> thrashing.  This is because we lack the swap readahead window scaling
> >> mechanism as above.  And, this is why I suggested to combine the swap
> >> readahead mechanism and mTHP swap-in by default before.  That is, when
> >> kernel swaps in a page, it checks current swap readahead window, and
> >> decides mTHP order according to window size.  So, if there are heavy
> >> memory pressure, so that the nearby pages will not be accessed before
> >> being swapped out again, the mTHP swap-in order can be adjusted
> >> automatically.
> >
> > This is a good idea to do, but I think the issue is that readahead
> > is a folio flag and not a page flag, so only works when folio size is 1.
> >
> > In the swapin_readahead swapcache path, the current implementation decides
> > the ra_window based on hits, which is incremented in swap_cache_get_folio
> > if it has not been gotten from swapcache before.
> > The problem would be that we need information on how many distinct pages in
> > a large folio that has been swapped in have been accessed to decide the
> > hits/window size, which I don't think is possible. As once the entire large
> > folio has been swapped in, we won't get a fault.
> >
>
> To do that, we need to move readahead flag to per-page from per-folio.
> And we need to map only the accessed page of the folio in page fault
> handler.  This may impact performance.  So, we may only do that for
> sampled folios only, for example, every 100 folios.

I'm not entirely sure there's a chance to gain traction on this, as the current
trend clearly leans toward moving flags from page to folio, not from folio to
page :-)

>
> >>
> >>> It happens to manifest with THP in cgroups because that's what you
> >>> guys are testing. But IMO, any solution to this problem should
> >>> consider the wider scope.
> >>>
> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am
> >>>>>>>> also not sure if the readahead window adjusts on memory pressure or
> >>>>>>>> not.
> >>>>>>>>
> >>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being
> >>>>>>> seen here as there would be in swapin_readahead?
> >>>>>>
> >>>>>> Maybe readahead is not as aggressive in general as large folio
> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> >>>>>> of the window is the smaller of page_cluster (2 or 3) and
> >>>>>> SWAP_RA_ORDER_CEILING (5).
> >>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
> >>>>> be similar to enabling 32K mTHP?
> >>>>
> >>>> Not quite.
> >>>
> >>> Actually, I would expect it to be...
> >>
> >> Me too.
> >>
> >>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
> >>>>>> contiguous allocation like large folio swapin. So that could be
> >>>>>> another factor why readahead may not reproduce the problem.
> >>>>
> >>>> Because of this ^.
> >>>
> >>> ...this matters for the physical allocation, which might require more
> >>> reclaim and compaction to produce the 32k. But an earlier version of
> >>> Barry's patch did the cgroup margin fallback after the THP was already
> >>> physically allocated, and it still helped.
> >>>
> >>> So the issue in this test scenario seems to be mostly about cgroup
> >>> volume. And then 8 4k charges should be equivalent to a singular 32k
> >>> charge when it comes to cgroup pressure.
>
> --
> Best Regards,
> Huang, Ying
Huang, Ying Nov. 5, 2024, 1:41 a.m. UTC | #27
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Usama Arif <usamaarif642@gmail.com> writes:
>>
>> > On 04/11/2024 06:42, Huang, Ying wrote:
>> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >>
>> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
>> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> >>>>>>>>> then the mitigation being tried here wont apply.
>> >>>>>>>>
>> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>> >>>>>>>> entire system is under memory pressure. I guess we would want the same
>> >>>>>>>> "mitigation" either way.
>> >>>>>>>>
>> >>>>>>> What would be a good open source benchmark/workload to test without limiting memory
>> >>>>>>> in memcg?
>> >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
>> >>>>>>> in cgroup and limit memory.max.
>> >>>>>>
>> >>>>>> You mean a benchmark that puts the entire system under memory
>> >>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>> >>>>>> you have, among other factors.
>> >>>>>>
>> >>>>>> What if you run the kernel build test in a VM? Then you can limit is
>> >>>>>> size like a memcg, although you'd probably need to leave more room
>> >>>>>> because the entire guest OS will also subject to the same limit.
>> >>>>>>
>> >>>>>
>> >>>>> I had tried this, but the variance in time/zswap numbers was very high.
>> >>>>> Much higher than the AMD numbers I posted in reply to Barry. So found
>> >>>>> it very difficult to make comparison.
>> >>>>
>> >>>> Hmm yeah maybe more factors come into play with global memory
>> >>>> pressure. I am honestly not sure how to test this scenario, and I
>> >>>> suspect variance will be high anyway.
>> >>>>
>> >>>> We can just try to use whatever technique we use for the memcg limit
>> >>>> though, if possible, right?
>> >>>
>> >>> You can boot a physical machine with mem=1G on the commandline, which
>> >>> restricts the physical range of memory that will be initialized.
>> >>> Double check /proc/meminfo after boot, because part of that physical
>> >>> range might not be usable RAM.
>> >>>
>> >>> I do this quite often to test physical memory pressure with workloads
>> >>> that don't scale up easily, like kernel builds.
>> >>>
>> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
>> >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>> >>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>> >>>>>>>>> well.
>> >>>
>> >>> +1
>> >>>
>> >>> I also think there is too much focus on cgroup alone. The bigger issue
>> >>> seems to be how much optimistic volume we swap in when we're under
>> >>> pressure already. This applies to large folios and readahead; global
>> >>> memory availability and cgroup limits.
>> >>
>> >> The current swap readahead logic is something like,
>> >>
>> >> 1. try readahead some pages for sequential access pattern, mark them as
>> >>    readahead
>> >>
>> >> 2. if these readahead pages get accessed before swapped out again,
>> >>    increase 'hits' counter
>> >>
>> >> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
>> >>
>> >> So, if there's heavy memory pressure, the readaheaded pages will not be
>> >> accessed before being swapped out again (in 2 above), the readahead
>> >> pages will be minimal.
>> >>
>> >> IMHO, mTHP swap-in is kind of swap readahead in effect.  That is, in
>> >> addition to the pages accessed are swapped in, the adjacent pages are
>> >> swapped in (swap readahead) too.  If these readahead pages are not
>> >> accessed before swapped out again, system runs into more severe
>> >> thrashing.  This is because we lack the swap readahead window scaling
>> >> mechanism as above.  And, this is why I suggested to combine the swap
>> >> readahead mechanism and mTHP swap-in by default before.  That is, when
>> >> kernel swaps in a page, it checks current swap readahead window, and
>> >> decides mTHP order according to window size.  So, if there are heavy
>> >> memory pressure, so that the nearby pages will not be accessed before
>> >> being swapped out again, the mTHP swap-in order can be adjusted
>> >> automatically.
>> >
>> > This is a good idea to do, but I think the issue is that readahead
>> > is a folio flag and not a page flag, so only works when folio size is 1.
>> >
>> > In the swapin_readahead swapcache path, the current implementation decides
>> > the ra_window based on hits, which is incremented in swap_cache_get_folio
>> > if it has not been gotten from swapcache before.
>> > The problem would be that we need information on how many distinct pages in
>> > a large folio that has been swapped in have been accessed to decide the
>> > hits/window size, which I don't think is possible. As once the entire large
>> > folio has been swapped in, we won't get a fault.
>> >
>>
>> To do that, we need to move readahead flag to per-page from per-folio.
>> And we need to map only the accessed page of the folio in page fault
>> handler.  This may impact performance.  So, we may only do that for
>> sampled folios only, for example, every 100 folios.
>
> I'm not entirely sure there's a chance to gain traction on this, as the current
> trend clearly leans toward moving flags from page to folio, not from folio to
> page :-)

This may be a problem.  However, I think we can try to find a solution
for this.  Anyway, we need some way to track per-page status in a folio,
regardless how to implement it.

>>
>> >>
>> >>> It happens to manifest with THP in cgroups because that's what you
>> >>> guys are testing. But IMO, any solution to this problem should
>> >>> consider the wider scope.
>> >>>
>> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>> >>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>> >>>>>>>> not.
>> >>>>>>>>
>> >>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being
>> >>>>>>> seen here as there would be in swapin_readahead?
>> >>>>>>
>> >>>>>> Maybe readahead is not as aggressive in general as large folio
>> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>> >>>>>> of the window is the smaller of page_cluster (2 or 3) and
>> >>>>>> SWAP_RA_ORDER_CEILING (5).
>> >>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>> >>>>> be similar to enabling 32K mTHP?
>> >>>>
>> >>>> Not quite.
>> >>>
>> >>> Actually, I would expect it to be...
>> >>
>> >> Me too.
>> >>
>> >>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
>> >>>>>> contiguous allocation like large folio swapin. So that could be
>> >>>>>> another factor why readahead may not reproduce the problem.
>> >>>>
>> >>>> Because of this ^.
>> >>>
>> >>> ...this matters for the physical allocation, which might require more
>> >>> reclaim and compaction to produce the 32k. But an earlier version of
>> >>> Barry's patch did the cgroup margin fallback after the THP was already
>> >>> physically allocated, and it still helped.
>> >>>
>> >>> So the issue in this test scenario seems to be mostly about cgroup
>> >>> volume. And then 8 4k charges should be equivalent to a singular 32k
>> >>> charge when it comes to cgroup pressure.

--
Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 524006313b0d..8bcc8f4af39f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -697,6 +697,9 @@  static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
 int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 		long nr_pages);
 
+int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
+				swp_entry_t *entry);
+
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				  gfp_t gfp, swp_entry_t entry);
 
@@ -1201,6 +1204,12 @@  static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
+		swp_entry_t *entry)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17af08367c68..f3d92b93ea6d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4530,6 +4530,51 @@  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 	return 0;
 }
 
+static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
+{
+	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+		if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
+			return false;
+	}
+
+	return true;
+}
+
+/**
+ * mem_cgroup_swapin_precharge_large_folio: Precharge large folios.
+ *
+ * @mm: mm context of the victim
+ * @entry: swap entry for which the folio will be allocated
+ *
+ * If we are arriving the edge of an almost full memcg, return error so that
+ * swap-in and anon faults can quickly fall back to small folios to avoid swap
+ * thrashing.
+ *
+ * Returns 0 on success, an error code on failure.
+ */
+int mem_cgroup_precharge_large_folio(struct mm_struct *mm, swp_entry_t *entry)
+{
+	struct mem_cgroup *memcg = NULL;
+	unsigned short id;
+	bool has_margin;
+
+	if (mem_cgroup_disabled())
+		return 0;
+
+	rcu_read_lock();
+	if (entry) {
+		id = lookup_swap_cgroup_id(*entry);
+		memcg = mem_cgroup_from_id(id);
+	}
+	if (!memcg || !css_tryget_online(&memcg->css))
+		memcg = get_mem_cgroup_from_mm(mm);
+	has_margin = mem_cgroup_has_margin(memcg);
+	rcu_read_unlock();
+
+	css_put(&memcg->css);
+	return has_margin ? 0 : -ENOMEM;
+}
+
 /**
  * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
  * @folio: folio to charge.
diff --git a/mm/memory.c b/mm/memory.c
index 0f614523b9f4..96368ba0e8a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4168,6 +4168,16 @@  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 
 	pte_unmap_unlock(pte, ptl);
 
+	if (!orders)
+		goto fallback;
+
+	/*
+	 * Avoid swapping in large folios when memcg is nearly full, as it
+	 * may quickly trigger additional swap-out and swap-in cycles.
+	 */
+	if (mem_cgroup_precharge_large_folio(vma->vm_mm, &entry))
+		goto fallback;
+
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {
@@ -4707,6 +4717,13 @@  static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	if (!orders)
 		goto fallback;
 
+	/*
+	 * When memcg is nearly full, large folios can rapidly fill
+	 * the margin and trigger new reclamation
+	 */
+	if (mem_cgroup_precharge_large_folio(vma->vm_mm, NULL))
+		goto fallback;
+
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {