Message ID | 20241027001444.3233-1-21cnbao@gmail.com (mailing list archive)
---|---
State | New
Series | [RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
On 27/10/2024 01:14, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > In a memcg where mTHP is always utilized, even at full capacity, it > might not be the best option. Consider a system that uses only small > folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX > of buffer space before it can initiate the next reclamation. However, > large folios can quickly fill this space, rapidly bringing the memcg > back to full capacity, even though some portions of the large folios > may not be immediately needed and used by the process. > > Usama and Kanchana identified a regression when building the kernel in > a memcg with memory.max set to a small value while enabling large > folio swap-in support on zswap[1]. > > The issue arises from an edge case where the memory cgroup remains > nearly full most of the time. Consequently, bringing in mTHP can > quickly cause a memcg overflow, triggering a swap-out. The subsequent > swap-in then recreates the overflow, resulting in a repetitive cycle. > > We need a mechanism to stop the cup from overflowing continuously. > One potential solution is to slow the filling process when we identify > that the cup is nearly full. > > Usama reported an improvement when we mitigate mTHP swap-in as the > memcg approaches full capacity[2]: > > int mem_cgroup_swapin_charge_folio(...) > { > ... > if (folio_test_large(folio) && > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > ret = -ENOMEM; > else > ret = charge_memcg(folio, memcg, gfp); > ... > } > > AMD 16K+32K THP=always > metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > real 1m23.038s 1m23.050s 1m22.704s > user 53m57.210s 53m53.437s 53m52.577s > sys 7m24.592s 7m48.843s 7m22.519s > zswpin 612070 999244 815934 > zswpout 2226403 2347979 2054980 > pgfault 20667366 20481728 20478690 > pgmajfault 385887 269117 309702 > > AMD 16K+32K+64K THP=always > metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > real 1m22.975s 1m23.266s 1m22.549s > user 53m51.302s 53m51.069s 53m46.471s > sys 7m40.168s 7m57.104s 7m25.012s > zswpin 676492 1258573 1225703 > zswpout 2449839 2714767 2899178 > pgfault 17540746 17296555 17234663 > pgmajfault 429629 307495 287859 > > I wonder if we can extend the mitigation to do_anonymous_page() as > well. Without hardware like AMD and ARM with hardware TLB coalescing > or CONT-PTE, I conducted a quick test on my Intel i9 workstation with > 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel > builds in a memcg with memory.max set to 1 GiB. 
> > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > > $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ > CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null > > disable-mTHP-swapin mm-unstable with-this-patch > Real: 6m54.595s 7m4.832s 6m45.811s > User: 66m42.795s 66m59.984s 67m21.150s > Sys: 12m7.092s 15m18.153s 12m52.644s > pswpin: 4262327 11723248 5918690 > pswpout: 14883774 19574347 14026942 > 64k-swpout: 624447 889384 480039 > 32k-swpout: 115473 242288 73874 > 16k-swpout: 158203 294672 109142 > 64k-swpin: 0 495869 159061 > 32k-swpin: 0 219977 56158 > 16k-swpin: 0 223501 81445 > hmm, both the user and sys time are worse with the patch compared to disable-mTHP-swapin. I wonder if the real time is an anomaly and if you repeat the experiment the real time might be worse as well? > I need Usama's assistance to identify a suitable patch, as I lack > access to hardware such as AMD machines and ARM servers with TLB > optimization. > > [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/ > [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/ > > Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > Cc: Usama Arif <usamaarif642@gmail.com> > Cc: David Hildenbrand <david@redhat.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Chris Li <chrisl@kernel.org> > Cc: Yosry Ahmed <yosryahmed@google.com> > Cc: "Huang, Ying" <ying.huang@intel.com> > Cc: Kairui Song <kasong@tencent.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Roman Gushchin <roman.gushchin@linux.dev> > Cc: Shakeel Butt <shakeel.butt@linux.dev> > Cc: Muchun Song <muchun.song@linux.dev> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > include/linux/memcontrol.h | 9 ++++++++ > mm/memcontrol.c | 45 ++++++++++++++++++++++++++++++++++++++ > mm/memory.c | 17 ++++++++++++++ > 3 files changed, 71 insertions(+) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 524006313b0d..8bcc8f4af39f 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > long nr_pages); > > +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, > + swp_entry_t *entry); > + > int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > gfp_t gfp, swp_entry_t entry); > > @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, > return 0; > } > > +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm, > + swp_entry_t *entry) > +{ > + return 0; > +} > + > static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, > struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) > { > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 17af08367c68..f3d92b93ea6d 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > return 0; > } > > +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg) > +{ > + for (; 
!mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { > + if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR) There might be 3 issues with the approach: Its a very big margin, lets say you have ARM64_64K_PAGES, and you have 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are basically saying you need 512M free memory to swapin just 256K? Its an uneven margin for different folio sizes. For 16K folio swapin, you are checking if there is margin for 128 folios, but for 1M folio swapin, you are checking there is margin for just 2 folios. Maybe it might be better to make this dependent on some factor of folio_nr_pages? As Johannes pointed out, the charging code already does the margin check. So for 4K, the check just checks if there is 4K available, but for 16K it checks if a lot more than 16K is available. Maybe there should be a similar policy for all? I guess this is similar to my 2nd point, but just considers 4K folios as well. Thanks, Usama > + return false; > + } > + > + return true; > +} > + > +/** > + * mem_cgroup_swapin_precharge_large_folio: Precharge large folios. > + * > + * @mm: mm context of the victim > + * @entry: swap entry for which the folio will be allocated > + * > + * If we are arriving the edge of an almost full memcg, return error so that > + * swap-in and anon faults can quickly fall back to small folios to avoid swap > + * thrashing. > + * > + * Returns 0 on success, an error code on failure. > + */ > +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, swp_entry_t *entry) > +{ > + struct mem_cgroup *memcg = NULL; > + unsigned short id; > + bool has_margin; > + > + if (mem_cgroup_disabled()) > + return 0; > + > + rcu_read_lock(); > + if (entry) { > + id = lookup_swap_cgroup_id(*entry); > + memcg = mem_cgroup_from_id(id); > + } > + if (!memcg || !css_tryget_online(&memcg->css)) > + memcg = get_mem_cgroup_from_mm(mm); > + has_margin = mem_cgroup_has_margin(memcg); > + rcu_read_unlock(); > + > + css_put(&memcg->css); > + return has_margin ? 0 : -ENOMEM; > +} > + > /** > * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. > * @folio: folio to charge. > diff --git a/mm/memory.c b/mm/memory.c > index 0f614523b9f4..96368ba0e8a6 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4168,6 +4168,16 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > pte_unmap_unlock(pte, ptl); > > + if (!orders) > + goto fallback; > + > + /* > + * Avoid swapping in large folios when memcg is nearly full, as it > + * may quickly trigger additional swap-out and swap-in cycles. > + */ > + if (mem_cgroup_precharge_large_folio(vma->vm_mm, &entry)) > + goto fallback; > + > /* Try allocating the highest of the remaining orders. */ > gfp = vma_thp_gfp_mask(vma); > while (orders) { > @@ -4707,6 +4717,13 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) > if (!orders) > goto fallback; > > + /* > + * When memcg is nearly full, large folios can rapidly fill > + * the margin and trigger new reclamation > + */ > + if (mem_cgroup_precharge_large_folio(vma->vm_mm, NULL)) > + goto fallback; > + > /* Try allocating the highest of the remaining orders. */ > gfp = vma_thp_gfp_mask(vma); > while (orders) {
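
For illustration, the proportional-margin idea suggested above (scaling the required headroom with the folio size instead of using a fixed HPAGE_PMD_NR) might look roughly like the sketch below. The helper name and the factor of 2 are assumptions made purely for illustration; they are not part of the posted patch.

/*
 * Hedged sketch: require memcg headroom proportional to the folio being
 * swapped in, walking up the hierarchy like the posted patch does. The
 * factor of 2 is an arbitrary illustrative choice, not a tested value.
 */
static inline bool swapin_margin_ok(struct mem_cgroup *memcg,
				    unsigned int nr_pages)
{
	unsigned long needed = 2UL * nr_pages;	/* assumed scaling factor */

	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
		if (mem_cgroup_margin(memcg) < needed)
			return false;
	}
	return true;
}

Such a check would at least be even across folio sizes; whether any fixed factor is the right policy is exactly what the benchmarking discussed in this thread would need to settle.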
On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 27/10/2024 01:14, Barry Song wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > > > In a memcg where mTHP is always utilized, even at full capacity, it > > might not be the best option. Consider a system that uses only small > > folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX > > of buffer space before it can initiate the next reclamation. However, > > large folios can quickly fill this space, rapidly bringing the memcg > > back to full capacity, even though some portions of the large folios > > may not be immediately needed and used by the process. > > > > Usama and Kanchana identified a regression when building the kernel in > > a memcg with memory.max set to a small value while enabling large > > folio swap-in support on zswap[1]. > > > > The issue arises from an edge case where the memory cgroup remains > > nearly full most of the time. Consequently, bringing in mTHP can > > quickly cause a memcg overflow, triggering a swap-out. The subsequent > > swap-in then recreates the overflow, resulting in a repetitive cycle. > > > > We need a mechanism to stop the cup from overflowing continuously. > > One potential solution is to slow the filling process when we identify > > that the cup is nearly full. > > > > Usama reported an improvement when we mitigate mTHP swap-in as the > > memcg approaches full capacity[2]: > > > > int mem_cgroup_swapin_charge_folio(...) > > { > > ... > > if (folio_test_large(folio) && > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > > ret = -ENOMEM; > > else > > ret = charge_memcg(folio, memcg, gfp); > > ... > > } > > > > AMD 16K+32K THP=always > > metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > > real 1m23.038s 1m23.050s 1m22.704s > > user 53m57.210s 53m53.437s 53m52.577s > > sys 7m24.592s 7m48.843s 7m22.519s > > zswpin 612070 999244 815934 > > zswpout 2226403 2347979 2054980 > > pgfault 20667366 20481728 20478690 > > pgmajfault 385887 269117 309702 > > > > AMD 16K+32K+64K THP=always > > metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > > real 1m22.975s 1m23.266s 1m22.549s > > user 53m51.302s 53m51.069s 53m46.471s > > sys 7m40.168s 7m57.104s 7m25.012s > > zswpin 676492 1258573 1225703 > > zswpout 2449839 2714767 2899178 > > pgfault 17540746 17296555 17234663 > > pgmajfault 429629 307495 287859 > > > > I wonder if we can extend the mitigation to do_anonymous_page() as > > well. Without hardware like AMD and ARM with hardware TLB coalescing > > or CONT-PTE, I conducted a quick test on my Intel i9 workstation with > > 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel > > builds in a memcg with memory.max set to 1 GiB. 
> > > > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled > > $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > > $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > > > > $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ > > CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null > > > > disable-mTHP-swapin mm-unstable with-this-patch > > Real: 6m54.595s 7m4.832s 6m45.811s > > User: 66m42.795s 66m59.984s 67m21.150s > > Sys: 12m7.092s 15m18.153s 12m52.644s > > pswpin: 4262327 11723248 5918690 > > pswpout: 14883774 19574347 14026942 > > 64k-swpout: 624447 889384 480039 > > 32k-swpout: 115473 242288 73874 > > 16k-swpout: 158203 294672 109142 > > 64k-swpin: 0 495869 159061 > > 32k-swpin: 0 219977 56158 > > 16k-swpin: 0 223501 81445 > > > Hi Usama, > hmm, both the user and sys time are worse with the patch compared to > disable-mTHP-swapin. I wonder if the real time is an anomaly and if you > repeat the experiment the real time might be worse as well? Well, I've improved my script to include a loop: echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled for ((i=1; i<=100; i++)) do echo "Executing round $i" make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null echo 3 > /proc/sys/vm/drop_caches time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null cat /proc/vmstat | grep pswp echo -n 64k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout echo -n 32k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout echo -n 16k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout echo -n 64k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin echo -n 32k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin echo -n 16k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin done I've noticed that the user/sys/real time on my i9 machine fluctuates constantly, could be things like: real 6m52.087s user 67m12.463s sys 13m8.281s ... real 7m42.937s user 66m55.250s sys 12m56.330s ... real 6m49.374s user 66m37.040s sys 12m44.542s ... real 6m54.205s user 65m49.732s sys 11m33.078s ... likely due to unstable temperatures and I/O latency. As a result, my data doesn’t seem reference-worthy. As a phone engineer, we never use phones to run kernel builds. I'm also quite certain that phones won't provide stable and reliable data for this type of workload. Without access to a Linux server to conduct the test, I really need your help. I used to work on optimizing the ARM server scheduler and memory management, and I really miss that machine I had until three years ago :-) > > > I need Usama's assistance to identify a suitable patch, as I lack > > access to hardware such as AMD machines and ARM servers with TLB > > optimization. 
> > > > [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/ > > [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/ > > > > Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > Cc: Usama Arif <usamaarif642@gmail.com> > > Cc: David Hildenbrand <david@redhat.com> > > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > > Cc: Chris Li <chrisl@kernel.org> > > Cc: Yosry Ahmed <yosryahmed@google.com> > > Cc: "Huang, Ying" <ying.huang@intel.com> > > Cc: Kairui Song <kasong@tencent.com> > > Cc: Ryan Roberts <ryan.roberts@arm.com> > > Cc: Johannes Weiner <hannes@cmpxchg.org> > > Cc: Michal Hocko <mhocko@kernel.org> > > Cc: Roman Gushchin <roman.gushchin@linux.dev> > > Cc: Shakeel Butt <shakeel.butt@linux.dev> > > Cc: Muchun Song <muchun.song@linux.dev> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > --- > > include/linux/memcontrol.h | 9 ++++++++ > > mm/memcontrol.c | 45 ++++++++++++++++++++++++++++++++++++++ > > mm/memory.c | 17 ++++++++++++++ > > 3 files changed, 71 insertions(+) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 524006313b0d..8bcc8f4af39f 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > > int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > > long nr_pages); > > > > +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, > > + swp_entry_t *entry); > > + > > int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > gfp_t gfp, swp_entry_t entry); > > > > @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, > > return 0; > > } > > > > +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm, > > + swp_entry_t *entry) > > +{ > > + return 0; > > +} > > + > > static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, > > struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) > > { > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 17af08367c68..f3d92b93ea6d 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > > return 0; > > } > > > > +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg) > > +{ > > + for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { > > + if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR) > > There might be 3 issues with the approach: > > Its a very big margin, lets say you have ARM64_64K_PAGES, and you have > 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are > basically saying you need 512M free memory to swapin just 256K? Right, sorry for the noisy code. I was just thinking about 4KB pages and wondering if we could simplify the code. > > Its an uneven margin for different folio sizes. > For 16K folio swapin, you are checking if there is margin for 128 folios, > but for 1M folio swapin, you are checking there is margin for just 2 folios. > > Maybe it might be better to make this dependent on some factor of folio_nr_pages? Agreed. This is similar to what we discussed regarding your zswap mTHP swap-in series: int mem_cgroup_swapin_charge_folio(...) { ... if (folio_test_large(folio) && mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) ret = -ENOMEM; else ret = charge_memcg(folio, memcg, gfp); ... 
} As someone focused on phones, my challenge is the absence of stable platforms to benchmark this type of workload. If possible, Usama, I would greatly appreciate it if you could take the lead on the patch. > > As Johannes pointed out, the charging code already does the margin check. > So for 4K, the check just checks if there is 4K available, but for 16K it checks > if a lot more than 16K is available. Maybe there should be a similar policy for > all? I guess this is similar to my 2nd point, but just considers 4K folios as > well. I don't think the charging code performs a margin check. It simply tries to charge the specified nr_pages (whether 1 or more). If nr_pages are available, the charge proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages. If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge, immediately filling the memcg. Shortly after, smaller folios—typically with blockable GFP—will quickly trigger additional reclamation. While nr_pages - 1 subpages of the large folio may not be immediately needed, they still occupy enough space to fill the memcg to capacity. My second point about the mitigation is as follows: For a system (or memcg) under severe memory pressure, especially one without hardware TLB optimization, is enabling mTHP always the right choice? Since mTHP operates at a larger granularity, some internal fragmentation is unavoidable, regardless of optimization. Could the mitigation code help in automatically tuning this fragmentation? > > Thanks, > Usama > > > > + return false; > > + } > > + > > + return true; > > +} > > + > > +/** > > + * mem_cgroup_swapin_precharge_large_folio: Precharge large folios. > > + * > > + * @mm: mm context of the victim > > + * @entry: swap entry for which the folio will be allocated > > + * > > + * If we are arriving the edge of an almost full memcg, return error so that > > + * swap-in and anon faults can quickly fall back to small folios to avoid swap > > + * thrashing. > > + * > > + * Returns 0 on success, an error code on failure. > > + */ > > +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, swp_entry_t *entry) > > +{ > > + struct mem_cgroup *memcg = NULL; > > + unsigned short id; > > + bool has_margin; > > + > > + if (mem_cgroup_disabled()) > > + return 0; > > + > > + rcu_read_lock(); > > + if (entry) { > > + id = lookup_swap_cgroup_id(*entry); > > + memcg = mem_cgroup_from_id(id); > > + } > > + if (!memcg || !css_tryget_online(&memcg->css)) > > + memcg = get_mem_cgroup_from_mm(mm); > > + has_margin = mem_cgroup_has_margin(memcg); > > + rcu_read_unlock(); > > + > > + css_put(&memcg->css); > > + return has_margin ? 0 : -ENOMEM; > > +} > > + > > /** > > * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. > > * @folio: folio to charge. > > diff --git a/mm/memory.c b/mm/memory.c > > index 0f614523b9f4..96368ba0e8a6 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -4168,6 +4168,16 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > > > pte_unmap_unlock(pte, ptl); > > > > + if (!orders) > > + goto fallback; > > + > > + /* > > + * Avoid swapping in large folios when memcg is nearly full, as it > > + * may quickly trigger additional swap-out and swap-in cycles. > > + */ > > + if (mem_cgroup_precharge_large_folio(vma->vm_mm, &entry)) > > + goto fallback; > > + > > /* Try allocating the highest of the remaining orders. 
*/ > > gfp = vma_thp_gfp_mask(vma); > > while (orders) { > > @@ -4707,6 +4717,13 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) > > if (!orders) > > goto fallback; > > > > + /* > > + * When memcg is nearly full, large folios can rapidly fill > > + * the margin and trigger new reclamation > > + */ > > + if (mem_cgroup_precharge_large_folio(vma->vm_mm, NULL)) > > + goto fallback; > > + > > /* Try allocating the highest of the remaining orders. */ > > gfp = vma_thp_gfp_mask(vma); > > while (orders) { > Thanks Barry
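
As background for the exchange above, the charge behaviour Barry describes can be summarised by the outline below. It paraphrases try_charge_memcg(); memcg_has_room(), memcg_reclaim() and commit_charge_pages() are illustrative placeholders, not real kernel internals.

/*
 * Rough outline of the charge path as described above: charging either
 * fits within the limit or, if the GFP mask allows blocking, reclaims at
 * least max(SWAP_CLUSTER_MAX, nr_pages) base pages and retries. No
 * headroom beyond nr_pages itself is required.
 */
static int charge_outline(struct mem_cgroup *memcg, gfp_t gfp,
			  unsigned long nr_pages)
{
	while (!memcg_has_room(memcg, nr_pages)) {	/* placeholder */
		if (!gfpflags_allow_blocking(gfp))
			return -ENOMEM;
		/* placeholder for memcg reclaim */
		memcg_reclaim(memcg, max(SWAP_CLUSTER_MAX, nr_pages), gfp);
	}
	return commit_charge_pages(memcg, nr_pages);	/* placeholder */
}

In this outline, a folio with nr_pages equal to whatever reclaim just freed can charge successfully and immediately refill the memcg, which is the thrashing pattern under discussion.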
On 28/10/2024 22:03, Barry Song wrote: > On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote: >> >> >> >> On 27/10/2024 01:14, Barry Song wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> In a memcg where mTHP is always utilized, even at full capacity, it >>> might not be the best option. Consider a system that uses only small >>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX >>> of buffer space before it can initiate the next reclamation. However, >>> large folios can quickly fill this space, rapidly bringing the memcg >>> back to full capacity, even though some portions of the large folios >>> may not be immediately needed and used by the process. >>> >>> Usama and Kanchana identified a regression when building the kernel in >>> a memcg with memory.max set to a small value while enabling large >>> folio swap-in support on zswap[1]. >>> >>> The issue arises from an edge case where the memory cgroup remains >>> nearly full most of the time. Consequently, bringing in mTHP can >>> quickly cause a memcg overflow, triggering a swap-out. The subsequent >>> swap-in then recreates the overflow, resulting in a repetitive cycle. >>> >>> We need a mechanism to stop the cup from overflowing continuously. >>> One potential solution is to slow the filling process when we identify >>> that the cup is nearly full. >>> >>> Usama reported an improvement when we mitigate mTHP swap-in as the >>> memcg approaches full capacity[2]: >>> >>> int mem_cgroup_swapin_charge_folio(...) >>> { >>> ... >>> if (folio_test_large(folio) && >>> mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) >>> ret = -ENOMEM; >>> else >>> ret = charge_memcg(folio, memcg, gfp); >>> ... >>> } >>> >>> AMD 16K+32K THP=always >>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix >>> real 1m23.038s 1m23.050s 1m22.704s >>> user 53m57.210s 53m53.437s 53m52.577s >>> sys 7m24.592s 7m48.843s 7m22.519s >>> zswpin 612070 999244 815934 >>> zswpout 2226403 2347979 2054980 >>> pgfault 20667366 20481728 20478690 >>> pgmajfault 385887 269117 309702 >>> >>> AMD 16K+32K+64K THP=always >>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix >>> real 1m22.975s 1m23.266s 1m22.549s >>> user 53m51.302s 53m51.069s 53m46.471s >>> sys 7m40.168s 7m57.104s 7m25.012s >>> zswpin 676492 1258573 1225703 >>> zswpout 2449839 2714767 2899178 >>> pgfault 17540746 17296555 17234663 >>> pgmajfault 429629 307495 287859 >>> >>> I wonder if we can extend the mitigation to do_anonymous_page() as >>> well. Without hardware like AMD and ARM with hardware TLB coalescing >>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with >>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel >>> builds in a memcg with memory.max set to 1 GiB. 
>>> >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled >>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled >>> >>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ >>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null >>> >>> disable-mTHP-swapin mm-unstable with-this-patch >>> Real: 6m54.595s 7m4.832s 6m45.811s >>> User: 66m42.795s 66m59.984s 67m21.150s >>> Sys: 12m7.092s 15m18.153s 12m52.644s >>> pswpin: 4262327 11723248 5918690 >>> pswpout: 14883774 19574347 14026942 >>> 64k-swpout: 624447 889384 480039 >>> 32k-swpout: 115473 242288 73874 >>> 16k-swpout: 158203 294672 109142 >>> 64k-swpin: 0 495869 159061 >>> 32k-swpin: 0 219977 56158 >>> 16k-swpin: 0 223501 81445 >>> >> > > Hi Usama, > >> hmm, both the user and sys time are worse with the patch compared to >> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you >> repeat the experiment the real time might be worse as well? > > Well, I've improved my script to include a loop: > > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > > for ((i=1; i<=100; i++)) > do > echo "Executing round $i" > make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null > echo 3 > /proc/sys/vm/drop_caches > time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ > CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null > cat /proc/vmstat | grep pswp > echo -n 64k-swpout: ; cat > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout > echo -n 32k-swpout: ; cat > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout > echo -n 16k-swpout: ; cat > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout > echo -n 64k-swpin: ; cat > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin > echo -n 32k-swpin: ; cat > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin > echo -n 16k-swpin: ; cat > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin > done > > I've noticed that the user/sys/real time on my i9 machine fluctuates > constantly, could be things > like: > real 6m52.087s > user 67m12.463s > sys 13m8.281s > ... > > real 7m42.937s > user 66m55.250s > sys 12m56.330s > ... > > real 6m49.374s > user 66m37.040s > sys 12m44.542s > ... > > real 6m54.205s > user 65m49.732s > sys 11m33.078s > ... > > likely due to unstable temperatures and I/O latency. As a result, my > data doesn’t seem > reference-worthy. > So I had suggested retrying the experiment to see how reproducible it is, but had not done that myself! Thanks for sharing this. I tried many times on the AMD server and I see varying numbers as well. AMD 16K THP always, cgroup = 4G, large folio zswapin patches real 1m28.351s user 54m14.476s sys 8m46.596s zswpin 811693 zswpout 2137310 pgfault 27344671 pgmajfault 290510 .. real 1m24.557s user 53m56.815s sys 8m10.200s zswpin 571532 zswpout 1645063 pgfault 26989075 pgmajfault 205177 .. real 1m26.083s user 54m5.303s sys 9m55.247s zswpin 1176292 zswpout 2910825 pgfault 27286835 pgmajfault 419746 The sys time can especially vary by large numbers. I think you see the same. 
> As a phone engineer, we never use phones to run kernel builds. I'm also > quite certain that phones won't provide stable and reliable data for this > type of workload. Without access to a Linux server to conduct the test, > I really need your help. > > I used to work on optimizing the ARM server scheduler and memory > management, and I really miss that machine I had until three years ago :-) > >> >>> I need Usama's assistance to identify a suitable patch, as I lack >>> access to hardware such as AMD machines and ARM servers with TLB >>> optimization. >>> >>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/ >>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/ >>> >>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> >>> Cc: Usama Arif <usamaarif642@gmail.com> >>> Cc: David Hildenbrand <david@redhat.com> >>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >>> Cc: Chris Li <chrisl@kernel.org> >>> Cc: Yosry Ahmed <yosryahmed@google.com> >>> Cc: "Huang, Ying" <ying.huang@intel.com> >>> Cc: Kairui Song <kasong@tencent.com> >>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>> Cc: Johannes Weiner <hannes@cmpxchg.org> >>> Cc: Michal Hocko <mhocko@kernel.org> >>> Cc: Roman Gushchin <roman.gushchin@linux.dev> >>> Cc: Shakeel Butt <shakeel.butt@linux.dev> >>> Cc: Muchun Song <muchun.song@linux.dev> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> --- >>> include/linux/memcontrol.h | 9 ++++++++ >>> mm/memcontrol.c | 45 ++++++++++++++++++++++++++++++++++++++ >>> mm/memory.c | 17 ++++++++++++++ >>> 3 files changed, 71 insertions(+) >>> >>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h >>> index 524006313b0d..8bcc8f4af39f 100644 >>> --- a/include/linux/memcontrol.h >>> +++ b/include/linux/memcontrol.h >>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, >>> int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, >>> long nr_pages); >>> >>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, >>> + swp_entry_t *entry); >>> + >>> int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, >>> gfp_t gfp, swp_entry_t entry); >>> >>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, >>> return 0; >>> } >>> >>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm, >>> + swp_entry_t *entry) >>> +{ >>> + return 0; >>> +} >>> + >>> static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, >>> struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) >>> { >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>> index 17af08367c68..f3d92b93ea6d 100644 >>> --- a/mm/memcontrol.c >>> +++ b/mm/memcontrol.c >>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, >>> return 0; >>> } >>> >>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg) >>> +{ >>> + for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { >>> + if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR) >> >> There might be 3 issues with the approach: >> >> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have >> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are >> basically saying you need 512M free memory to swapin just 256K? > > Right, sorry for the noisy code. I was just thinking about 4KB pages > and wondering > if we could simplify the code. > >> >> Its an uneven margin for different folio sizes. 
>> For 16K folio swapin, you are checking if there is margin for 128 folios, >> but for 1M folio swapin, you are checking there is margin for just 2 folios. >> >> Maybe it might be better to make this dependent on some factor of folio_nr_pages? > > Agreed. This is similar to what we discussed regarding your zswap mTHP > swap-in series: > > int mem_cgroup_swapin_charge_folio(...) > { > ... > if (folio_test_large(folio) && > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, > folio_nr_pages(folio))) > ret = -ENOMEM; > else > ret = charge_memcg(folio, memcg, gfp); > ... > } > > As someone focused on phones, my challenge is the absence of stable platforms to > benchmark this type of workload. If possible, Usama, I would greatly > appreciate it if > you could take the lead on the patch. > >> >> As Johannes pointed out, the charging code already does the margin check. >> So for 4K, the check just checks if there is 4K available, but for 16K it checks >> if a lot more than 16K is available. Maybe there should be a similar policy for >> all? I guess this is similar to my 2nd point, but just considers 4K folios as >> well. > > I don't think the charging code performs a margin check. It simply > tries to charge > the specified nr_pages (whether 1 or more). If nr_pages are available, > the charge > proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to > reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages. > So if you have defrag not set to always, it will not trigger reclamation. I think that is a bigger usecase, i.e. defrag=madvise,defer,etc is probably used much more then always. In the current code in that case try_charge_memcg will return -ENOMEM all the way to mem_cgroup_swapin_charge_folio and alloc_swap_folio will then try the next order. So eventhough it might not be calling the mem_cgroup_margin function, it is kind of is doing the same? > If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a > large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge, > immediately filling the memcg. > > Shortly after, smaller folios—typically with blockable GFP—will quickly trigger > additional reclamation. While nr_pages - 1 subpages of the large folio may not > be immediately needed, they still occupy enough space to fill the memcg to > capacity. > > My second point about the mitigation is as follows: For a system (or > memcg) under severe memory pressure, especially one without hardware TLB > optimization, is enabling mTHP always the right choice? Since mTHP operates at > a larger granularity, some internal fragmentation is unavoidable, regardless > of optimization. Could the mitigation code help in automatically tuning > this fragmentation? > I agree with the point that enabling mTHP always is not the right thing to do on all platforms. I also think it might be the case that enabling mTHP might be a good thing for some workloads, but enabling mTHP swapin along with it might not. As you said when you have apps switching between foreground and background in android, it probably makes sense to have large folio swapping, as you want to bringin all the pages from background app as quickly as possible. And also all the TLB optimizations and smaller lru overhead you get after you have brought in all the pages. Linux kernel build test doesnt really get to benefit from the TLB optimization and smaller lru overhead, as probably the pages are very short lived. 
So I think it doesn't show the benefit of large folio swapin properly, and large folio swapin should probably be disabled for this kind of workload, even though mTHP should be enabled.

I am not sure that the approach we are trying in this patch is the right way:
- This patch makes it a memcg issue, but you could have memcg disabled and then the mitigation being tried here won't apply.
- Instead of this being a large folio swapin issue, is it more of a readahead issue? If we use zswap (without the large folio swapin series) and change the window to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time when cgroup memory is limited, as readahead would probably cause swap thrashing as well.
- Instead of looking at cgroup margin, maybe we should try and look at the rate of change of workingset_restore_anon? This might be a lot more complicated to do, but it is probably the right metric to determine swap thrashing. It also means that this could be used in both the synchronous swapcache skipping path and the swapin_readahead path. (Thanks Johannes for suggesting this.)

With the large folio swapin, I do see the large improvement when considering only swapin performance and latency, in the same way as you saw with zram.
Maybe the right short-term approach is to have /sys/kernel/mm/transparent_hugepage/swapin and have that disabled by default to avoid regression. If the workload owner sees a benefit, they can enable it. I can add this when sending the next version of large folio zswapin if that makes sense?
Longer term, I can try and have a look at whether we can do something with workingset_restore_anon to improve things.

Thanks,
Usama
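
For reference, the opt-in knob floated above could be wired up roughly as sketched below. This is entirely hypothetical: no /sys/kernel/mm/transparent_hugepage/swapin file exists upstream, and the attribute registration and the check in alloc_swap_folio() are omitted.

/* Hypothetical sketch of the proposed opt-in toggle; not an existing knob. */
static bool thp_swapin_enabled __read_mostly;	/* default off to avoid regressions */

static ssize_t swapin_show(struct kobject *kobj,
			   struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", thp_swapin_enabled);
}

static ssize_t swapin_store(struct kobject *kobj,
			    struct kobj_attribute *attr,
			    const char *buf, size_t count)
{
	int err = kstrtobool(buf, &thp_swapin_enabled);

	return err ? err : count;
}
static struct kobj_attribute swapin_attr = __ATTR_RW(swapin);

alloc_swap_folio() would then fall back to order-0 whenever the flag is clear, so only workloads that explicitly opt in would see mTHP swap-in.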
[..] > > My second point about the mitigation is as follows: For a system (or > > memcg) under severe memory pressure, especially one without hardware TLB > > optimization, is enabling mTHP always the right choice? Since mTHP operates at > > a larger granularity, some internal fragmentation is unavoidable, regardless > > of optimization. Could the mitigation code help in automatically tuning > > this fragmentation? > > > > I agree with the point that enabling mTHP always is not the right thing to do > on all platforms. I also think it might be the case that enabling mTHP > might be a good thing for some workloads, but enabling mTHP swapin along with > it might not. > > As you said when you have apps switching between foreground and background > in android, it probably makes sense to have large folio swapping, as you > want to bringin all the pages from background app as quickly as possible. > And also all the TLB optimizations and smaller lru overhead you get after > you have brought in all the pages. > Linux kernel build test doesnt really get to benefit from the TLB optimization > and smaller lru overhead, as probably the pages are very short lived. So I > think it doesnt show the benefit of large folio swapin properly and > large folio swapin should probably be disabled for this kind of workload, > eventhough mTHP should be enabled. > > I am not sure that the approach we are trying in this patch is the right way: > - This patch makes it a memcg issue, but you could have memcg disabled and > then the mitigation being tried here wont apply. Is the problem reproducible without memcg? I imagine only if the entire system is under memory pressure. I guess we would want the same "mitigation" either way. > - Instead of this being a large folio swapin issue, is it more of a readahead > issue? If we zswap (without the large folio swapin series) and change the window > to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > when cgroup memory is limited as readahead would probably cause swap thrashing as > well. I think large folio swapin would make the problem worse anyway. I am also not sure if the readahead window adjusts on memory pressure or not. > - Instead of looking at cgroup margin, maybe we should try and look at > the rate of change of workingset_restore_anon? This might be a lot more complicated > to do, but probably is the right metric to determine swap thrashing. It also means > that this could be used in both the synchronous swapcache skipping path and > swapin_readahead path. > (Thanks Johannes for suggesting this) > > With the large folio swapin, I do see the large improvement when considering only > swapin performance and latency in the same way as you saw in zram. > Maybe the right short term approach is to have > /sys/kernel/mm/transparent_hugepage/swapin > and have that disabled by default to avoid regression. > If the workload owner sees a benefit, they can enable it. > I can add this when sending the next version of large folio zswapin if that makes > sense? I would honestly prefer we avoid this if possible. It's always easy to just put features behind knobs, and then users have the toil of figuring out if/when they can use it, or just give up. We should find a way to avoid the thrashing due to hitting the memcg limit (or being under global memory pressure), it seems like something the kernel should be able to do on its own. > Longer term I can try and have a look at if we can do something with > workingset_restore_anon to improve things. 
I am not a big fan of this, mainly because reading a stat from the kernel puts us in a situation where we have to choose between: - Doing a memcg stats flush in the kernel, which is something we are trying to move away from due to various problems we have been running into. - Using potentially stale stats (up to 2s), which may be fine but is suboptimal at best. We may have blips of thrashing due to stale stats not showing the refaults.
On 30/10/2024 19:51, Yosry Ahmed wrote: > [..] >>> My second point about the mitigation is as follows: For a system (or >>> memcg) under severe memory pressure, especially one without hardware TLB >>> optimization, is enabling mTHP always the right choice? Since mTHP operates at >>> a larger granularity, some internal fragmentation is unavoidable, regardless >>> of optimization. Could the mitigation code help in automatically tuning >>> this fragmentation? >>> >> >> I agree with the point that enabling mTHP always is not the right thing to do >> on all platforms. I also think it might be the case that enabling mTHP >> might be a good thing for some workloads, but enabling mTHP swapin along with >> it might not. >> >> As you said when you have apps switching between foreground and background >> in android, it probably makes sense to have large folio swapping, as you >> want to bringin all the pages from background app as quickly as possible. >> And also all the TLB optimizations and smaller lru overhead you get after >> you have brought in all the pages. >> Linux kernel build test doesnt really get to benefit from the TLB optimization >> and smaller lru overhead, as probably the pages are very short lived. So I >> think it doesnt show the benefit of large folio swapin properly and >> large folio swapin should probably be disabled for this kind of workload, >> eventhough mTHP should be enabled. >> >> I am not sure that the approach we are trying in this patch is the right way: >> - This patch makes it a memcg issue, but you could have memcg disabled and >> then the mitigation being tried here wont apply. > > Is the problem reproducible without memcg? I imagine only if the > entire system is under memory pressure. I guess we would want the same > "mitigation" either way. > What would be a good open source benchmark/workload to test without limiting memory in memcg? For the kernel build test, I can only get zswap activity to happen if I build in cgroup and limit memory.max. I can just run zswap large folio zswapin in production and see, but that will take me a few days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing, then maybe its not really an issue? I believe Barry doesnt see an issue in android phones (but please correct me if I am wrong), and if there isnt an issue in Meta production as well, its a good data point for servers as well. And maybe kernel build in 4G memcg is not a good test. >> - Instead of this being a large folio swapin issue, is it more of a readahead >> issue? If we zswap (without the large folio swapin series) and change the window >> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time >> when cgroup memory is limited as readahead would probably cause swap thrashing as >> well. > > I think large folio swapin would make the problem worse anyway. I am > also not sure if the readahead window adjusts on memory pressure or > not. > readahead window doesnt look at memory pressure. So maybe the same thing is being seen here as there would be in swapin_readahead? Maybe if we check kernel build test performance in 4G memcg with below diff, it might get better? 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4669f29cf555..9e196e1e6885 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	pgoff_t ilx;
 	bool page_allocated;
 
-	win = swap_vma_ra_win(vmf, &start, &end);
+	win = 1;
 	if (win == 1)
 		goto skip;

>> - Instead of looking at cgroup margin, maybe we should try and look at >> the rate of change of workingset_restore_anon? This might be a lot more complicated >> to do, but probably is the right metric to determine swap thrashing. It also means >> that this could be used in both the synchronous swapcache skipping path and >> swapin_readahead path. >> (Thanks Johannes for suggesting this) >> >> With the large folio swapin, I do see the large improvement when considering only >> swapin performance and latency in the same way as you saw in zram. >> Maybe the right short term approach is to have >> /sys/kernel/mm/transparent_hugepage/swapin >> and have that disabled by default to avoid regression. >> If the workload owner sees a benefit, they can enable it. >> I can add this when sending the next version of large folio zswapin if that makes >> sense? > > I would honestly prefer we avoid this if possible. It's always easy to > just put features behind knobs, and then users have the toil of > figuring out if/when they can use it, or just give up. We should find > a way to avoid the thrashing due to hitting the memcg limit (or being > under global memory pressure), it seems like something the kernel > should be able to do on its own. > >> Longer term I can try and have a look at if we can do something with > >> workingset_restore_anon to improve things. > > I am not a big fan of this, mainly because reading a stat from the > kernel puts us in a situation where we have to choose between: > - Doing a memcg stats flush in the kernel, which is something we are > trying to move away from due to various problems we have been running > into. > - Using potentially stale stats (up to 2s), which may be fine but is > suboptimal at best. We may have blips of thrashing due to stale stats > not showing the refaults.
On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 28/10/2024 22:03, Barry Song wrote: > > On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote: > >> > >> > >> > >> On 27/10/2024 01:14, Barry Song wrote: > >>> From: Barry Song <v-songbaohua@oppo.com> > >>> > >>> In a memcg where mTHP is always utilized, even at full capacity, it > >>> might not be the best option. Consider a system that uses only small > >>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX > >>> of buffer space before it can initiate the next reclamation. However, > >>> large folios can quickly fill this space, rapidly bringing the memcg > >>> back to full capacity, even though some portions of the large folios > >>> may not be immediately needed and used by the process. > >>> > >>> Usama and Kanchana identified a regression when building the kernel in > >>> a memcg with memory.max set to a small value while enabling large > >>> folio swap-in support on zswap[1]. > >>> > >>> The issue arises from an edge case where the memory cgroup remains > >>> nearly full most of the time. Consequently, bringing in mTHP can > >>> quickly cause a memcg overflow, triggering a swap-out. The subsequent > >>> swap-in then recreates the overflow, resulting in a repetitive cycle. > >>> > >>> We need a mechanism to stop the cup from overflowing continuously. > >>> One potential solution is to slow the filling process when we identify > >>> that the cup is nearly full. > >>> > >>> Usama reported an improvement when we mitigate mTHP swap-in as the > >>> memcg approaches full capacity[2]: > >>> > >>> int mem_cgroup_swapin_charge_folio(...) > >>> { > >>> ... > >>> if (folio_test_large(folio) && > >>> mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > >>> ret = -ENOMEM; > >>> else > >>> ret = charge_memcg(folio, memcg, gfp); > >>> ... > >>> } > >>> > >>> AMD 16K+32K THP=always > >>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > >>> real 1m23.038s 1m23.050s 1m22.704s > >>> user 53m57.210s 53m53.437s 53m52.577s > >>> sys 7m24.592s 7m48.843s 7m22.519s > >>> zswpin 612070 999244 815934 > >>> zswpout 2226403 2347979 2054980 > >>> pgfault 20667366 20481728 20478690 > >>> pgmajfault 385887 269117 309702 > >>> > >>> AMD 16K+32K+64K THP=always > >>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > >>> real 1m22.975s 1m23.266s 1m22.549s > >>> user 53m51.302s 53m51.069s 53m46.471s > >>> sys 7m40.168s 7m57.104s 7m25.012s > >>> zswpin 676492 1258573 1225703 > >>> zswpout 2449839 2714767 2899178 > >>> pgfault 17540746 17296555 17234663 > >>> pgmajfault 429629 307495 287859 > >>> > >>> I wonder if we can extend the mitigation to do_anonymous_page() as > >>> well. Without hardware like AMD and ARM with hardware TLB coalescing > >>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with > >>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel > >>> builds in a memcg with memory.max set to 1 GiB. 
> >>> > >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled > >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > >>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > >>> > >>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ > >>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null > >>> > >>> disable-mTHP-swapin mm-unstable with-this-patch > >>> Real: 6m54.595s 7m4.832s 6m45.811s > >>> User: 66m42.795s 66m59.984s 67m21.150s > >>> Sys: 12m7.092s 15m18.153s 12m52.644s > >>> pswpin: 4262327 11723248 5918690 > >>> pswpout: 14883774 19574347 14026942 > >>> 64k-swpout: 624447 889384 480039 > >>> 32k-swpout: 115473 242288 73874 > >>> 16k-swpout: 158203 294672 109142 > >>> 64k-swpin: 0 495869 159061 > >>> 32k-swpin: 0 219977 56158 > >>> 16k-swpin: 0 223501 81445 > >>> > >> > > > > Hi Usama, > > > >> hmm, both the user and sys time are worse with the patch compared to > >> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you > >> repeat the experiment the real time might be worse as well? > > > > Well, I've improved my script to include a loop: > > > > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled > > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > > > > for ((i=1; i<=100; i++)) > > do > > echo "Executing round $i" > > make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null > > echo 3 > /proc/sys/vm/drop_caches > > time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ > > CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null > > cat /proc/vmstat | grep pswp > > echo -n 64k-swpout: ; cat > > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout > > echo -n 32k-swpout: ; cat > > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout > > echo -n 16k-swpout: ; cat > > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout > > echo -n 64k-swpin: ; cat > > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin > > echo -n 32k-swpin: ; cat > > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin > > echo -n 16k-swpin: ; cat > > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin > > done > > > > I've noticed that the user/sys/real time on my i9 machine fluctuates > > constantly, could be things > > like: > > real 6m52.087s > > user 67m12.463s > > sys 13m8.281s > > ... > > > > real 7m42.937s > > user 66m55.250s > > sys 12m56.330s > > ... > > > > real 6m49.374s > > user 66m37.040s > > sys 12m44.542s > > ... > > > > real 6m54.205s > > user 65m49.732s > > sys 11m33.078s > > ... > > > > likely due to unstable temperatures and I/O latency. As a result, my > > data doesn’t seem > > reference-worthy. > > > > So I had suggested retrying the experiment to see how reproducible it is, > but had not done that myself! > Thanks for sharing this. I tried many times on the AMD server and I see > varying numbers as well. > > AMD 16K THP always, cgroup = 4G, large folio zswapin patches > real 1m28.351s > user 54m14.476s > sys 8m46.596s > zswpin 811693 > zswpout 2137310 > pgfault 27344671 > pgmajfault 290510 > .. 
> real 1m24.557s > user 53m56.815s > sys 8m10.200s > zswpin 571532 > zswpout 1645063 > pgfault 26989075 > pgmajfault 205177 > .. > real 1m26.083s > user 54m5.303s > sys 9m55.247s > zswpin 1176292 > zswpout 2910825 > pgfault 27286835 > pgmajfault 419746 > > > The sys time can especially vary by large numbers. I think you see the same. > > > > As a phone engineer, we never use phones to run kernel builds. I'm also > > quite certain that phones won't provide stable and reliable data for this > > type of workload. Without access to a Linux server to conduct the test, > > I really need your help. > > > > I used to work on optimizing the ARM server scheduler and memory > > management, and I really miss that machine I had until three years ago :-) > > > >> > >>> I need Usama's assistance to identify a suitable patch, as I lack > >>> access to hardware such as AMD machines and ARM servers with TLB > >>> optimization. > >>> > >>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/ > >>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/ > >>> > >>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > >>> Cc: Usama Arif <usamaarif642@gmail.com> > >>> Cc: David Hildenbrand <david@redhat.com> > >>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > >>> Cc: Chris Li <chrisl@kernel.org> > >>> Cc: Yosry Ahmed <yosryahmed@google.com> > >>> Cc: "Huang, Ying" <ying.huang@intel.com> > >>> Cc: Kairui Song <kasong@tencent.com> > >>> Cc: Ryan Roberts <ryan.roberts@arm.com> > >>> Cc: Johannes Weiner <hannes@cmpxchg.org> > >>> Cc: Michal Hocko <mhocko@kernel.org> > >>> Cc: Roman Gushchin <roman.gushchin@linux.dev> > >>> Cc: Shakeel Butt <shakeel.butt@linux.dev> > >>> Cc: Muchun Song <muchun.song@linux.dev> > >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>> --- > >>> include/linux/memcontrol.h | 9 ++++++++ > >>> mm/memcontrol.c | 45 ++++++++++++++++++++++++++++++++++++++ > >>> mm/memory.c | 17 ++++++++++++++ > >>> 3 files changed, 71 insertions(+) > >>> > >>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > >>> index 524006313b0d..8bcc8f4af39f 100644 > >>> --- a/include/linux/memcontrol.h > >>> +++ b/include/linux/memcontrol.h > >>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > >>> int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > >>> long nr_pages); > >>> > >>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, > >>> + swp_entry_t *entry); > >>> + > >>> int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > >>> gfp_t gfp, swp_entry_t entry); > >>> > >>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, > >>> return 0; > >>> } > >>> > >>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm, > >>> + swp_entry_t *entry) > >>> +{ > >>> + return 0; > >>> +} > >>> + > >>> static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, > >>> struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) > >>> { > >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >>> index 17af08367c68..f3d92b93ea6d 100644 > >>> --- a/mm/memcontrol.c > >>> +++ b/mm/memcontrol.c > >>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > >>> return 0; > >>> } > >>> > >>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg) > >>> +{ > >>> + for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { > >>> + 
if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR) > >> > >> There might be 3 issues with the approach: > >> > >> It's a very big margin: let's say you have ARM64_64K_PAGES, and you have > >> 256K THP set to always. As HPAGE_PMD_SIZE is 512M with 64K pages, you are > >> basically saying you need 512M of free memory to swap in just 256K? > > > > Right, sorry for the noisy code. I was just thinking about 4KB pages > > and wondering > > if we could simplify the code. > > > >> > >> It's an uneven margin for different folio sizes. > >> For 16K folio swapin, you are checking if there is margin for 128 folios, > >> but for 1M folio swapin, you are checking there is margin for just 2 folios. > >> > >> Maybe it might be better to make this dependent on some factor of folio_nr_pages? > > > > Agreed. This is similar to what we discussed regarding your zswap mTHP > > swap-in series: > > > > int mem_cgroup_swapin_charge_folio(...) > > { > > ... > > if (folio_test_large(folio) && > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, > > folio_nr_pages(folio))) > > ret = -ENOMEM; > > else > > ret = charge_memcg(folio, memcg, gfp); > > ... > > } > > > > As someone focused on phones, my challenge is the absence of stable platforms to > > benchmark this type of workload. If possible, Usama, I would greatly > > appreciate it if > > you could take the lead on the patch. > > > >> > >> As Johannes pointed out, the charging code already does the margin check. > >> So for 4K, the check just checks if there is 4K available, but for 16K it checks > >> if a lot more than 16K is available. Maybe there should be a similar policy for > >> all? I guess this is similar to my 2nd point, but it just considers 4K folios as > >> well. > > > > I don't think the charging code performs a margin check. It simply > > tries to charge > > the specified nr_pages (whether 1 or more). If nr_pages are available, > > the charge > > proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to > > reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages. > > > > So if you have defrag not set to always, it will not trigger reclamation. > I think that is a bigger use case, i.e. defrag=madvise,defer,etc. is probably > used much more than always. > > In the current code, in that case try_charge_memcg will return -ENOMEM all > the way to mem_cgroup_swapin_charge_folio, and alloc_swap_folio will then > try the next order. So even though it might not be calling the mem_cgroup_margin > function, it is kind of doing the same? > > > If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a > > large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge, > > immediately filling the memcg. > > > > Shortly after, smaller folios—typically with blockable GFP—will quickly trigger > > additional reclamation. While nr_pages - 1 subpages of the large folio may not > > be immediately needed, they still occupy enough space to fill the memcg to > > capacity. > > > > My second point about the mitigation is as follows: For a system (or > > memcg) under severe memory pressure, especially one without hardware TLB > > optimization, is enabling mTHP always the right choice? Since mTHP operates at > > a larger granularity, some internal fragmentation is unavoidable, regardless > > of optimization. Could the mitigation code help in automatically tuning > > this fragmentation? > > > > I agree with the point that enabling mTHP always is not the right thing to do > on all platforms. 
I also think it might be the case that enabling mTHP > might be a good thing for some workloads, but enabling mTHP swapin along with > it might not. > > As you said, when you have apps switching between foreground and background > in Android, it probably makes sense to have large folio swap-in, as you > want to bring in all the pages from the background app as quickly as possible, > and you also get all the TLB optimizations and smaller LRU overhead after > you have brought in all the pages. > The Linux kernel build test doesn't really get to benefit from the TLB optimization > and smaller LRU overhead, as the pages are probably very short-lived. So I > think it doesn't show the benefit of large folio swap-in properly, and > large folio swap-in should probably be disabled for this kind of workload, > even though mTHP should be enabled. I'm not entirely sure if this applies to platforms without TLB optimization, especially in the absence of swap. In a memory-limited cgroup without swap, would mTHP still cause significant thrashing of file-backed folios? When a large swap file is present, the inability to swap in mTHP seems to act as a workaround for fragmentation, allowing fragmented pages of the original mTHP from do_anonymous_page() to remain in swap. > > I am not sure that the approach we are trying in this patch is the right way: > - This patch makes it a memcg issue, but you could have memcg disabled and > then the mitigation being tried here won't apply. > - Instead of this being a large folio swap-in issue, is it more of a readahead > issue? If we use zswap (without the large folio swapin series) and change the window > to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time > when cgroup memory is limited, as readahead would probably cause swap thrashing as > well. > - Instead of looking at the cgroup margin, maybe we should try and look at > the rate of change of workingset_restore_anon? This might be a lot more complicated > to do, but it probably is the right metric to determine swap thrashing. It also means > that this could be used in both the synchronous swapcache skipping path and the > swapin_readahead path. > (Thanks Johannes for suggesting this.) > > With large folio swap-in, I do see the large improvement when considering only > swap-in performance and latency, in the same way as you saw with zram. > Maybe the right short-term approach is to have > /sys/kernel/mm/transparent_hugepage/swapin > and have that disabled by default to avoid a regression. A crucial component is still missing—managing the compression and decompression of multiple pages as a larger block. This could significantly reduce system time and potentially resolve the kernel build issue within a small memory cgroup, even with swap thrashing. I'll send an update ASAP so you can rebase for zswap. > If the workload owner sees a benefit, they can enable it. > I can add this when sending the next version of large folio zswapin if that makes > sense? > Longer term, I can try and have a look at whether we can do something with > workingset_restore_anon to improve things. > > Thanks, > Usama Thanks Barry
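A minimal sketch of the folio_nr_pages-proportional margin check discussed in the exchange above, for illustration only; the helper name and the factor of two are assumptions for the sketch, not code posted in this thread:

static bool mem_cgroup_swapin_margin_ok(struct mem_cgroup *memcg,
					unsigned int nr_pages)
{
	/* walk up the hierarchy, as the posted mem_cgroup_has_margin() does */
	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
		/* require headroom proportional to what is about to be charged */
		if (mem_cgroup_margin(memcg) < 2 * nr_pages)
			return false;
	}
	return true;
}

A caller such as mem_cgroup_swapin_charge_folio() could fall back to a smaller order when this returns false, which keeps the throttling even across folio sizes instead of demanding HPAGE_PMD_NR of headroom for every order.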
On 30/10/2024 20:27, Barry Song wrote: > On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote: >> On 28/10/2024 22:03, Barry Song wrote: >>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>> On 27/10/2024 01:14, Barry Song wrote: >>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>> >>>>> In a memcg where mTHP is always utilized, even at full capacity, it >>>>> might not be the best option. [...] >>> [...] >> [...] 
>> Maybe the right short term approach is to have >> /sys/kernel/mm/transparent_hugepage/swapin >> and have that disabled by default to avoid regression. > > A crucial component is still missing—managing the compression and decompression > of multiple pages as a larger block. This could significantly reduce > system time and > potentially resolve the kernel build issue within a small memory > cgroup, even with > swap thrashing. > > I'll send an update ASAP so you can rebase for zswap. Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/? That won't benefit zswap, right? I actually had a few questions about it. Mainly that the benefit comes if the page fault happens on page 0 of the large folio. But if the page fault happens on any other page, let's say page 1 of a 64K folio, then it will decompress the entire 64K chunk and just copy page 1 (the memcpy in zram_bvec_read_multi_pages_partial). Could that cause a regression, as you have to decompress a large chunk just to get one 4K page? If we assume a uniform distribution of page faults, maybe it could make things worse? 
I probably should ask all of this in that thread. > >> If the workload owner sees a benefit, they can enable it. >> I can add this when sending the next version of large folio zswapin if that makes >> sense? >> Longer term I can try and have a look at if we can do something with >> workingset_restore_anon to improve things. >> >> Thanks, >> Usama > > Thanks > Barry
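The /sys/kernel/mm/transparent_hugepage/swapin toggle floated above does not exist upstream at this point in the thread; a rough sketch of what such a knob could look like, modelled on the existing THP sysfs attributes, is below. The variable name, the attribute wiring and the disabled-by-default choice are all assumptions taken from the proposal, not an actual implementation.

static bool thp_swapin_enabled __read_mostly;	/* default off, as proposed */

static ssize_t swapin_show(struct kobject *kobj,
			   struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", thp_swapin_enabled);
}

static ssize_t swapin_store(struct kobject *kobj,
			    struct kobj_attribute *attr,
			    const char *buf, size_t count)
{
	int err = kstrtobool(buf, &thp_swapin_enabled);

	return err ? err : count;
}

static struct kobj_attribute swapin_attr = __ATTR_RW(swapin);

The swap-in fast path could then fall back to order-0 pages whenever the knob is off, leaving the rest of the mTHP behaviour untouched, so only workload owners who see a benefit would opt in.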
On Thu, Oct 31, 2024 at 9:41 AM Usama Arif <usamaarif642@gmail.com> wrote: > > On 30/10/2024 20:27, Barry Song wrote: > > On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote: > >> On 28/10/2024 22:03, Barry Song wrote: > >>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote: > >>>> On 27/10/2024 01:14, Barry Song wrote: > >>>>> From: Barry Song <v-songbaohua@oppo.com> > >>>>> > >>>>> In a memcg where mTHP is always utilized, even at full capacity, it > >>>>> might not be the best option. [...] > >> [...] > > A crucial component is still missing—managing the compression and decompression > > of multiple pages as a larger block. This could significantly reduce > > system time and > > potentially resolve the kernel build issue within a small memory > > cgroup, even with > > swap thrashing. > > > > I'll send an update ASAP so you can rebase for zswap. 
> > Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/? > That won't benefit zswap, right? That's right. I assume we can also make it work with zswap? > I actually had a few questions about it. Mainly that the benefit comes if the > page fault happens on page 0 of the large folio. But if the page fault happens > on any other page, let's say page 1 of a 64K folio, then it will decompress the > entire 64K chunk and just copy page 1 (the memcpy in zram_bvec_read_multi_pages_partial). > Could that cause a regression, as you have to decompress a large chunk just to > get one 4K page? > If we assume a uniform distribution of page faults, maybe it could make things worse? > > I probably should ask all of this in that thread. With mTHP swap-in, a page fault on any page behaves the same as a fault on page 0. Without mTHP swap-in, there's also no difference between faults on page 0 and other pages. A fault on any page means that the entire block is decompressed. The only difference is that we don't partially copy one page when mTHP swap-in is present. > > >> If the workload owner sees a benefit, they can enable it. > >> I can add this when sending the next version of large folio zswapin if that makes > >> sense? > >> Longer term I can try and have a look at whether we can do something with > >> workingset_restore_anon to improve things. > >> > >> Thanks, > >> Usama Thanks Barry
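A minimal sketch of why a fault on any subpage resolves to the same block, assuming the swap-in path rounds the faulting address down to the folio boundary (the ALIGN_DOWN in mm/memory.c that comes up in the next reply); the helper name here is made up for illustration:

static inline unsigned long mthp_swapin_block_start(unsigned long fault_addr,
						    unsigned int order)
{
	/* any fault inside the 2^order-page block maps to the same start */
	return ALIGN_DOWN(fault_addr, PAGE_SIZE << order);
}

With order = 4 on a 4K-page system (a 64K folio), faults at base, base + 0x1000 and base + 0xf000 all round down to the same base, so the same compressed block is fetched and decompressed regardless of which subpage faulted.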
On 30/10/2024 20:48, Barry Song wrote: > On Thu, Oct 31, 2024 at 9:41 AM Usama Arif <usamaarif642@gmail.com> wrote: >> On 30/10/2024 20:27, Barry Song wrote: >>> On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote: >>>> On 28/10/2024 22:03, Barry Song wrote: >>>>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>> On 27/10/2024 01:14, Barry Song wrote: >>>>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>>>> >>>>>>> In a memcg where mTHP is always utilized, even at full capacity, it >>>>>>> might not be the best option. [...] >>>> [...] >>> [...] >>> A crucial component is still missing—managing the compression and decompression >>> of multiple pages as a larger block. This could significantly reduce >>> system time and >>> potentially resolve the kernel build issue within a small memory >>> cgroup, even with >>> swap thrashing. >>> >>> I'll send an update ASAP so you can rebase for zswap. 
>> >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/? >> Thats wont benefit zswap, right? > > That's right. I assume we can also make it work with zswap? Hopefully yes. Thats mainly why I was looking at that series, to try and find a way to do something similar for zswap. > >> I actually had a few questions about it. Mainly that the benefit comes if the >> pagefault happens on page 0 of the large folio. But if the page fault happens >> on any other page, lets say page 1 of a 64K folio. then it will decompress the >> entire 64K chunk and just copy page 1? (memcpy in zram_bvec_read_multi_pages_partial). >> Could that cause a regression as you have to decompress a large chunk for just >> getting 1 4K page? >> If we assume uniform distribution of page faults, maybe it could make things worse? >> >> I probably should ask all of this in that thread. > > With mTHP swap-in, a page fault on any page behaves the same as a fault on > page 0. Without mTHP swap-in, there’s also no difference between > faults on page 0 > and other pages. Ah ok, its because of the ALIGN_DOWN in https://elixir.bootlin.com/linux/v6.12-rc5/source/mm/memory.c#L4158, right? > > A fault on any page means that the entire block is decompressed. The > only difference > is that we don’t partially copy one page when mTHP swap-in is present. > Ah so zram_bvec_read_multi_pages_partial would be called only if someone swaps out mTHP, disables it and then tries to do swapin? Thanks >> >>> >>>> If the workload owner sees a benefit, they can enable it. >>>> I can add this when sending the next version of large folio zswapin if that makes >>>> sense? >>>> Longer term I can try and have a look at if we can do something with >>>> workingset_restore_anon to improve things. >>>> >>>> Thanks, >>>> Usama > > Thanks > Barry
On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 30/10/2024 19:51, Yosry Ahmed wrote: > > [..] > >>> My second point about the mitigation is as follows: For a system (or > >>> memcg) under severe memory pressure, especially one without hardware TLB > >>> optimization, is enabling mTHP always the right choice? Since mTHP operates at > >>> a larger granularity, some internal fragmentation is unavoidable, regardless > >>> of optimization. Could the mitigation code help in automatically tuning > >>> this fragmentation? > >>> > >> > >> I agree with the point that enabling mTHP always is not the right thing to do > >> on all platforms. I also think it might be the case that enabling mTHP > >> might be a good thing for some workloads, but enabling mTHP swapin along with > >> it might not. > >> > >> As you said when you have apps switching between foreground and background > >> in android, it probably makes sense to have large folio swapping, as you > >> want to bringin all the pages from background app as quickly as possible. > >> And also all the TLB optimizations and smaller lru overhead you get after > >> you have brought in all the pages. > >> Linux kernel build test doesnt really get to benefit from the TLB optimization > >> and smaller lru overhead, as probably the pages are very short lived. So I > >> think it doesnt show the benefit of large folio swapin properly and > >> large folio swapin should probably be disabled for this kind of workload, > >> eventhough mTHP should be enabled. > >> > >> I am not sure that the approach we are trying in this patch is the right way: > >> - This patch makes it a memcg issue, but you could have memcg disabled and > >> then the mitigation being tried here wont apply. > > > > Is the problem reproducible without memcg? I imagine only if the > > entire system is under memory pressure. I guess we would want the same > > "mitigation" either way. > > > What would be a good open source benchmark/workload to test without limiting memory > in memcg? > For the kernel build test, I can only get zswap activity to happen if I build > in cgroup and limit memory.max. You mean a benchmark that puts the entire system under memory pressure? I am not sure, it ultimately depends on the size of memory you have, among other factors. What if you run the kernel build test in a VM? Then you can limit is size like a memcg, although you'd probably need to leave more room because the entire guest OS will also subject to the same limit. > > I can just run zswap large folio zswapin in production and see, but that will take me a few > days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing, > then maybe its not really an issue? I believe Barry doesnt see an issue in android > phones (but please correct me if I am wrong), and if there isnt an issue in Meta > production as well, its a good data point for servers as well. And maybe > kernel build in 4G memcg is not a good test. If there is a regression in the kernel build, this means some workloads may be affected, even if Meta's prod isn't. I understand that the benchmark is not very representative of real world workloads, but in this instance I think the thrashing problem surfaced by the benchmark is real. > > >> - Instead of this being a large folio swapin issue, is it more of a readahead > >> issue? 
If we zswap (without the large folio swapin series) and change the window > >> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > >> when cgroup memory is limited as readahead would probably cause swap thrashing as > >> well. > > > > I think large folio swapin would make the problem worse anyway. I am > > also not sure if the readahead window adjusts on memory pressure or > > not. > > > readahead window doesnt look at memory pressure. So maybe the same thing is being > seen here as there would be in swapin_readahead? Maybe readahead is not as aggressive in general as large folio swapins? Looking at swap_vma_ra_win(), it seems like the maximum order of the window is the smaller of page_cluster (2 or 3) and SWAP_RA_ORDER_CEILING (5). Also readahead will swapin 4k folios AFAICT, so we don't need a contiguous allocation like large folio swapin. So that could be another factor why readahead may not reproduce the problem. > Maybe if we check kernel build test > performance in 4G memcg with below diff, it might get better? I think you can use the page_cluster tunable to do this at runtime. > > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 4669f29cf555..9e196e1e6885 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, > pgoff_t ilx; > bool page_allocated; > > - win = swap_vma_ra_win(vmf, &start, &end); > + win = 1; > if (win == 1) > goto skip; >
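For reference, the same single-page behaviour as the win = 1 hack above can also be had at runtime through the vm.page_cluster sysctl (a sketch, assuming the standard tunable: page_cluster is log2 of the swap readahead window, so 0 effectively disables swap readahead):

$ cat /proc/sys/vm/page_cluster          # typically 3, i.e. up to 8 pages per swapin
$ echo 0 > /proc/sys/vm/page_cluster     # 2^0 = 1 page, readahead effectively off
$ # ... run the 4G-memcg kernel build test ...
$ echo 3 > /proc/sys/vm/page_cluster     # restore the default afterwards

This avoids rebuilding the kernel just to compare readahead-on vs readahead-off behaviour.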
On Thu, Oct 31, 2024 at 10:00 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 30/10/2024 20:48, Barry Song wrote: > > On Thu, Oct 31, 2024 at 9:41 AM Usama Arif <usamaarif642@gmail.com> wrote: > >> > >> > >> > >> On 30/10/2024 20:27, Barry Song wrote: > >>> On Thu, Oct 31, 2024 at 3:51 AM Usama Arif <usamaarif642@gmail.com> wrote: > >>>> > >>>> > >>>> > >>>> On 28/10/2024 22:03, Barry Song wrote: > >>>>> On Mon, Oct 28, 2024 at 8:07 PM Usama Arif <usamaarif642@gmail.com> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 27/10/2024 01:14, Barry Song wrote: > >>>>>>> From: Barry Song <v-songbaohua@oppo.com> > >>>>>>> > >>>>>>> In a memcg where mTHP is always utilized, even at full capacity, it > >>>>>>> might not be the best option. Consider a system that uses only small > >>>>>>> folios: after each reclamation, a process has at least SWAP_CLUSTER_MAX > >>>>>>> of buffer space before it can initiate the next reclamation. However, > >>>>>>> large folios can quickly fill this space, rapidly bringing the memcg > >>>>>>> back to full capacity, even though some portions of the large folios > >>>>>>> may not be immediately needed and used by the process. > >>>>>>> > >>>>>>> Usama and Kanchana identified a regression when building the kernel in > >>>>>>> a memcg with memory.max set to a small value while enabling large > >>>>>>> folio swap-in support on zswap[1]. > >>>>>>> > >>>>>>> The issue arises from an edge case where the memory cgroup remains > >>>>>>> nearly full most of the time. Consequently, bringing in mTHP can > >>>>>>> quickly cause a memcg overflow, triggering a swap-out. The subsequent > >>>>>>> swap-in then recreates the overflow, resulting in a repetitive cycle. > >>>>>>> > >>>>>>> We need a mechanism to stop the cup from overflowing continuously. > >>>>>>> One potential solution is to slow the filling process when we identify > >>>>>>> that the cup is nearly full. > >>>>>>> > >>>>>>> Usama reported an improvement when we mitigate mTHP swap-in as the > >>>>>>> memcg approaches full capacity[2]: > >>>>>>> > >>>>>>> int mem_cgroup_swapin_charge_folio(...) > >>>>>>> { > >>>>>>> ... > >>>>>>> if (folio_test_large(folio) && > >>>>>>> mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > >>>>>>> ret = -ENOMEM; > >>>>>>> else > >>>>>>> ret = charge_memcg(folio, memcg, gfp); > >>>>>>> ... > >>>>>>> } > >>>>>>> > >>>>>>> AMD 16K+32K THP=always > >>>>>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > >>>>>>> real 1m23.038s 1m23.050s 1m22.704s > >>>>>>> user 53m57.210s 53m53.437s 53m52.577s > >>>>>>> sys 7m24.592s 7m48.843s 7m22.519s > >>>>>>> zswpin 612070 999244 815934 > >>>>>>> zswpout 2226403 2347979 2054980 > >>>>>>> pgfault 20667366 20481728 20478690 > >>>>>>> pgmajfault 385887 269117 309702 > >>>>>>> > >>>>>>> AMD 16K+32K+64K THP=always > >>>>>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > >>>>>>> real 1m22.975s 1m23.266s 1m22.549s > >>>>>>> user 53m51.302s 53m51.069s 53m46.471s > >>>>>>> sys 7m40.168s 7m57.104s 7m25.012s > >>>>>>> zswpin 676492 1258573 1225703 > >>>>>>> zswpout 2449839 2714767 2899178 > >>>>>>> pgfault 17540746 17296555 17234663 > >>>>>>> pgmajfault 429629 307495 287859 > >>>>>>> > >>>>>>> I wonder if we can extend the mitigation to do_anonymous_page() as > >>>>>>> well. 
Without hardware like AMD and ARM with hardware TLB coalescing > >>>>>>> or CONT-PTE, I conducted a quick test on my Intel i9 workstation with > >>>>>>> 10 cores and 2 threads. I enabled one 12 GiB zRAM while running kernel > >>>>>>> builds in a memcg with memory.max set to 1 GiB. > >>>>>>> > >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled > >>>>>>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > >>>>>>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > >>>>>>> > >>>>>>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ > >>>>>>> CROSS_COMPILE=aarch64-linux-gnu- Image -10 1>/dev/null 2>/dev/null > >>>>>>> > >>>>>>> disable-mTHP-swapin mm-unstable with-this-patch > >>>>>>> Real: 6m54.595s 7m4.832s 6m45.811s > >>>>>>> User: 66m42.795s 66m59.984s 67m21.150s > >>>>>>> Sys: 12m7.092s 15m18.153s 12m52.644s > >>>>>>> pswpin: 4262327 11723248 5918690 > >>>>>>> pswpout: 14883774 19574347 14026942 > >>>>>>> 64k-swpout: 624447 889384 480039 > >>>>>>> 32k-swpout: 115473 242288 73874 > >>>>>>> 16k-swpout: 158203 294672 109142 > >>>>>>> 64k-swpin: 0 495869 159061 > >>>>>>> 32k-swpin: 0 219977 56158 > >>>>>>> 16k-swpin: 0 223501 81445 > >>>>>>> > >>>>>> > >>>>> > >>>>> Hi Usama, > >>>>> > >>>>>> hmm, both the user and sys time are worse with the patch compared to > >>>>>> disable-mTHP-swapin. I wonder if the real time is an anomaly and if you > >>>>>> repeat the experiment the real time might be worse as well? > >>>>> > >>>>> Well, I've improved my script to include a loop: > >>>>> > >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled > >>>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > >>>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled > >>>>> > >>>>> for ((i=1; i<=100; i++)) > >>>>> do > >>>>> echo "Executing round $i" > >>>>> make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null > >>>>> echo 3 > /proc/sys/vm/drop_caches > >>>>> time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \ > >>>>> CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null > >>>>> cat /proc/vmstat | grep pswp > >>>>> echo -n 64k-swpout: ; cat > >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout > >>>>> echo -n 32k-swpout: ; cat > >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout > >>>>> echo -n 16k-swpout: ; cat > >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout > >>>>> echo -n 64k-swpin: ; cat > >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin > >>>>> echo -n 32k-swpin: ; cat > >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin > >>>>> echo -n 16k-swpin: ; cat > >>>>> /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin > >>>>> done > >>>>> > >>>>> I've noticed that the user/sys/real time on my i9 machine fluctuates > >>>>> constantly, could be things > >>>>> like: > >>>>> real 6m52.087s > >>>>> user 67m12.463s > >>>>> sys 13m8.281s > >>>>> ... > >>>>> > >>>>> real 7m42.937s > >>>>> user 66m55.250s > >>>>> sys 12m56.330s > >>>>> ... > >>>>> > >>>>> real 6m49.374s > >>>>> user 66m37.040s > >>>>> sys 12m44.542s > >>>>> ... > >>>>> > >>>>> real 6m54.205s > >>>>> user 65m49.732s > >>>>> sys 11m33.078s > >>>>> ... 
> >>>>> > >>>>> likely due to unstable temperatures and I/O latency. As a result, my > >>>>> data doesn’t seem > >>>>> reference-worthy. > >>>>> > >>>> > >>>> So I had suggested retrying the experiment to see how reproducible it is, > >>>> but had not done that myself! > >>>> Thanks for sharing this. I tried many times on the AMD server and I see > >>>> varying numbers as well. > >>>> > >>>> AMD 16K THP always, cgroup = 4G, large folio zswapin patches > >>>> real 1m28.351s > >>>> user 54m14.476s > >>>> sys 8m46.596s > >>>> zswpin 811693 > >>>> zswpout 2137310 > >>>> pgfault 27344671 > >>>> pgmajfault 290510 > >>>> .. > >>>> real 1m24.557s > >>>> user 53m56.815s > >>>> sys 8m10.200s > >>>> zswpin 571532 > >>>> zswpout 1645063 > >>>> pgfault 26989075 > >>>> pgmajfault 205177 > >>>> .. > >>>> real 1m26.083s > >>>> user 54m5.303s > >>>> sys 9m55.247s > >>>> zswpin 1176292 > >>>> zswpout 2910825 > >>>> pgfault 27286835 > >>>> pgmajfault 419746 > >>>> > >>>> > >>>> The sys time can especially vary by large numbers. I think you see the same. > >>>> > >>>> > >>>>> As a phone engineer, we never use phones to run kernel builds. I'm also > >>>>> quite certain that phones won't provide stable and reliable data for this > >>>>> type of workload. Without access to a Linux server to conduct the test, > >>>>> I really need your help. > >>>>> > >>>>> I used to work on optimizing the ARM server scheduler and memory > >>>>> management, and I really miss that machine I had until three years ago :-) > >>>>> > >>>>>> > >>>>>>> I need Usama's assistance to identify a suitable patch, as I lack > >>>>>>> access to hardware such as AMD machines and ARM servers with TLB > >>>>>>> optimization. > >>>>>>> > >>>>>>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/ > >>>>>>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/ > >>>>>>> > >>>>>>> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > >>>>>>> Cc: Usama Arif <usamaarif642@gmail.com> > >>>>>>> Cc: David Hildenbrand <david@redhat.com> > >>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > >>>>>>> Cc: Chris Li <chrisl@kernel.org> > >>>>>>> Cc: Yosry Ahmed <yosryahmed@google.com> > >>>>>>> Cc: "Huang, Ying" <ying.huang@intel.com> > >>>>>>> Cc: Kairui Song <kasong@tencent.com> > >>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com> > >>>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org> > >>>>>>> Cc: Michal Hocko <mhocko@kernel.org> > >>>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev> > >>>>>>> Cc: Shakeel Butt <shakeel.butt@linux.dev> > >>>>>>> Cc: Muchun Song <muchun.song@linux.dev> > >>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>>>>>> --- > >>>>>>> include/linux/memcontrol.h | 9 ++++++++ > >>>>>>> mm/memcontrol.c | 45 ++++++++++++++++++++++++++++++++++++++ > >>>>>>> mm/memory.c | 17 ++++++++++++++ > >>>>>>> 3 files changed, 71 insertions(+) > >>>>>>> > >>>>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > >>>>>>> index 524006313b0d..8bcc8f4af39f 100644 > >>>>>>> --- a/include/linux/memcontrol.h > >>>>>>> +++ b/include/linux/memcontrol.h > >>>>>>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > >>>>>>> int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > >>>>>>> long nr_pages); > >>>>>>> > >>>>>>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm, > >>>>>>> + swp_entry_t *entry); > >>>>>>> + > >>>>>>> int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > 
>>>>>>> gfp_t gfp, swp_entry_t entry); > >>>>>>> > >>>>>>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, > >>>>>>> return 0; > >>>>>>> } > >>>>>>> > >>>>>>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm, > >>>>>>> + swp_entry_t *entry) > >>>>>>> +{ > >>>>>>> + return 0; > >>>>>>> +} > >>>>>>> + > >>>>>>> static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, > >>>>>>> struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) > >>>>>>> { > >>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >>>>>>> index 17af08367c68..f3d92b93ea6d 100644 > >>>>>>> --- a/mm/memcontrol.c > >>>>>>> +++ b/mm/memcontrol.c > >>>>>>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > >>>>>>> return 0; > >>>>>>> } > >>>>>>> > >>>>>>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg) > >>>>>>> +{ > >>>>>>> + for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { > >>>>>>> + if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR) > >>>>>> > >>>>>> There might be 3 issues with the approach: > >>>>>> > >>>>>> Its a very big margin, lets say you have ARM64_64K_PAGES, and you have > >>>>>> 256K THP set to always. As HPAGE_PMD is 512M for 64K page, you are > >>>>>> basically saying you need 512M free memory to swapin just 256K? > >>>>> > >>>>> Right, sorry for the noisy code. I was just thinking about 4KB pages > >>>>> and wondering > >>>>> if we could simplify the code. > >>>>> > >>>>>> > >>>>>> Its an uneven margin for different folio sizes. > >>>>>> For 16K folio swapin, you are checking if there is margin for 128 folios, > >>>>>> but for 1M folio swapin, you are checking there is margin for just 2 folios. > >>>>>> > >>>>>> Maybe it might be better to make this dependent on some factor of folio_nr_pages? > >>>>> > >>>>> Agreed. This is similar to what we discussed regarding your zswap mTHP > >>>>> swap-in series: > >>>>> > >>>>> int mem_cgroup_swapin_charge_folio(...) > >>>>> { > >>>>> ... > >>>>> if (folio_test_large(folio) && > >>>>> mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, > >>>>> folio_nr_pages(folio))) > >>>>> ret = -ENOMEM; > >>>>> else > >>>>> ret = charge_memcg(folio, memcg, gfp); > >>>>> ... > >>>>> } > >>>>> > >>>>> As someone focused on phones, my challenge is the absence of stable platforms to > >>>>> benchmark this type of workload. If possible, Usama, I would greatly > >>>>> appreciate it if > >>>>> you could take the lead on the patch. > >>>>> > >>>>>> > >>>>>> As Johannes pointed out, the charging code already does the margin check. > >>>>>> So for 4K, the check just checks if there is 4K available, but for 16K it checks > >>>>>> if a lot more than 16K is available. Maybe there should be a similar policy for > >>>>>> all? I guess this is similar to my 2nd point, but just considers 4K folios as > >>>>>> well. > >>>>> > >>>>> I don't think the charging code performs a margin check. It simply > >>>>> tries to charge > >>>>> the specified nr_pages (whether 1 or more). If nr_pages are available, > >>>>> the charge > >>>>> proceeds; otherwise, if GFP allows blocking, it triggers memory reclamation to > >>>>> reclaim max(SWAP_CLUSTER_MAX, nr_pages) base pages. > >>>>> > >>>> > >>>> So if you have defrag not set to always, it will not trigger reclamation. > >>>> I think that is a bigger usecase, i.e. defrag=madvise,defer,etc is probably > >>>> used much more then always. 
> >>>> > >>>> In the current code in that case try_charge_memcg will return -ENOMEM all > >>>> the way to mem_cgroup_swapin_charge_folio and alloc_swap_folio will then > >>>> try the next order. So eventhough it might not be calling the mem_cgroup_margin > >>>> function, it is kind of is doing the same? > >>>> > >>>>> If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a > >>>>> large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge, > >>>>> immediately filling the memcg. > >>>>> > >>>>> Shortly after, smaller folios—typically with blockable GFP—will quickly trigger > >>>>> additional reclamation. While nr_pages - 1 subpages of the large folio may not > >>>>> be immediately needed, they still occupy enough space to fill the memcg to > >>>>> capacity. > >>>>> > >>>>> My second point about the mitigation is as follows: For a system (or > >>>>> memcg) under severe memory pressure, especially one without hardware TLB > >>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at > >>>>> a larger granularity, some internal fragmentation is unavoidable, regardless > >>>>> of optimization. Could the mitigation code help in automatically tuning > >>>>> this fragmentation? > >>>>> > >>>> > >>>> I agree with the point that enabling mTHP always is not the right thing to do > >>>> on all platforms. I also think it might be the case that enabling mTHP > >>>> might be a good thing for some workloads, but enabling mTHP swapin along with > >>>> it might not. > >>>> > >>>> As you said when you have apps switching between foreground and background > >>>> in android, it probably makes sense to have large folio swapping, as you > >>>> want to bringin all the pages from background app as quickly as possible. > >>>> And also all the TLB optimizations and smaller lru overhead you get after > >>>> you have brought in all the pages. > >>>> Linux kernel build test doesnt really get to benefit from the TLB optimization > >>>> and smaller lru overhead, as probably the pages are very short lived. So I > >>>> think it doesnt show the benefit of large folio swapin properly and > >>>> large folio swapin should probably be disabled for this kind of workload, > >>>> eventhough mTHP should be enabled. > >>> > >>> I'm not entirely sure if this applies to platforms without TLB > >>> optimization, especially > >>> in the absence of swap. In a memory-limited cgroup without swap, would > >>> mTHP still > >>> cause significant thrashing of file-backed folios? When a large swap > >>> file is present, > >>> the inability to swap in mTHP seems to act as a workaround for fragmentation, > >>> allowing fragmented pages of the original mTHP from do_anonymous_page() to > >>> remain in swap. > >>> > >>>> > >>>> I am not sure that the approach we are trying in this patch is the right way: > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and > >>>> then the mitigation being tried here wont apply. > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead > >>>> issue? If we zswap (without the large folio swapin series) and change the window > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as > >>>> well. > >>>> - Instead of looking at cgroup margin, maybe we should try and look at > >>>> the rate of change of workingset_restore_anon? 
This might be a lot more complicated > >>>> to do, but probably is the right metric to determine swap thrashing. It also means > >>>> that this could be used in both the synchronous swapcache skipping path and > >>>> swapin_readahead path. > >>>> (Thanks Johannes for suggesting this) > >>>> > >>>> With the large folio swapin, I do see the large improvement when considering only > >>>> swapin performance and latency in the same way as you saw in zram. > >>>> Maybe the right short term approach is to have > >>>> /sys/kernel/mm/transparent_hugepage/swapin > >>>> and have that disabled by default to avoid regression. > >>> > >>> A crucial component is still missing—managing the compression and decompression > >>> of multiple pages as a larger block. This could significantly reduce > >>> system time and > >>> potentially resolve the kernel build issue within a small memory > >>> cgroup, even with > >>> swap thrashing. > >>> > >>> I’ll send an update ASAP so you can rebase for zswap. > >> > >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/? > >> Thats wont benefit zswap, right? > > > > That's right. I assume we can also make it work with zswap? > > Hopefully yes. Thats mainly why I was looking at that series, to try and find > a way to do something similar for zswap. > > > >> I actually had a few questions about it. Mainly that the benefit comes if the > >> pagefault happens on page 0 of the large folio. But if the page fault happens > >> on any other page, lets say page 1 of a 64K folio. then it will decompress the > >> entire 64K chunk and just copy page 1? (memcpy in zram_bvec_read_multi_pages_partial). > >> Could that cause a regression as you have to decompress a large chunk for just > >> getting 1 4K page? > >> If we assume uniform distribution of page faults, maybe it could make things worse? > >> > >> I probably should ask all of this in that thread. > > > > With mTHP swap-in, a page fault on any page behaves the same as a fault on > > page 0. Without mTHP swap-in, there’s also no difference between > > faults on page 0 > > and other pages. > > Ah ok, its because of the ALIGN_DOWN in > https://elixir.bootlin.com/linux/v6.12-rc5/source/mm/memory.c#L4158, > right? right. > > > > A fault on any page means that the entire block is decompressed. The > > only difference > > is that we don’t partially copy one page when mTHP swap-in is present. > > > Ah so zram_bvec_read_multi_pages_partial would be called only > if someone swaps out mTHP, disables it and then tries to do swapin? > For example, if the block contains 16KB of original data but only 4KB is swapped in without mTHP swap-in, this means we decompress the entire 16KB while only copying a portion of it to do_swap_page(). So likely compression/ decompression of large blocks without mTHP swapin can make things worse though it brings higher compression ratio. > Thanks > > >> > >>> > >>>> If the workload owner sees a benefit, they can enable it. > >>>> I can add this when sending the next version of large folio zswapin if that makes > >>>> sense? > >>>> Longer term I can try and have a look at if we can do something with > >>>> workingset_restore_anon to improve things. > >>>> > >>>> Thanks, > >>>> Usama > > Thanks Barry
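To make the ALIGN_DOWN point above concrete, here is a tiny stand-alone illustration (not kernel code; the fault address and the 64K folio size are made-up example values) of why a fault on any subpage leads to the same folio-aligned swap-in as a fault on page 0:

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))	/* a must be a power of two */

int main(void)
{
	unsigned long folio_size = 16 * PAGE_SIZE;	/* a 64K (16-page) mTHP */
	unsigned long fault_addr = 0x7f0000005000UL;	/* fault on subpage 5 of that folio */
	unsigned long start = ALIGN_DOWN(fault_addr, folio_size);

	/* The swap-in starts at the folio-aligned address, so faulting on
	 * subpage 5 reads/decompresses the same 64K block as a fault on
	 * subpage 0 would. */
	printf("fault at %#lx -> 64K swap-in starting at %#lx\n",
	       fault_addr, start);
	return 0;
}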
[..] > >>> A crucial component is still missing—managing the compression and decompression > >>> of multiple pages as a larger block. This could significantly reduce > >>> system time and > >>> potentially resolve the kernel build issue within a small memory > >>> cgroup, even with > >>> swap thrashing. > >>> > >>> I’ll send an update ASAP so you can rebase for zswap. > >> > >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/? > >> Thats wont benefit zswap, right? > > > > That's right. I assume we can also make it work with zswap? > > Hopefully yes. Thats mainly why I was looking at that series, to try and find > a way to do something similar for zswap. I would prefer for these things to be done separately. We still need to evaluate the compression/decompression of large blocks. I am mainly concerned about having to decompress a large chunk to fault in one page. The obvious problems are fault latency, and wasted work having to consistently decompress the large chunk to take one page from it. We also need to decide if we'd rather split it after decompression and compress the parts that we didn't swap in separately. This can cause problems beyond the fault latency. Imagine the case where the system is under memory pressure, so we fallback to order-0 swapin to avoid reclaim. Now we want to decompress a chunk that used to be 64K. We need to allocate 64K of contiguous memory for a temporary allocation to be able to fault a 4K page. Now we either need to: - Go into reclaim, which we were trying to avoid to begin with. - Dip into reserves to allocate the 64K as it's a temporary allocation. This is probably risky because under memory pressure, many CPUs may be doing this concurrently.
On 30/10/2024 21:01, Yosry Ahmed wrote: > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: >> >> >> >> On 30/10/2024 19:51, Yosry Ahmed wrote: >>> [..] >>>>> My second point about the mitigation is as follows: For a system (or >>>>> memcg) under severe memory pressure, especially one without hardware TLB >>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at >>>>> a larger granularity, some internal fragmentation is unavoidable, regardless >>>>> of optimization. Could the mitigation code help in automatically tuning >>>>> this fragmentation? >>>>> >>>> >>>> I agree with the point that enabling mTHP always is not the right thing to do >>>> on all platforms. I also think it might be the case that enabling mTHP >>>> might be a good thing for some workloads, but enabling mTHP swapin along with >>>> it might not. >>>> >>>> As you said when you have apps switching between foreground and background >>>> in android, it probably makes sense to have large folio swapping, as you >>>> want to bringin all the pages from background app as quickly as possible. >>>> And also all the TLB optimizations and smaller lru overhead you get after >>>> you have brought in all the pages. >>>> Linux kernel build test doesnt really get to benefit from the TLB optimization >>>> and smaller lru overhead, as probably the pages are very short lived. So I >>>> think it doesnt show the benefit of large folio swapin properly and >>>> large folio swapin should probably be disabled for this kind of workload, >>>> eventhough mTHP should be enabled. >>>> >>>> I am not sure that the approach we are trying in this patch is the right way: >>>> - This patch makes it a memcg issue, but you could have memcg disabled and >>>> then the mitigation being tried here wont apply. >>> >>> Is the problem reproducible without memcg? I imagine only if the >>> entire system is under memory pressure. I guess we would want the same >>> "mitigation" either way. >>> >> What would be a good open source benchmark/workload to test without limiting memory >> in memcg? >> For the kernel build test, I can only get zswap activity to happen if I build >> in cgroup and limit memory.max. > > You mean a benchmark that puts the entire system under memory > pressure? I am not sure, it ultimately depends on the size of memory > you have, among other factors. > > What if you run the kernel build test in a VM? Then you can limit is > size like a memcg, although you'd probably need to leave more room > because the entire guest OS will also subject to the same limit. > I had tried this, but the variance in time/zswap numbers was very high. Much higher than the AMD numbers I posted in reply to Barry. So found it very difficult to make comparison. >> >> I can just run zswap large folio zswapin in production and see, but that will take me a few >> days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing, >> then maybe its not really an issue? I believe Barry doesnt see an issue in android >> phones (but please correct me if I am wrong), and if there isnt an issue in Meta >> production as well, its a good data point for servers as well. And maybe >> kernel build in 4G memcg is not a good test. > > If there is a regression in the kernel build, this means some > workloads may be affected, even if Meta's prod isn't. 
I understand > that the benchmark is not very representative of real world workloads, > but in this instance I think the thrashing problem surfaced by the > benchmark is real. > >> >>>> - Instead of this being a large folio swapin issue, is it more of a readahead >>>> issue? If we zswap (without the large folio swapin series) and change the window >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as >>>> well. >>> >>> I think large folio swapin would make the problem worse anyway. I am >>> also not sure if the readahead window adjusts on memory pressure or >>> not. >>> >> readahead window doesnt look at memory pressure. So maybe the same thing is being >> seen here as there would be in swapin_readahead? > > Maybe readahead is not as aggressive in general as large folio > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order > of the window is the smaller of page_cluster (2 or 3) and > SWAP_RA_ORDER_CEILING (5). Yes, I was seeing 8 pages swapin (order 3) when testing. So might be similar to enabling 32K mTHP? > > Also readahead will swapin 4k folios AFAICT, so we don't need a > contiguous allocation like large folio swapin. So that could be > another factor why readahead may not reproduce the problem. > >> Maybe if we check kernel build test >> performance in 4G memcg with below diff, it might get better? > > I think you can use the page_cluster tunable to do this at runtime. > >> >> diff --git a/mm/swap_state.c b/mm/swap_state.c >> index 4669f29cf555..9e196e1e6885 100644 >> --- a/mm/swap_state.c >> +++ b/mm/swap_state.c >> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, >> pgoff_t ilx; >> bool page_allocated; >> >> - win = swap_vma_ra_win(vmf, &start, &end); >> + win = 1; >> if (win == 1) >> goto skip; >>
On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 30/10/2024 21:01, Yosry Ahmed wrote: > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: > >> > >> > >> > >> On 30/10/2024 19:51, Yosry Ahmed wrote: > >>> [..] > >>>>> My second point about the mitigation is as follows: For a system (or > >>>>> memcg) under severe memory pressure, especially one without hardware TLB > >>>>> optimization, is enabling mTHP always the right choice? Since mTHP operates at > >>>>> a larger granularity, some internal fragmentation is unavoidable, regardless > >>>>> of optimization. Could the mitigation code help in automatically tuning > >>>>> this fragmentation? > >>>>> > >>>> > >>>> I agree with the point that enabling mTHP always is not the right thing to do > >>>> on all platforms. I also think it might be the case that enabling mTHP > >>>> might be a good thing for some workloads, but enabling mTHP swapin along with > >>>> it might not. > >>>> > >>>> As you said when you have apps switching between foreground and background > >>>> in android, it probably makes sense to have large folio swapping, as you > >>>> want to bringin all the pages from background app as quickly as possible. > >>>> And also all the TLB optimizations and smaller lru overhead you get after > >>>> you have brought in all the pages. > >>>> Linux kernel build test doesnt really get to benefit from the TLB optimization > >>>> and smaller lru overhead, as probably the pages are very short lived. So I > >>>> think it doesnt show the benefit of large folio swapin properly and > >>>> large folio swapin should probably be disabled for this kind of workload, > >>>> eventhough mTHP should be enabled. > >>>> > >>>> I am not sure that the approach we are trying in this patch is the right way: > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and > >>>> then the mitigation being tried here wont apply. > >>> > >>> Is the problem reproducible without memcg? I imagine only if the > >>> entire system is under memory pressure. I guess we would want the same > >>> "mitigation" either way. > >>> > >> What would be a good open source benchmark/workload to test without limiting memory > >> in memcg? > >> For the kernel build test, I can only get zswap activity to happen if I build > >> in cgroup and limit memory.max. > > > > You mean a benchmark that puts the entire system under memory > > pressure? I am not sure, it ultimately depends on the size of memory > > you have, among other factors. > > > > What if you run the kernel build test in a VM? Then you can limit is > > size like a memcg, although you'd probably need to leave more room > > because the entire guest OS will also subject to the same limit. > > > > I had tried this, but the variance in time/zswap numbers was very high. > Much higher than the AMD numbers I posted in reply to Barry. So found > it very difficult to make comparison. Hmm yeah maybe more factors come into play with global memory pressure. I am honestly not sure how to test this scenario, and I suspect variance will be high anyway. We can just try to use whatever technique we use for the memcg limit though, if possible, right? > > >> > >> I can just run zswap large folio zswapin in production and see, but that will take me a few > >> days. tbh, running in prod is a much better test, and if there isn't any sort of thrashing, > >> then maybe its not really an issue? 
I believe Barry doesnt see an issue in android > >> phones (but please correct me if I am wrong), and if there isnt an issue in Meta > >> production as well, its a good data point for servers as well. And maybe > >> kernel build in 4G memcg is not a good test. > > > > If there is a regression in the kernel build, this means some > > workloads may be affected, even if Meta's prod isn't. I understand > > that the benchmark is not very representative of real world workloads, > > but in this instance I think the thrashing problem surfaced by the > > benchmark is real. > > > >> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead > >>>> issue? If we zswap (without the large folio swapin series) and change the window > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as > >>>> well. > >>> > >>> I think large folio swapin would make the problem worse anyway. I am > >>> also not sure if the readahead window adjusts on memory pressure or > >>> not. > >>> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being > >> seen here as there would be in swapin_readahead? > > > > Maybe readahead is not as aggressive in general as large folio > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order > > of the window is the smaller of page_cluster (2 or 3) and > > SWAP_RA_ORDER_CEILING (5). > Yes, I was seeing 8 pages swapin (order 3) when testing. So might > be similar to enabling 32K mTHP? Not quite. > > > > > Also readahead will swapin 4k folios AFAICT, so we don't need a > > contiguous allocation like large folio swapin. So that could be > > another factor why readahead may not reproduce the problem. Because of this ^.
On Thu, Oct 31, 2024 at 10:10 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > [..] > > >>> A crucial component is still missing—managing the compression and decompression > > >>> of multiple pages as a larger block. This could significantly reduce > > >>> system time and > > >>> potentially resolve the kernel build issue within a small memory > > >>> cgroup, even with > > >>> swap thrashing. > > >>> > > >>> I’ll send an update ASAP so you can rebase for zswap. > > >> > > >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/? > > >> Thats wont benefit zswap, right? > > > > > > That's right. I assume we can also make it work with zswap? > > > > Hopefully yes. Thats mainly why I was looking at that series, to try and find > > a way to do something similar for zswap. > > I would prefer for these things to be done separately. We still need > to evaluate the compression/decompression of large blocks. I am mainly > concerned about having to decompress a large chunk to fault in one > page. > > The obvious problems are fault latency, and wasted work having to > consistently decompress the large chunk to take one page from it. We > also need to decide if we'd rather split it after decompression and > compress the parts that we didn't swap in separately. > > This can cause problems beyond the fault latency. Imagine the case > where the system is under memory pressure, so we fallback to order-0 > swapin to avoid reclaim. Now we want to decompress a chunk that used > to be 64K. Yes, this could be an issue. We had actually tried to utilize several buffers for those partial swap-in cases, where the decompressed data was held in anticipation of the upcoming swap-in. This approach could address the majority of partial swap-ins in the fallback scenarios. > > We need to allocate 64K of contiguous memory for a temporary > allocation to be able to fault a 4K page. Now we either need to: > - Go into reclaim, which we were trying to avoid to begin with. > - Dip into reserves to allocate the 64K as it's a temporary > allocation. This is probably risky because under memory pressure, many > CPUs may be doing this concurrently. This has been addressed by using contiguous memory that is prepared on a per-CPU basis; see the note "alloc_pages() might fail, so we don't depend on allocation:" in https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/ Thanks Barry
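To illustrate the general idea being referenced (a rough sketch only, not the code from the zram series; the names zram_scratch_buf, ZRAM_SCRATCH_ORDER and zram_scratch_init are made up):

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/percpu.h>

#define ZRAM_SCRATCH_ORDER 4	/* assumed 64K scratch area per CPU */

static DEFINE_PER_CPU(void *, zram_scratch_buf);

/* Called once from the driver's init path in this sketch. */
static int __init zram_scratch_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		/* Allocate the contiguous buffer up front, so the swap-in
		 * path never has to find contiguous memory under pressure. */
		void *buf = (void *)__get_free_pages(GFP_KERNEL,
						     ZRAM_SCRATCH_ORDER);

		if (!buf)
			return -ENOMEM;	/* error unwinding omitted for brevity */
		per_cpu(zram_scratch_buf, cpu) = buf;
	}
	return 0;
}

The read path would then decompress into the local CPU's buffer (with preemption disabled around the access, e.g. get_cpu()/put_cpu()) instead of allocating in the fault path.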
On Wed, Oct 30, 2024 at 2:21 PM Barry Song <21cnbao@gmail.com> wrote: > > On Thu, Oct 31, 2024 at 10:10 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > [..] > > > >>> A crucial component is still missing—managing the compression and decompression > > > >>> of multiple pages as a larger block. This could significantly reduce > > > >>> system time and > > > >>> potentially resolve the kernel build issue within a small memory > > > >>> cgroup, even with > > > >>> swap thrashing. > > > >>> > > > >>> I’ll send an update ASAP so you can rebase for zswap. > > > >> > > > >> Did you mean https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/? > > > >> Thats wont benefit zswap, right? > > > > > > > > That's right. I assume we can also make it work with zswap? > > > > > > Hopefully yes. Thats mainly why I was looking at that series, to try and find > > > a way to do something similar for zswap. > > > > I would prefer for these things to be done separately. We still need > > to evaluate the compression/decompression of large blocks. I am mainly > > concerned about having to decompress a large chunk to fault in one > > page. > > > > The obvious problems are fault latency, and wasted work having to > > consistently decompress the large chunk to take one page from it. We > > also need to decide if we'd rather split it after decompression and > > compress the parts that we didn't swap in separately. > > > > This can cause problems beyond the fault latency. Imagine the case > > where the system is under memory pressure, so we fallback to order-0 > > swapin to avoid reclaim. Now we want to decompress a chunk that used > > to be 64K. > > Yes, this could be an issue. > > We had actually tried to utilize several buffers for those partial > swap-in cases, > where the decompressed data was held in anticipation of the upcoming > swap-in. This approach could address the majority of partial swap-ins for > fallback scenarios. > > > > > We need to allocate 64K of contiguous memory for a temporary > > allocation to be able to fault a 4K page. Now we either need to: > > - Go into reclaim, which we were trying to avoid to begin with. > > - Dip into reserves to allocate the 64K as it's a temporary > > allocation. This is probably risky because under memory pressure, many > > CPUs may be doing this concurrently. > > This has been addressed by using contiguous memory that is prepared on > a per-CPU basis., search the below: > "alloc_pages() might fail, so we don't depend on allocation:" > https://lore.kernel.org/all/20241021232852.4061-1-21cnbao@gmail.com/ Thanks. I think this is reasonable but it makes it difficult to increase the size of the chunk. I would still prefer for both series to remain separate. If we want to wait for the large folio zswap loads until your series goes in to offset the thrashing that's fine, but I really think we should try to address the thrashing on its own.
On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: > > On 30/10/2024 21:01, Yosry Ahmed wrote: > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: > > >>>> I am not sure that the approach we are trying in this patch is the right way: > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and > > >>>> then the mitigation being tried here wont apply. > > >>> > > >>> Is the problem reproducible without memcg? I imagine only if the > > >>> entire system is under memory pressure. I guess we would want the same > > >>> "mitigation" either way. > > >>> > > >> What would be a good open source benchmark/workload to test without limiting memory > > >> in memcg? > > >> For the kernel build test, I can only get zswap activity to happen if I build > > >> in cgroup and limit memory.max. > > > > > > You mean a benchmark that puts the entire system under memory > > > pressure? I am not sure, it ultimately depends on the size of memory > > > you have, among other factors. > > > > > > What if you run the kernel build test in a VM? Then you can limit is > > > size like a memcg, although you'd probably need to leave more room > > > because the entire guest OS will also subject to the same limit. > > > > > > > I had tried this, but the variance in time/zswap numbers was very high. > > Much higher than the AMD numbers I posted in reply to Barry. So found > > it very difficult to make comparison. > > Hmm yeah maybe more factors come into play with global memory > pressure. I am honestly not sure how to test this scenario, and I > suspect variance will be high anyway. > > We can just try to use whatever technique we use for the memcg limit > though, if possible, right? You can boot a physical machine with mem=1G on the commandline, which restricts the physical range of memory that will be initialized. Double check /proc/meminfo after boot, because part of that physical range might not be usable RAM. I do this quite often to test physical memory pressure with workloads that don't scale up easily, like kernel builds. > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead > > >>>> issue? If we zswap (without the large folio swapin series) and change the window > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as > > >>>> well. +1 I also think there is too much focus on cgroup alone. The bigger issue seems to be how much optimistic volume we swap in when we're under pressure already. This applies to large folios and readahead; global memory availability and cgroup limits. It happens to manifest with THP in cgroups because that's what you guys are testing. But IMO, any solution to this problem should consider the wider scope. > > >>> I think large folio swapin would make the problem worse anyway. I am > > >>> also not sure if the readahead window adjusts on memory pressure or > > >>> not. > > >>> > > >> readahead window doesnt look at memory pressure. So maybe the same thing is being > > >> seen here as there would be in swapin_readahead? > > > > > > Maybe readahead is not as aggressive in general as large folio > > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order > > > of the window is the smaller of page_cluster (2 or 3) and > > > SWAP_RA_ORDER_CEILING (5). 
> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might > > be similar to enabling 32K mTHP? > > Not quite. Actually, I would expect it to be... > > > Also readahead will swapin 4k folios AFAICT, so we don't need a > > > contiguous allocation like large folio swapin. So that could be > > > another factor why readahead may not reproduce the problem. > > Because of this ^. ...this matters for the physical allocation, which might require more reclaim and compaction to produce the 32k. But an earlier version of Barry's patch did the cgroup margin fallback after the THP was already physically allocated, and it still helped. So the issue in this test scenario seems to be mostly about cgroup volume. And then 8 4k charges should be equivalent to a singular 32k charge when it comes to cgroup pressure.
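As a concrete example of the mem= approach (the bootloader syntax below is just one common way to set it; adjust for your distribution):

# e.g. in /etc/default/grub, then regenerate the grub config and reboot:
#   GRUB_CMDLINE_LINUX="... mem=1G"
# After boot, verify how much of that range is actually usable RAM:
$ grep MemTotal /proc/meminfo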
On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > On 30/10/2024 21:01, Yosry Ahmed wrote: > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > >>>> I am not sure that the approach we are trying in this patch is the right way: > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and > > > >>>> then the mitigation being tried here wont apply. > > > >>> > > > >>> Is the problem reproducible without memcg? I imagine only if the > > > >>> entire system is under memory pressure. I guess we would want the same > > > >>> "mitigation" either way. > > > >>> > > > >> What would be a good open source benchmark/workload to test without limiting memory > > > >> in memcg? > > > >> For the kernel build test, I can only get zswap activity to happen if I build > > > >> in cgroup and limit memory.max. > > > > > > > > You mean a benchmark that puts the entire system under memory > > > > pressure? I am not sure, it ultimately depends on the size of memory > > > > you have, among other factors. > > > > > > > > What if you run the kernel build test in a VM? Then you can limit is > > > > size like a memcg, although you'd probably need to leave more room > > > > because the entire guest OS will also subject to the same limit. > > > > > > > > > > I had tried this, but the variance in time/zswap numbers was very high. > > > Much higher than the AMD numbers I posted in reply to Barry. So found > > > it very difficult to make comparison. > > > > Hmm yeah maybe more factors come into play with global memory > > pressure. I am honestly not sure how to test this scenario, and I > > suspect variance will be high anyway. > > > > We can just try to use whatever technique we use for the memcg limit > > though, if possible, right? > > You can boot a physical machine with mem=1G on the commandline, which > restricts the physical range of memory that will be initialized. > Double check /proc/meminfo after boot, because part of that physical > range might not be usable RAM. > > I do this quite often to test physical memory pressure with workloads > that don't scale up easily, like kernel builds. > > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead > > > >>>> issue? If we zswap (without the large folio swapin series) and change the window > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as > > > >>>> well. > > +1 > > I also think there is too much focus on cgroup alone. The bigger issue > seems to be how much optimistic volume we swap in when we're under > pressure already. This applies to large folios and readahead; global > memory availability and cgroup limits. Agreed, although the characteristics of large folios and readahead are different. But yeah, different flavors of the same problem. > > It happens to manifest with THP in cgroups because that's what you > guys are testing. But IMO, any solution to this problem should > consider the wider scope. +1, and I really think this should be addressed separately, not just rely on large block compression/decompression to offset the cost. It's probably not just a zswap/zram problem anyway, it just happens to be what we support large folio swapin for. 
> > > > >>> I think large folio swapin would make the problem worse anyway. I am > > > >>> also not sure if the readahead window adjusts on memory pressure or > > > >>> not. > > > >>> > > > >> readahead window doesnt look at memory pressure. So maybe the same thing is being > > > >> seen here as there would be in swapin_readahead? > > > > > > > > Maybe readahead is not as aggressive in general as large folio > > > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order > > > > of the window is the smaller of page_cluster (2 or 3) and > > > > SWAP_RA_ORDER_CEILING (5). > > > Yes, I was seeing 8 pages swapin (order 3) when testing. So might > > > be similar to enabling 32K mTHP? > > > > Not quite. > > Actually, I would expect it to be... > > > > > Also readahead will swapin 4k folios AFAICT, so we don't need a > > > > contiguous allocation like large folio swapin. So that could be > > > > another factor why readahead may not reproduce the problem. > > > > Because of this ^. > > ...this matters for the physical allocation, which might require more > reclaim and compaction to produce the 32k. But an earlier version of > Barry's patch did the cgroup margin fallback after the THP was already > physically allocated, and it still helped. > > So the issue in this test scenario seems to be mostly about cgroup > volume. And then 8 4k charges should be equivalent to a singular 32k > charge when it comes to cgroup pressure. In this test scenario, yes, because it's only exercising cgroup pressure. But if we want a general solution that also addresses global pressure, I expect large folios to be worse because of the contiguity and the size (compared to default readahead window sizes). So I think we shouldn't only test with readahead, as it won't cover some of the large folio cases.
On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: > > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 30/10/2024 21:01, Yosry Ahmed wrote: > > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > >>>> I am not sure that the approach we are trying in this patch is the right way: > > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and > > > > >>>> then the mitigation being tried here wont apply. > > > > >>> > > > > >>> Is the problem reproducible without memcg? I imagine only if the > > > > >>> entire system is under memory pressure. I guess we would want the same > > > > >>> "mitigation" either way. > > > > >>> > > > > >> What would be a good open source benchmark/workload to test without limiting memory > > > > >> in memcg? > > > > >> For the kernel build test, I can only get zswap activity to happen if I build > > > > >> in cgroup and limit memory.max. > > > > > > > > > > You mean a benchmark that puts the entire system under memory > > > > > pressure? I am not sure, it ultimately depends on the size of memory > > > > > you have, among other factors. > > > > > > > > > > What if you run the kernel build test in a VM? Then you can limit is > > > > > size like a memcg, although you'd probably need to leave more room > > > > > because the entire guest OS will also subject to the same limit. > > > > > > > > > > > > > I had tried this, but the variance in time/zswap numbers was very high. > > > > Much higher than the AMD numbers I posted in reply to Barry. So found > > > > it very difficult to make comparison. > > > > > > Hmm yeah maybe more factors come into play with global memory > > > pressure. I am honestly not sure how to test this scenario, and I > > > suspect variance will be high anyway. > > > > > > We can just try to use whatever technique we use for the memcg limit > > > though, if possible, right? > > > > You can boot a physical machine with mem=1G on the commandline, which > > restricts the physical range of memory that will be initialized. > > Double check /proc/meminfo after boot, because part of that physical > > range might not be usable RAM. > > > > I do this quite often to test physical memory pressure with workloads > > that don't scale up easily, like kernel builds. > > > > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead > > > > >>>> issue? If we zswap (without the large folio swapin series) and change the window > > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > > > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as > > > > >>>> well. > > > > +1 > > > > I also think there is too much focus on cgroup alone. The bigger issue > > seems to be how much optimistic volume we swap in when we're under > > pressure already. This applies to large folios and readahead; global > > memory availability and cgroup limits. > > Agreed, although the characteristics of large folios and readahead are > different. But yeah, different flavors of the same problem. > > > > > It happens to manifest with THP in cgroups because that's what you > > guys are testing. But IMO, any solution to this problem should > > consider the wider scope. 
> > +1, and I really think this should be addressed separately, not just > rely on large block compression/decompression to offset the cost. It's > probably not just a zswap/zram problem anyway, it just happens to be > what we support large folio swapin for.

Agreed, these are two separate issues and both should be investigated, though 2 can offset the cost of 1:
1. swap thrashing
2. large block compression/decompression

For point 1, we likely want to investigate the following:

1. Whether we can see the same thrashing if we always perform readahead (rapidly filling the memcg back to full after each reclamation).

2. Whether there are any issues with balancing file and anon memory reclamation. The 'refault feedback loop' in mglru compares refault rates between anon and file pages to decide which type should be prioritized for reclamation.

type = get_type_to_scan(lruvec, swappiness, &tier);

static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
{
	...
	read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
	read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
	type = positive_ctrl_err(&sp, &pv);

	read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
	for (tier = 1; tier < MAX_NR_TIERS; tier++) {
		read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
		if (!positive_ctrl_err(&sp, &pv))
			break;
	}

	*tier_idx = tier - 1;
	return type;
}

In this case, we may want to investigate whether reclamation is primarily targeting anonymous memory due to potential errors in the statistics path after mTHP is involved.

3. Determine whether this is a memcg-specific issue by setting mem=1GB and running the same test on the global system.

Yosry, Johannes, Usama, is there anything else that might interest us? I'll get back to you after completing the investigation mentioned above.

> > > > > > > >>> I think large folio swapin would make the problem worse anyway. I am > > > > >>> also not sure if the readahead window adjusts on memory pressure or > > > > >>> not. > > > > >>> > > > > >> readahead window doesnt look at memory pressure. So maybe the same thing is being > > > > >> seen here as there would be in swapin_readahead? > > > > > > > > > > Maybe readahead is not as aggressive in general as large folio > > > > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order > > > > > of the window is the smaller of page_cluster (2 or 3) and > > > > > SWAP_RA_ORDER_CEILING (5). > > > Yes, I was seeing 8 pages swapin (order 3) when testing. So might > > > be similar to enabling 32K mTHP? > > > > Not quite. > > Actually, I would expect it to be... > > > > > Also readahead will swapin 4k folios AFAICT, so we don't need a > > > > contiguous allocation like large folio swapin. So that could be > > > > another factor why readahead may not reproduce the problem. > > > > Because of this ^. > > ...this matters for the physical allocation, which might require more > reclaim and compaction to produce the 32k. But an earlier version of > Barry's patch did the cgroup margin fallback after the THP was already > physically allocated, and it still helped. > > So the issue in this test scenario seems to be mostly about cgroup > volume. And then 8 4k charges should be equivalent to a singular 32k > charge when it comes to cgroup pressure. > > In this test scenario, yes, because it's only exercising cgroup > pressure.
But if we want a general solution that also addresses global > pressure, I expect large folios to be worse because of the contiguity > and the size (compared to default readahead window sizes). So I think > we shouldn't only test with readahead, as it won't cover some of the > large folio cases. Thanks barry
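To make point 2 of the investigation list above a bit more concrete, the comparison the mglru feedback loop performs is conceptually something like the sketch below. This is a simplification, not the actual read_ctrl_pos()/positive_ctrl_err() code in mm/vmscan.c; the struct layout and helper names here are made up, and the real code additionally weights both sides by a swappiness-derived gain and works per tier.

/* refault rate = refaults / evictions for one LRU type (anon or file) */
struct refault_stats {
	unsigned long refaulted;
	unsigned long evicted;
};

/*
 * Conceptual sketch of the feedback loop: prefer reclaiming the type
 * whose evicted pages refault less often, since evicting it causes the
 * least rework. Cross-multiplied to avoid division; the +1 guards
 * against empty denominators.
 */
static int type_to_scan(const struct refault_stats *anon,
			const struct refault_stats *file)
{
	bool anon_cheaper =
		anon->refaulted * (file->evicted + 1) <=
		file->refaulted * (anon->evicted + 1);

	return anon_cheaper ? 0 /* LRU_GEN_ANON */ : 1 /* LRU_GEN_FILE */;
}

If the refault statistics are skewed once mTHP is involved, this comparison would keep steering reclaim toward anon, which is one way the thrashing could be amplified.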
On Thu, Oct 31, 2024 at 2:00 PM Barry Song <21cnbao@gmail.com> wrote: > > On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: > > > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > On 30/10/2024 21:01, Yosry Ahmed wrote: > > > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > >>>> I am not sure that the approach we are trying in this patch is the right way: > > > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and > > > > > >>>> then the mitigation being tried here wont apply. > > > > > >>> > > > > > >>> Is the problem reproducible without memcg? I imagine only if the > > > > > >>> entire system is under memory pressure. I guess we would want the same > > > > > >>> "mitigation" either way. > > > > > >>> > > > > > >> What would be a good open source benchmark/workload to test without limiting memory > > > > > >> in memcg? > > > > > >> For the kernel build test, I can only get zswap activity to happen if I build > > > > > >> in cgroup and limit memory.max. > > > > > > > > > > > > You mean a benchmark that puts the entire system under memory > > > > > > pressure? I am not sure, it ultimately depends on the size of memory > > > > > > you have, among other factors. > > > > > > > > > > > > What if you run the kernel build test in a VM? Then you can limit is > > > > > > size like a memcg, although you'd probably need to leave more room > > > > > > because the entire guest OS will also subject to the same limit. > > > > > > > > > > > > > > > > I had tried this, but the variance in time/zswap numbers was very high. > > > > > Much higher than the AMD numbers I posted in reply to Barry. So found > > > > > it very difficult to make comparison. > > > > > > > > Hmm yeah maybe more factors come into play with global memory > > > > pressure. I am honestly not sure how to test this scenario, and I > > > > suspect variance will be high anyway. > > > > > > > > We can just try to use whatever technique we use for the memcg limit > > > > though, if possible, right? > > > > > > You can boot a physical machine with mem=1G on the commandline, which > > > restricts the physical range of memory that will be initialized. > > > Double check /proc/meminfo after boot, because part of that physical > > > range might not be usable RAM. > > > > > > I do this quite often to test physical memory pressure with workloads > > > that don't scale up easily, like kernel builds. > > > > > > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead > > > > > >>>> issue? If we zswap (without the large folio swapin series) and change the window > > > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > > > > > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as > > > > > >>>> well. > > > > > > +1 > > > > > > I also think there is too much focus on cgroup alone. The bigger issue > > > seems to be how much optimistic volume we swap in when we're under > > > pressure already. This applies to large folios and readahead; global > > > memory availability and cgroup limits. > > > > Agreed, although the characteristics of large folios and readahead are > > different. But yeah, different flavors of the same problem. 
> > > > > > > > It happens to manifest with THP in cgroups because that's what you > > > guys are testing. But IMO, any solution to this problem should > > > consider the wider scope. > > > > +1, and I really think this should be addressed separately, not just > > rely on large block compression/decompression to offset the cost. It's > > probably not just a zswap/zram problem anyway, it just happens to be > > what we support large folio swapin for. > > Agreed these are two separate issues and should be both investigated > though 2 can offset the cost of 1. > 1. swap thrashing > 2. large block compression/decompression > > For point 1, we likely want to investigate the following: > > 1. if we can see the same thrashing if we always perform readahead > (rapidly filling > the memcg to full again after reclamation). > > 2. Whether there are any issues with balancing file and anon memory > reclamation. > > The 'refault feedback loop' in mglru compares refault rates between anon and > file pages to decide which type should be prioritized for reclamation. > > type = get_type_to_scan(lruvec, swappiness, &tier); > > static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int > *tier_idx) > { > ... > read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp); > read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv); > type = positive_ctrl_err(&sp, &pv); > > read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp); > for (tier = 1; tier < MAX_NR_TIERS; tier++) { > read_ctrl_pos(lruvec, type, tier, gain[type], &pv); > if (!positive_ctrl_err(&sp, &pv)) > break; > } > > *tier_idx = tier - 1; > return type; > } > > In this case, we may want to investigate whether reclamation is primarily > targeting anonymous memory due to potential errors in the statistics path > after mTHP is involved. > > 3. Determine if this is a memcg-specific issue by setting mem=1GB and > running the same test on the global system. > > Yosry, Johannes, Usama, > Is there anything else that might interest us? > > I'll get back to you after completing the investigation mentioned above. Thanks for looking into this. Perhaps a naive question, but is this only related to swap faults? Can the same scenario happen with other types of faults allocating large folios (e.g. faulting in a file page, or a new anon allocation)? Do swap faults use a different policy for determining the folio order, or is it just the swap faults are naturally more correlated to memory pressure, so that's how the issue was surfaced?
Johannes Weiner <hannes@cmpxchg.org> writes: > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: >> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: >> > On 30/10/2024 21:01, Yosry Ahmed wrote: >> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: >> > >>>> I am not sure that the approach we are trying in this patch is the right way: >> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and >> > >>>> then the mitigation being tried here wont apply. >> > >>> >> > >>> Is the problem reproducible without memcg? I imagine only if the >> > >>> entire system is under memory pressure. I guess we would want the same >> > >>> "mitigation" either way. >> > >>> >> > >> What would be a good open source benchmark/workload to test without limiting memory >> > >> in memcg? >> > >> For the kernel build test, I can only get zswap activity to happen if I build >> > >> in cgroup and limit memory.max. >> > > >> > > You mean a benchmark that puts the entire system under memory >> > > pressure? I am not sure, it ultimately depends on the size of memory >> > > you have, among other factors. >> > > >> > > What if you run the kernel build test in a VM? Then you can limit is >> > > size like a memcg, although you'd probably need to leave more room >> > > because the entire guest OS will also subject to the same limit. >> > > >> > >> > I had tried this, but the variance in time/zswap numbers was very high. >> > Much higher than the AMD numbers I posted in reply to Barry. So found >> > it very difficult to make comparison. >> >> Hmm yeah maybe more factors come into play with global memory >> pressure. I am honestly not sure how to test this scenario, and I >> suspect variance will be high anyway. >> >> We can just try to use whatever technique we use for the memcg limit >> though, if possible, right? > > You can boot a physical machine with mem=1G on the commandline, which > restricts the physical range of memory that will be initialized. > Double check /proc/meminfo after boot, because part of that physical > range might not be usable RAM. > > I do this quite often to test physical memory pressure with workloads > that don't scale up easily, like kernel builds. > >> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead >> > >>>> issue? If we zswap (without the large folio swapin series) and change the window >> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time >> > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as >> > >>>> well. > > +1 > > I also think there is too much focus on cgroup alone. The bigger issue > seems to be how much optimistic volume we swap in when we're under > pressure already. This applies to large folios and readahead; global > memory availability and cgroup limits. The current swap readahead logic is something like, 1. try readahead some pages for sequential access pattern, mark them as readahead 2. if these readahead pages get accessed before swapped out again, increase 'hits' counter 3. for next swap in, try readahead 'hits' pages and clear 'hits'. So, if there's heavy memory pressure, the readaheaded pages will not be accessed before being swapped out again (in 2 above), the readahead pages will be minimal. IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in addition to the pages accessed are swapped in, the adjacent pages are swapped in (swap readahead) too. 
If these readahead pages are not accessed before being swapped out again, the system runs into more severe thrashing. This is because we lack the swap readahead window scaling mechanism described above. And this is why I suggested before combining the swap readahead mechanism and mTHP swap-in by default: when the kernel swaps in a page, it checks the current swap readahead window and decides the mTHP order according to the window size. So, if there is heavy memory pressure, such that the nearby pages will not be accessed before being swapped out again, the mTHP swap-in order can be adjusted automatically.
> It happens to manifest with THP in cgroups because that's what you > guys are testing. But IMO, any solution to this problem should > consider the wider scope. > >> > >>> I think large folio swapin would make the problem worse anyway. I am >> > >>> also not sure if the readahead window adjusts on memory pressure or >> > >>> not. >> > >>> >> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being >> > >> seen here as there would be in swapin_readahead? >> > > >> > > Maybe readahead is not as aggressive in general as large folio >> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order >> > > of the window is the smaller of page_cluster (2 or 3) and >> > > SWAP_RA_ORDER_CEILING (5). >> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might >> > be similar to enabling 32K mTHP? >> >> Not quite. > > Actually, I would expect it to be... Me too. >> > > Also readahead will swapin 4k folios AFAICT, so we don't need a >> > > contiguous allocation like large folio swapin. So that could be >> > > another factor why readahead may not reproduce the problem. >> >> Because of this ^. > > ...this matters for the physical allocation, which might require more > reclaim and compaction to produce the 32k. But an earlier version of > Barry's patch did the cgroup margin fallback after the THP was already > physically allocated, and it still helped. > > So the issue in this test scenario seems to be mostly about cgroup > volume. And then 8 4k charges should be equivalent to a singular 32k > charge when it comes to cgroup pressure. -- Best Regards, Huang, Ying
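For readers who have not looked at that code, the hits-based scaling described in steps 1-3 is roughly the following. This is a simplified sketch, not the actual mm/swap_state.c implementation; the function and variable names are made up.

/* Sketch only: a single counter standing in for step 2 above. */
static unsigned int swapin_ra_hits;

/* Step 2: a page brought in by readahead was touched before being
 * reclaimed again, so readahead is currently paying off. */
static void swapin_ra_note_hit(void)
{
	swapin_ra_hits++;
}

/* Step 3: size the next readahead window from recent hits and reset.
 * Under heavy pressure the readahead pages get evicted untouched, hits
 * stays at 0, and the window collapses to the single faulting page. */
static unsigned int swapin_ra_window(unsigned int max_win)
{
	unsigned int win = swapin_ra_hits + 1;

	swapin_ra_hits = 0;
	return win < max_win ? win : max_win;
}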
On Mon, Nov 4, 2024 at 7:46 PM Huang, Ying <ying.huang@intel.com> wrote: > > Johannes Weiner <hannes@cmpxchg.org> writes: > > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: > >> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: > >> > On 30/10/2024 21:01, Yosry Ahmed wrote: > >> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: > >> > >>>> I am not sure that the approach we are trying in this patch is the right way: > >> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and > >> > >>>> then the mitigation being tried here wont apply. > >> > >>> > >> > >>> Is the problem reproducible without memcg? I imagine only if the > >> > >>> entire system is under memory pressure. I guess we would want the same > >> > >>> "mitigation" either way. > >> > >>> > >> > >> What would be a good open source benchmark/workload to test without limiting memory > >> > >> in memcg? > >> > >> For the kernel build test, I can only get zswap activity to happen if I build > >> > >> in cgroup and limit memory.max. > >> > > > >> > > You mean a benchmark that puts the entire system under memory > >> > > pressure? I am not sure, it ultimately depends on the size of memory > >> > > you have, among other factors. > >> > > > >> > > What if you run the kernel build test in a VM? Then you can limit is > >> > > size like a memcg, although you'd probably need to leave more room > >> > > because the entire guest OS will also subject to the same limit. > >> > > > >> > > >> > I had tried this, but the variance in time/zswap numbers was very high. > >> > Much higher than the AMD numbers I posted in reply to Barry. So found > >> > it very difficult to make comparison. > >> > >> Hmm yeah maybe more factors come into play with global memory > >> pressure. I am honestly not sure how to test this scenario, and I > >> suspect variance will be high anyway. > >> > >> We can just try to use whatever technique we use for the memcg limit > >> though, if possible, right? > > > > You can boot a physical machine with mem=1G on the commandline, which > > restricts the physical range of memory that will be initialized. > > Double check /proc/meminfo after boot, because part of that physical > > range might not be usable RAM. > > > > I do this quite often to test physical memory pressure with workloads > > that don't scale up easily, like kernel builds. > > > >> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead > >> > >>>> issue? If we zswap (without the large folio swapin series) and change the window > >> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > >> > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as > >> > >>>> well. > > > > +1 > > > > I also think there is too much focus on cgroup alone. The bigger issue > > seems to be how much optimistic volume we swap in when we're under > > pressure already. This applies to large folios and readahead; global > > memory availability and cgroup limits. > > The current swap readahead logic is something like, > > 1. try readahead some pages for sequential access pattern, mark them as > readahead > > 2. if these readahead pages get accessed before swapped out again, > increase 'hits' counter > > 3. for next swap in, try readahead 'hits' pages and clear 'hits'. 
> > So, if there's heavy memory pressure, the readaheaded pages will not be > accessed before being swapped out again (in 2 above), the readahead > pages will be minimal. > > IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in > addition to the pages accessed are swapped in, the adjacent pages are > swapped in (swap readahead) too. If these readahead pages are not > accessed before swapped out again, system runs into more severe > thrashing. This is because we lack the swap readahead window scaling > mechanism as above. And, this is why I suggested to combine the swap > readahead mechanism and mTHP swap-in by default before. That is, when > kernel swaps in a page, it checks current swap readahead window, and > decides mTHP order according to window size. So, if there are heavy > memory pressure, so that the nearby pages will not be accessed before > being swapped out again, the mTHP swap-in order can be adjusted > automatically.

This might help reduce memory reclamation thrashing for a kernel build workload running in a memory-limited memcg, which might not benefit from mTHP that much. But this mechanism has clear disadvantages:

1. Loss of large folios: For example, if you're using app A and then switch to app B, a significant portion (around 60%) of A's memory might be swapped out as it moves to the background. When you switch back to app A, a large portion of the memory originally in mTHP could be lost while swapping in small folios. Essentially, systems with a user interface operate quite differently from kernel build workloads running under a memory-limited memcg, as they switch applications between the foreground and background.

2. Fragmentation of swap slots: This fragmentation increases the likelihood of mTHP swapout failures, as it makes it harder to maintain contiguous memory blocks in swap.

3. It prevents the implementation of large block compression and decompression, which would achieve higher compression ratios and significantly lower CPU consumption, as small folio swap-ins may still remain the predominant approach.

Memory-limited systems often face challenges with larger page sizes. Even on systems that support various base page sizes, such as 4KB, 16KB, and 64KB on ARM64, using 16KB or 64KB as the base page size is not always the best choice. With mTHP, we have already enabled per-size settings. For this kernel build workload operating within a limited memcg, enabling only 16kB is likely the best option for optimizing performance and minimizing thrashing:
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled

We could focus on mTHP and seek strategies to minimize thrashing when free memory is severely limited :-)

> > > It happens to manifest with THP in cgroups because that's what you > > guys are testing. But IMO, any solution to this problem should > > consider the wider scope. > > > >> > >>> I think large folio swapin would make the problem worse anyway. I am > >> > >>> also not sure if the readahead window adjusts on memory pressure or > >> > >>> not. > >> > >>> > >> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being > >> > >> seen here as there would be in swapin_readahead? > >> > > > >> > > Maybe readahead is not as aggressive in general as large folio > >> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order > >> > > of the window is the smaller of page_cluster (2 or 3) and > >> > > SWAP_RA_ORDER_CEILING (5). > >> > Yes, I was seeing 8 pages swapin (order 3) when testing.
So might > >> > be similar to enabling 32K mTHP? > >> > >> Not quite. > > > > Actually, I would expect it to be... > > Me too. > > >> > > Also readahead will swapin 4k folios AFAICT, so we don't need a > >> > > contiguous allocation like large folio swapin. So that could be > >> > > another factor why readahead may not reproduce the problem. > >> > >> Because of this ^. > > > > ...this matters for the physical allocation, which might require more > > reclaim and compaction to produce the 32k. But an earlier version of > > Barry's patch did the cgroup margin fallback after the THP was already > > physically allocated, and it still helped. > > > > So the issue in this test scenario seems to be mostly about cgroup > > volume. And then 8 4k charges should be equivalent to a singular 32k > > charge when it comes to cgroup pressure. > > -- > Best Regards, > Huang, Ying Thanks Barry
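For completeness, the per-size configuration suggested above would be along these lines (the sysfs paths follow the usual layout; which sizes exist depends on the kernel and base page size):

$ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled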
On 04/11/2024 06:42, Huang, Ying wrote: > Johannes Weiner <hannes@cmpxchg.org> writes: > >> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: >>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>> On 30/10/2024 21:01, Yosry Ahmed wrote: >>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>>>> I am not sure that the approach we are trying in this patch is the right way: >>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and >>>>>>>> then the mitigation being tried here wont apply. >>>>>>> >>>>>>> Is the problem reproducible without memcg? I imagine only if the >>>>>>> entire system is under memory pressure. I guess we would want the same >>>>>>> "mitigation" either way. >>>>>>> >>>>>> What would be a good open source benchmark/workload to test without limiting memory >>>>>> in memcg? >>>>>> For the kernel build test, I can only get zswap activity to happen if I build >>>>>> in cgroup and limit memory.max. >>>>> >>>>> You mean a benchmark that puts the entire system under memory >>>>> pressure? I am not sure, it ultimately depends on the size of memory >>>>> you have, among other factors. >>>>> >>>>> What if you run the kernel build test in a VM? Then you can limit is >>>>> size like a memcg, although you'd probably need to leave more room >>>>> because the entire guest OS will also subject to the same limit. >>>>> >>>> >>>> I had tried this, but the variance in time/zswap numbers was very high. >>>> Much higher than the AMD numbers I posted in reply to Barry. So found >>>> it very difficult to make comparison. >>> >>> Hmm yeah maybe more factors come into play with global memory >>> pressure. I am honestly not sure how to test this scenario, and I >>> suspect variance will be high anyway. >>> >>> We can just try to use whatever technique we use for the memcg limit >>> though, if possible, right? >> >> You can boot a physical machine with mem=1G on the commandline, which >> restricts the physical range of memory that will be initialized. >> Double check /proc/meminfo after boot, because part of that physical >> range might not be usable RAM. >> >> I do this quite often to test physical memory pressure with workloads >> that don't scale up easily, like kernel builds. >> >>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead >>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window >>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time >>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as >>>>>>>> well. >> >> +1 >> >> I also think there is too much focus on cgroup alone. The bigger issue >> seems to be how much optimistic volume we swap in when we're under >> pressure already. This applies to large folios and readahead; global >> memory availability and cgroup limits. > > The current swap readahead logic is something like, > > 1. try readahead some pages for sequential access pattern, mark them as > readahead > > 2. if these readahead pages get accessed before swapped out again, > increase 'hits' counter > > 3. for next swap in, try readahead 'hits' pages and clear 'hits'. > > So, if there's heavy memory pressure, the readaheaded pages will not be > accessed before being swapped out again (in 2 above), the readahead > pages will be minimal. > > IMHO, mTHP swap-in is kind of swap readahead in effect. 
That is, in > addition to the pages accessed are swapped in, the adjacent pages are > swapped in (swap readahead) too. If these readahead pages are not > accessed before swapped out again, system runs into more severe > thrashing. This is because we lack the swap readahead window scaling > mechanism as above. And, this is why I suggested to combine the swap > readahead mechanism and mTHP swap-in by default before. That is, when > kernel swaps in a page, it checks current swap readahead window, and > decides mTHP order according to window size. So, if there are heavy > memory pressure, so that the nearby pages will not be accessed before > being swapped out again, the mTHP swap-in order can be adjusted > automatically.

This is a good idea, but I think the issue is that readahead is a folio flag and not a page flag, so it only works when the folio size is 1.

In the swapin_readahead swapcache path, the current implementation decides the ra_window based on hits, which is incremented in swap_cache_get_folio if the folio has not been gotten from the swapcache before. The problem would be that we need information on how many distinct pages in a large folio that has been swapped in have been accessed to decide the hits/window size, which I don't think is possible, as once the entire large folio has been swapped in, we won't get a fault.

> >> It happens to manifest with THP in cgroups because that's what you >> guys are testing. But IMO, any solution to this problem should >> consider the wider scope. >> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am >>>>>>>> also not sure if the readahead window adjusts on memory pressure or >>>>>>>> not. >>>>>>>> >>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being >>>>>>> seen here as there would be in swapin_readahead? >>>>>> >>>>>> Maybe readahead is not as aggressive in general as large folio >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order >>>>>> of the window is the smaller of page_cluster (2 or 3) and >>>>>> SWAP_RA_ORDER_CEILING (5). >>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might >>>>> be similar to enabling 32K mTHP? >>>> >>>> Not quite. >>> >>> Actually, I would expect it to be... >> >> Me too. >> >>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a >>>>>> contiguous allocation like large folio swapin. So that could be >>>>>> another factor why readahead may not reproduce the problem. >>>> >>>> Because of this ^. >>> >>> ...this matters for the physical allocation, which might require more >>> reclaim and compaction to produce the 32k. But an earlier version of >>> Barry's patch did the cgroup margin fallback after the THP was already >>> physically allocated, and it still helped. >>> >>> So the issue in this test scenario seems to be mostly about cgroup >>> volume. And then 8 4k charges should be equivalent to a singular 32k >>> charge when it comes to cgroup pressure. > > -- > Best Regards, > Huang, Ying
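A rough sketch of what Usama is pointing at, in terms of the earlier made-up helper swapin_ra_note_hit() (the real accounting lives in swap_cache_get_folio() and is more involved than this):

/*
 * Sketch only: the readahead marker is a single per-folio flag, so at
 * most one "hit" can ever be recorded for a large folio, no matter how
 * many of its pages are touched later. The feedback loop therefore
 * only really measures order-0 folios.
 */
static void note_swapcache_hit(struct folio *folio)
{
	if (folio_test_clear_readahead(folio))
		swapin_ra_note_hit();
}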
Barry Song <21cnbao@gmail.com> writes: > On Mon, Nov 4, 2024 at 7:46 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Johannes Weiner <hannes@cmpxchg.org> writes: >> >> > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: >> >> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: >> >> > On 30/10/2024 21:01, Yosry Ahmed wrote: >> >> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: >> >> > >>>> I am not sure that the approach we are trying in this patch is the right way: >> >> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and >> >> > >>>> then the mitigation being tried here wont apply. >> >> > >>> >> >> > >>> Is the problem reproducible without memcg? I imagine only if the >> >> > >>> entire system is under memory pressure. I guess we would want the same >> >> > >>> "mitigation" either way. >> >> > >>> >> >> > >> What would be a good open source benchmark/workload to test without limiting memory >> >> > >> in memcg? >> >> > >> For the kernel build test, I can only get zswap activity to happen if I build >> >> > >> in cgroup and limit memory.max. >> >> > > >> >> > > You mean a benchmark that puts the entire system under memory >> >> > > pressure? I am not sure, it ultimately depends on the size of memory >> >> > > you have, among other factors. >> >> > > >> >> > > What if you run the kernel build test in a VM? Then you can limit is >> >> > > size like a memcg, although you'd probably need to leave more room >> >> > > because the entire guest OS will also subject to the same limit. >> >> > > >> >> > >> >> > I had tried this, but the variance in time/zswap numbers was very high. >> >> > Much higher than the AMD numbers I posted in reply to Barry. So found >> >> > it very difficult to make comparison. >> >> >> >> Hmm yeah maybe more factors come into play with global memory >> >> pressure. I am honestly not sure how to test this scenario, and I >> >> suspect variance will be high anyway. >> >> >> >> We can just try to use whatever technique we use for the memcg limit >> >> though, if possible, right? >> > >> > You can boot a physical machine with mem=1G on the commandline, which >> > restricts the physical range of memory that will be initialized. >> > Double check /proc/meminfo after boot, because part of that physical >> > range might not be usable RAM. >> > >> > I do this quite often to test physical memory pressure with workloads >> > that don't scale up easily, like kernel builds. >> > >> >> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead >> >> > >>>> issue? If we zswap (without the large folio swapin series) and change the window >> >> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time >> >> > >>>> when cgroup memory is limited as readahead would probably cause swap thrashing as >> >> > >>>> well. >> > >> > +1 >> > >> > I also think there is too much focus on cgroup alone. The bigger issue >> > seems to be how much optimistic volume we swap in when we're under >> > pressure already. This applies to large folios and readahead; global >> > memory availability and cgroup limits. >> >> The current swap readahead logic is something like, >> >> 1. try readahead some pages for sequential access pattern, mark them as >> readahead >> >> 2. if these readahead pages get accessed before swapped out again, >> increase 'hits' counter >> >> 3. for next swap in, try readahead 'hits' pages and clear 'hits'. 
>> >> So, if there's heavy memory pressure, the readaheaded pages will not be >> accessed before being swapped out again (in 2 above), the readahead >> pages will be minimal. >> >> IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in >> addition to the pages accessed are swapped in, the adjacent pages are >> swapped in (swap readahead) too. If these readahead pages are not >> accessed before swapped out again, system runs into more severe >> thrashing. This is because we lack the swap readahead window scaling >> mechanism as above. And, this is why I suggested to combine the swap >> readahead mechanism and mTHP swap-in by default before. That is, when >> kernel swaps in a page, it checks current swap readahead window, and >> decides mTHP order according to window size. So, if there are heavy >> memory pressure, so that the nearby pages will not be accessed before >> being swapped out again, the mTHP swap-in order can be adjusted >> automatically. > > This might help reduce memory reclamation thrashing for kernel build > workload running in a memory limited memcg which might not benefit > from mTHP that much. But this mechanism has clear disadvantages: > > 1. Loss of large folios: For example, if you're using app A and then switch > to app B, a significant portion (around 60%) of A's memory might be swapped > out as it moves to the background. When you switch back to app A, a large > portion of the memory originally in mTHP could be lost while swapping > in small folios. Why? Most app A pages could be swapped in as mTHP if all readahead pages are accessed before being swapped out again. > Essentially, systems with a user interface operate quite differently from kernel > build workloads running under a memory-limited memcg, as they switch > applications between the foreground and background. > > 2. Fragmentation of swap slots: This fragmentation increases the likelihood of > mTHP swapout failures, as it makes it harder to maintain contiguous memory > blocks in swap. > > 3. Prevent the implementation of large block compression and decompression > to achieve higher compression ratios and significantly lower CPU consumption, > as small folio swap-ins may still remain the predominant approach. > > Memory-limited systems often face challenges with larger page sizes. Even on > systems that support various base page sizes, such as 4KB, 16KB, and 64KB > on ARM64, using 16KB or 64KB as the base page size is not always the best > choice. With mTHP, we have already enabled per-size settings. For this kernel > build workload operating within a limited memcg, enabling only 16kB is likely > the best option for optimizing performance and minimizing thrashing. > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled > > We could focus on mTHP and seek strategies to minimize thrashing when free > memory is severely limited :-) IIUC, mTHP can improve performance but may waste memory too. The best mTHP order may be different for different workloads and system states (e.g. high memory pressure). So, we need a way to identify the best mTHP order automatically. Swap readahead window size auto-scaling can be used to design a method to identify the best mTHP order for swap-in. I'm open to other possible methods too. I admit a fixed mTHP order can be best for some specific workloads or system states. However, it may be hard to find one mTHP order that works well for all workloads and system states. >> > It happens to manifest with THP in cgroups because that's what you >> > guys are testing. 
But IMO, any solution to this problem should >> > consider the wider scope. >> > >> >> > >>> I think large folio swapin would make the problem worse anyway. I am >> >> > >>> also not sure if the readahead window adjusts on memory pressure or >> >> > >>> not. >> >> > >>> >> >> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being >> >> > >> seen here as there would be in swapin_readahead? >> >> > > >> >> > > Maybe readahead is not as aggressive in general as large folio >> >> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order >> >> > > of the window is the smaller of page_cluster (2 or 3) and >> >> > > SWAP_RA_ORDER_CEILING (5). >> >> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might >> >> > be similar to enabling 32K mTHP? >> >> >> >> Not quite. >> > >> > Actually, I would expect it to be... >> >> Me too. >> >> >> > > Also readahead will swapin 4k folios AFAICT, so we don't need a >> >> > > contiguous allocation like large folio swapin. So that could be >> >> > > another factor why readahead may not reproduce the problem. >> >> >> >> Because of this ^. >> > >> > ...this matters for the physical allocation, which might require more >> > reclaim and compaction to produce the 32k. But an earlier version of >> > Barry's patch did the cgroup margin fallback after the THP was already >> > physically allocated, and it still helped. >> > >> > So the issue in this test scenario seems to be mostly about cgroup >> > volume. And then 8 4k charges should be equivalent to a singular 32k >> > charge when it comes to cgroup pressure. >> -- Best Regards, Huang, Ying
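To illustrate the proposal of deriving the mTHP swap-in order from the readahead window, the selection could conceptually look like the following. This is purely a sketch of the idea; no such interface exists today and all names are made up.

/*
 * Sketch only: pick a swap-in folio order from the current readahead
 * window, capped by the largest order the admin has enabled. A window
 * of 1 (no recent readahead hits, i.e. heavy pressure) degenerates to
 * order-0; a window of 8 allows up to order-3 (32KB with 4KB pages).
 */
static unsigned int swapin_order_from_window(unsigned int window,
					      unsigned int max_enabled_order)
{
	unsigned int order = 0;

	while ((2u << order) <= window && order < max_enabled_order)
		order++;

	return order;
}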
Usama Arif <usamaarif642@gmail.com> writes: > On 04/11/2024 06:42, Huang, Ying wrote: >> Johannes Weiner <hannes@cmpxchg.org> writes: >> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote: >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way: >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and >>>>>>>>> then the mitigation being tried here wont apply. >>>>>>>> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the >>>>>>>> entire system is under memory pressure. I guess we would want the same >>>>>>>> "mitigation" either way. >>>>>>>> >>>>>>> What would be a good open source benchmark/workload to test without limiting memory >>>>>>> in memcg? >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build >>>>>>> in cgroup and limit memory.max. >>>>>> >>>>>> You mean a benchmark that puts the entire system under memory >>>>>> pressure? I am not sure, it ultimately depends on the size of memory >>>>>> you have, among other factors. >>>>>> >>>>>> What if you run the kernel build test in a VM? Then you can limit is >>>>>> size like a memcg, although you'd probably need to leave more room >>>>>> because the entire guest OS will also subject to the same limit. >>>>>> >>>>> >>>>> I had tried this, but the variance in time/zswap numbers was very high. >>>>> Much higher than the AMD numbers I posted in reply to Barry. So found >>>>> it very difficult to make comparison. >>>> >>>> Hmm yeah maybe more factors come into play with global memory >>>> pressure. I am honestly not sure how to test this scenario, and I >>>> suspect variance will be high anyway. >>>> >>>> We can just try to use whatever technique we use for the memcg limit >>>> though, if possible, right? >>> >>> You can boot a physical machine with mem=1G on the commandline, which >>> restricts the physical range of memory that will be initialized. >>> Double check /proc/meminfo after boot, because part of that physical >>> range might not be usable RAM. >>> >>> I do this quite often to test physical memory pressure with workloads >>> that don't scale up easily, like kernel builds. >>> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time >>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as >>>>>>>>> well. >>> >>> +1 >>> >>> I also think there is too much focus on cgroup alone. The bigger issue >>> seems to be how much optimistic volume we swap in when we're under >>> pressure already. This applies to large folios and readahead; global >>> memory availability and cgroup limits. >> >> The current swap readahead logic is something like, >> >> 1. try readahead some pages for sequential access pattern, mark them as >> readahead >> >> 2. if these readahead pages get accessed before swapped out again, >> increase 'hits' counter >> >> 3. for next swap in, try readahead 'hits' pages and clear 'hits'. >> >> So, if there's heavy memory pressure, the readaheaded pages will not be >> accessed before being swapped out again (in 2 above), the readahead >> pages will be minimal. 
>> >> IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in >> addition to the pages accessed are swapped in, the adjacent pages are >> swapped in (swap readahead) too. If these readahead pages are not >> accessed before swapped out again, system runs into more severe >> thrashing. This is because we lack the swap readahead window scaling >> mechanism as above. And, this is why I suggested to combine the swap >> readahead mechanism and mTHP swap-in by default before. That is, when >> kernel swaps in a page, it checks current swap readahead window, and >> decides mTHP order according to window size. So, if there are heavy >> memory pressure, so that the nearby pages will not be accessed before >> being swapped out again, the mTHP swap-in order can be adjusted >> automatically. > > This is a good idea to do, but I think the issue is that readahead > is a folio flag and not a page flag, so only works when folio size is 1. > > In the swapin_readahead swapcache path, the current implementation decides > the ra_window based on hits, which is incremented in swap_cache_get_folio > if it has not been gotten from swapcache before. > The problem would be that we need information on how many distinct pages in > a large folio that has been swapped in have been accessed to decide the > hits/window size, which I don't think is possible. As once the entire large > folio has been swapped in, we won't get a fault. > To do that, we need to move readahead flag to per-page from per-folio. And we need to map only the accessed page of the folio in page fault handler. This may impact performance. So, we may only do that for sampled folios only, for example, every 100 folios. >> >>> It happens to manifest with THP in cgroups because that's what you >>> guys are testing. But IMO, any solution to this problem should >>> consider the wider scope. >>> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am >>>>>>>> also not sure if the readahead window adjusts on memory pressure or >>>>>>>> not. >>>>>>>> >>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being >>>>>>> seen here as there would be in swapin_readahead? >>>>>> >>>>>> Maybe readahead is not as aggressive in general as large folio >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order >>>>>> of the window is the smaller of page_cluster (2 or 3) and >>>>>> SWAP_RA_ORDER_CEILING (5). >>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might >>>>> be similar to enabling 32K mTHP? >>>> >>>> Not quite. >>> >>> Actually, I would expect it to be... >> >> Me too. >> >>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a >>>>>> contiguous allocation like large folio swapin. So that could be >>>>>> another factor why readahead may not reproduce the problem. >>>> >>>> Because of this ^. >>> >>> ...this matters for the physical allocation, which might require more >>> reclaim and compaction to produce the 32k. But an earlier version of >>> Barry's patch did the cgroup margin fallback after the THP was already >>> physically allocated, and it still helped. >>> >>> So the issue in this test scenario seems to be mostly about cgroup >>> volume. And then 8 4k charges should be equivalent to a singular 32k >>> charge when it comes to cgroup pressure. -- Best Regards, Huang, Ying
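As a very rough illustration of the sampling idea (hypothetical; there is no such mechanism today, and the 1-in-100 rate is just the example given above):

/*
 * Sketch only: for a small sample of swapped-in large folios, map just
 * the page that actually faulted and leave the sibling pages unmapped.
 * Later accesses to those siblings then fault again and can feed the
 * per-page readahead statistics, at a small extra fault cost.
 */
static bool swapin_sample_this_folio(void)
{
	static unsigned int nr_swapped_in;

	return (++nr_swapped_in % 100) == 0;	/* e.g. every 100th folio */
}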
On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying <ying.huang@intel.com> wrote: > > Usama Arif <usamaarif642@gmail.com> writes: > > > On 04/11/2024 06:42, Huang, Ying wrote: > >> Johannes Weiner <hannes@cmpxchg.org> writes: > >> > >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote: > >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote: > >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote: > >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote: > >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way: > >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and > >>>>>>>>> then the mitigation being tried here wont apply. > >>>>>>>> > >>>>>>>> Is the problem reproducible without memcg? I imagine only if the > >>>>>>>> entire system is under memory pressure. I guess we would want the same > >>>>>>>> "mitigation" either way. > >>>>>>>> > >>>>>>> What would be a good open source benchmark/workload to test without limiting memory > >>>>>>> in memcg? > >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build > >>>>>>> in cgroup and limit memory.max. > >>>>>> > >>>>>> You mean a benchmark that puts the entire system under memory > >>>>>> pressure? I am not sure, it ultimately depends on the size of memory > >>>>>> you have, among other factors. > >>>>>> > >>>>>> What if you run the kernel build test in a VM? Then you can limit is > >>>>>> size like a memcg, although you'd probably need to leave more room > >>>>>> because the entire guest OS will also subject to the same limit. > >>>>>> > >>>>> > >>>>> I had tried this, but the variance in time/zswap numbers was very high. > >>>>> Much higher than the AMD numbers I posted in reply to Barry. So found > >>>>> it very difficult to make comparison. > >>>> > >>>> Hmm yeah maybe more factors come into play with global memory > >>>> pressure. I am honestly not sure how to test this scenario, and I > >>>> suspect variance will be high anyway. > >>>> > >>>> We can just try to use whatever technique we use for the memcg limit > >>>> though, if possible, right? > >>> > >>> You can boot a physical machine with mem=1G on the commandline, which > >>> restricts the physical range of memory that will be initialized. > >>> Double check /proc/meminfo after boot, because part of that physical > >>> range might not be usable RAM. > >>> > >>> I do this quite often to test physical memory pressure with workloads > >>> that don't scale up easily, like kernel builds. > >>> > >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead > >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window > >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time > >>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as > >>>>>>>>> well. > >>> > >>> +1 > >>> > >>> I also think there is too much focus on cgroup alone. The bigger issue > >>> seems to be how much optimistic volume we swap in when we're under > >>> pressure already. This applies to large folios and readahead; global > >>> memory availability and cgroup limits. > >> > >> The current swap readahead logic is something like, > >> > >> 1. try readahead some pages for sequential access pattern, mark them as > >> readahead > >> > >> 2. if these readahead pages get accessed before swapped out again, > >> increase 'hits' counter > >> > >> 3. 
for next swap in, try readahead 'hits' pages and clear 'hits'. > >> > >> So, if there's heavy memory pressure, the readaheaded pages will not be > >> accessed before being swapped out again (in 2 above), the readahead > >> pages will be minimal. > >> > >> IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in > >> addition to the pages accessed are swapped in, the adjacent pages are > >> swapped in (swap readahead) too. If these readahead pages are not > >> accessed before swapped out again, system runs into more severe > >> thrashing. This is because we lack the swap readahead window scaling > >> mechanism as above. And, this is why I suggested to combine the swap > >> readahead mechanism and mTHP swap-in by default before. That is, when > >> kernel swaps in a page, it checks current swap readahead window, and > >> decides mTHP order according to window size. So, if there are heavy > >> memory pressure, so that the nearby pages will not be accessed before > >> being swapped out again, the mTHP swap-in order can be adjusted > >> automatically. > > > > This is a good idea to do, but I think the issue is that readahead > > is a folio flag and not a page flag, so only works when folio size is 1. > > > > In the swapin_readahead swapcache path, the current implementation decides > > the ra_window based on hits, which is incremented in swap_cache_get_folio > > if it has not been gotten from swapcache before. > > The problem would be that we need information on how many distinct pages in > > a large folio that has been swapped in have been accessed to decide the > > hits/window size, which I don't think is possible. As once the entire large > > folio has been swapped in, we won't get a fault. > > > > To do that, we need to move readahead flag to per-page from per-folio. > And we need to map only the accessed page of the folio in page fault > handler. This may impact performance. So, we may only do that for > sampled folios only, for example, every 100 folios. I'm not entirely sure there's a chance to gain traction on this, as the current trend clearly leans toward moving flags from page to folio, not from folio to page :-) > > >> > >>> It happens to manifest with THP in cgroups because that's what you > >>> guys are testing. But IMO, any solution to this problem should > >>> consider the wider scope. > >>> > >>>>>>>> I think large folio swapin would make the problem worse anyway. I am > >>>>>>>> also not sure if the readahead window adjusts on memory pressure or > >>>>>>>> not. > >>>>>>>> > >>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being > >>>>>>> seen here as there would be in swapin_readahead? > >>>>>> > >>>>>> Maybe readahead is not as aggressive in general as large folio > >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order > >>>>>> of the window is the smaller of page_cluster (2 or 3) and > >>>>>> SWAP_RA_ORDER_CEILING (5). > >>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might > >>>>> be similar to enabling 32K mTHP? > >>>> > >>>> Not quite. > >>> > >>> Actually, I would expect it to be... > >> > >> Me too. > >> > >>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a > >>>>>> contiguous allocation like large folio swapin. So that could be > >>>>>> another factor why readahead may not reproduce the problem. > >>>> > >>>> Because of this ^. > >>> > >>> ...this matters for the physical allocation, which might require more > >>> reclaim and compaction to produce the 32k. 
But an earlier version of > >>> Barry's patch did the cgroup margin fallback after the THP was already > >>> physically allocated, and it still helped. > >>> > >>> So the issue in this test scenario seems to be mostly about cgroup > >>> volume. And then 8 4k charges should be equivalent to a singular 32k > >>> charge when it comes to cgroup pressure. > > -- > Best Regards, > Huang, Ying
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Usama Arif <usamaarif642@gmail.com> writes:
>>
>> > On 04/11/2024 06:42, Huang, Ying wrote:
>> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >>
>> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
>> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>> >>>>>>>>> then the mitigation being tried here wont apply.
>> >>>>>>>>
>> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>> >>>>>>>> entire system is under memory pressure. I guess we would want the same
>> >>>>>>>> "mitigation" either way.
>> >>>>>>>>
>> >>>>>>> What would be a good open source benchmark/workload to test without limiting memory
>> >>>>>>> in memcg?
>> >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
>> >>>>>>> in cgroup and limit memory.max.
>> >>>>>>
>> >>>>>> You mean a benchmark that puts the entire system under memory
>> >>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>> >>>>>> you have, among other factors.
>> >>>>>>
>> >>>>>> What if you run the kernel build test in a VM? Then you can limit is
>> >>>>>> size like a memcg, although you'd probably need to leave more room
>> >>>>>> because the entire guest OS will also subject to the same limit.
>> >>>>>>
>> >>>>>
>> >>>>> I had tried this, but the variance in time/zswap numbers was very high.
>> >>>>> Much higher than the AMD numbers I posted in reply to Barry. So found
>> >>>>> it very difficult to make comparison.
>> >>>>
>> >>>> Hmm yeah maybe more factors come into play with global memory
>> >>>> pressure. I am honestly not sure how to test this scenario, and I
>> >>>> suspect variance will be high anyway.
>> >>>>
>> >>>> We can just try to use whatever technique we use for the memcg limit
>> >>>> though, if possible, right?
>> >>>
>> >>> You can boot a physical machine with mem=1G on the commandline, which
>> >>> restricts the physical range of memory that will be initialized.
>> >>> Double check /proc/meminfo after boot, because part of that physical
>> >>> range might not be usable RAM.
>> >>>
>> >>> I do this quite often to test physical memory pressure with workloads
>> >>> that don't scale up easily, like kernel builds.
>> >>>
>> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>> >>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
>> >>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>> >>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>> >>>>>>>>> well.
>> >>>
>> >>> +1
>> >>>
>> >>> I also think there is too much focus on cgroup alone. The bigger issue
>> >>> seems to be how much optimistic volume we swap in when we're under
>> >>> pressure already. This applies to large folios and readahead; global
>> >>> memory availability and cgroup limits.
>> >>
>> >> The current swap readahead logic is something like,
>> >>
>> >> 1. try readahead some pages for sequential access pattern, mark them as
>> >> readahead
>> >>
>> >> 2. if these readahead pages get accessed before swapped out again,
>> >> increase 'hits' counter
>> >>
>> >> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
>> >>
>> >> So, if there's heavy memory pressure, the readaheaded pages will not be
>> >> accessed before being swapped out again (in 2 above), the readahead
>> >> pages will be minimal.
>> >>
>> >> IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in
>> >> addition to the pages accessed are swapped in, the adjacent pages are
>> >> swapped in (swap readahead) too. If these readahead pages are not
>> >> accessed before swapped out again, system runs into more severe
>> >> thrashing. This is because we lack the swap readahead window scaling
>> >> mechanism as above. And, this is why I suggested to combine the swap
>> >> readahead mechanism and mTHP swap-in by default before. That is, when
>> >> kernel swaps in a page, it checks current swap readahead window, and
>> >> decides mTHP order according to window size. So, if there are heavy
>> >> memory pressure, so that the nearby pages will not be accessed before
>> >> being swapped out again, the mTHP swap-in order can be adjusted
>> >> automatically.
>> >
>> > This is a good idea to do, but I think the issue is that readahead
>> > is a folio flag and not a page flag, so only works when folio size is 1.
>> >
>> > In the swapin_readahead swapcache path, the current implementation decides
>> > the ra_window based on hits, which is incremented in swap_cache_get_folio
>> > if it has not been gotten from swapcache before.
>> > The problem would be that we need information on how many distinct pages in
>> > a large folio that has been swapped in have been accessed to decide the
>> > hits/window size, which I don't think is possible. As once the entire large
>> > folio has been swapped in, we won't get a fault.
>> >
>>
>> To do that, we need to move readahead flag to per-page from per-folio.
>> And we need to map only the accessed page of the folio in page fault
>> handler. This may impact performance. So, we may only do that for
>> sampled folios only, for example, every 100 folios.
>
> I'm not entirely sure there's a chance to gain traction on this, as the current
> trend clearly leans toward moving flags from page to folio, not from folio to
> page :-)

This may be a problem. However, I think we can try to find a solution
for this. Anyway, we need some way to track per-page status in a folio,
regardless how to implement it.

>>
>> >>
>> >>> It happens to manifest with THP in cgroups because that's what you
>> >>> guys are testing. But IMO, any solution to this problem should
>> >>> consider the wider scope.
>> >>>
>> >>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>> >>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>> >>>>>>>> not.
>> >>>>>>>>
>> >>>>>>> readahead window doesnt look at memory pressure. So maybe the same thing is being
>> >>>>>>> seen here as there would be in swapin_readahead?
>> >>>>>>
>> >>>>>> Maybe readahead is not as aggressive in general as large folio
>> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>> >>>>>> of the window is the smaller of page_cluster (2 or 3) and
>> >>>>>> SWAP_RA_ORDER_CEILING (5).
>> >>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>> >>>>> be similar to enabling 32K mTHP?
>> >>>>
>> >>>> Not quite.
>> >>>
>> >>> Actually, I would expect it to be...
>> >>
>> >> Me too.
>> >>
>> >>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
>> >>>>>> contiguous allocation like large folio swapin. So that could be
>> >>>>>> another factor why readahead may not reproduce the problem.
>> >>>>
>> >>>> Because of this ^.
>> >>>
>> >>> ...this matters for the physical allocation, which might require more
>> >>> reclaim and compaction to produce the 32k. But an earlier version of
>> >>> Barry's patch did the cgroup margin fallback after the THP was already
>> >>> physically allocated, and it still helped.
>> >>>
>> >>> So the issue in this test scenario seems to be mostly about cgroup
>> >>> volume. And then 8 4k charges should be equivalent to a singular 32k
>> >>> charge when it comes to cgroup pressure.

--
Best Regards,
Huang, Ying
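To make the readahead window scaling Ying describes above more concrete, here is a minimal stand-alone C model (not kernel code) of the feedback loop: speculatively swapped-in pages earn "hits" when they are touched before being evicted again, the next window grows with the hits, and the window is then mapped to an mTHP swap-in order as Ying proposes. The helper names (next_window, window_to_order), the "hits + 2" growth rule, and the caps are illustrative assumptions, not the actual swap_vma_ra_win() implementation.

/*
 * Stand-alone model (not kernel code) of a readahead-style feedback loop:
 * the window for the next speculative swap-in grows when previously
 * readahead pages were actually touched, and shrinks when they were not.
 * The largest folio order that fits in the window is then used for the
 * mTHP swap-in.  All names and constants here are illustrative.
 */
#include <stdio.h>

struct ra_state {
	unsigned int hits;	/* readahead pages touched since the last fault */
	unsigned int win;	/* current speculative window, in pages */
};

/* Grow the window when speculation paid off, shrink it when it did not. */
static unsigned int next_window(struct ra_state *ra, unsigned int max_win)
{
	unsigned int win = ra->hits + 2;	/* faulting page plus a little extra */

	if (win > max_win)
		win = max_win;
	ra->hits = 0;		/* consumed; must be re-earned by future accesses */
	ra->win = win;
	return win;
}

/* Pick the largest folio order whose size still fits inside the window. */
static unsigned int window_to_order(unsigned int win, unsigned int max_order)
{
	unsigned int order = 0;

	while (order < max_order && (2u << order) <= win)
		order++;
	return order;
}

int main(void)
{
	struct ra_state ra = { .hits = 0, .win = 1 };
	/* hits observed before each of a few consecutive faults */
	unsigned int observed_hits[] = { 0, 3, 7, 1, 0 };

	for (unsigned int i = 0; i < sizeof(observed_hits) / sizeof(observed_hits[0]); i++) {
		unsigned int win, order;

		ra.hits = observed_hits[i];
		win = next_window(&ra, 8);		/* cap similar to a small page_cluster */
		order = window_to_order(win, 4);	/* cap at order-4 (64K with 4K pages) */
		printf("fault %u: hits=%u -> window=%u pages -> order-%u swap-in\n",
		       i, observed_hits[i], win, order);
	}
	return 0;
}

Under heavy pressure the touched-before-eviction count stays near zero, so the window, and with it the swap-in order, collapses back towards a single page; that is the self-correcting behaviour the existing 4K readahead path already has and mTHP swap-in currently lacks.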
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 524006313b0d..8bcc8f4af39f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
 int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 				  long nr_pages);
 
+int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
+				     swp_entry_t *entry);
+
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				   gfp_t gfp, swp_entry_t entry);
 
@@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
+						   swp_entry_t *entry)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 		struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17af08367c68..f3d92b93ea6d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 	return 0;
 }
 
+static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
+{
+	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+		if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
+			return false;
+	}
+
+	return true;
+}
+
+/**
+ * mem_cgroup_precharge_large_folio: Precharge large folios.
+ *
+ * @mm: mm context of the victim
+ * @entry: swap entry for which the folio will be allocated
+ *
+ * If we are arriving at the edge of an almost full memcg, return an error
+ * so that swap-in and anon faults can quickly fall back to small folios to
+ * avoid swap thrashing.
+ *
+ * Returns 0 on success, an error code on failure.
+ */
+int mem_cgroup_precharge_large_folio(struct mm_struct *mm, swp_entry_t *entry)
+{
+	struct mem_cgroup *memcg = NULL;
+	unsigned short id;
+	bool has_margin;
+
+	if (mem_cgroup_disabled())
+		return 0;
+
+	rcu_read_lock();
+	if (entry) {
+		id = lookup_swap_cgroup_id(*entry);
+		memcg = mem_cgroup_from_id(id);
+	}
+	if (!memcg || !css_tryget_online(&memcg->css))
+		memcg = get_mem_cgroup_from_mm(mm);
+	has_margin = mem_cgroup_has_margin(memcg);
+	rcu_read_unlock();
+
+	css_put(&memcg->css);
+	return has_margin ? 0 : -ENOMEM;
+}
+
 /**
  * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
  * @folio: folio to charge.
diff --git a/mm/memory.c b/mm/memory.c
index 0f614523b9f4..96368ba0e8a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4168,6 +4168,16 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 
 	pte_unmap_unlock(pte, ptl);
 
+	if (!orders)
+		goto fallback;
+
+	/*
+	 * Avoid swapping in large folios when memcg is nearly full, as it
+	 * may quickly trigger additional swap-out and swap-in cycles.
+	 */
+	if (mem_cgroup_precharge_large_folio(vma->vm_mm, &entry))
+		goto fallback;
+
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {
@@ -4707,6 +4717,13 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	if (!orders)
 		goto fallback;
 
+	/*
+	 * When memcg is nearly full, large folios can rapidly fill
+	 * the margin and trigger new reclamation
+	 */
+	if (mem_cgroup_precharge_large_folio(vma->vm_mm, NULL))
+		goto fallback;
+
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {
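The hierarchy walk in mem_cgroup_has_margin() above is the part that is easy to miss when skimming the patch: the precharge succeeds only if the faulting cgroup and every ancestor still have at least HPAGE_PMD_NR pages of headroom, so a nearly full parent forces the order-0 fallback even when the child itself has plenty of room. Below is a minimal stand-alone C model (not kernel code) of that walk; the struct layout, the limits, and the NULL parent standing in for the unlimited root cgroup are simplifying assumptions.

#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512	/* 2M PMD THP in 4K pages, as on x86-64 */

struct memcg {
	long limit;		/* memory.max, in pages */
	long usage;		/* current charge, in pages */
	struct memcg *parent;	/* NULL stands in for the unlimited root */
};

static long margin(const struct memcg *cg)
{
	long m = cg->limit - cg->usage;

	return m > 0 ? m : 0;
}

/* Mirrors the loop in the patch: fail if any level is nearly full. */
static bool has_margin(const struct memcg *cg)
{
	for (; cg; cg = cg->parent) {
		if (margin(cg) < HPAGE_PMD_NR)
			return false;
	}
	return true;
}

int main(void)
{
	/* 1G parent that is ~144 pages from its limit, half-empty child */
	struct memcg parent = { .limit = 262144, .usage = 262000, .parent = NULL };
	struct memcg child  = { .limit = 131072, .usage = 65536, .parent = &parent };

	printf("child margin alone : %s\n",
	       margin(&child) >= HPAGE_PMD_NR ? "large folio ok" : "fallback to 4K");
	printf("with ancestors     : %s\n",
	       has_margin(&child) ? "large folio ok" : "fallback to 4K");
	return 0;
}

One design point the sketch makes visible: as posted, the threshold is always HPAGE_PMD_NR regardless of the order actually being requested, so even a 16K swap-in is refused once the margin anywhere in the hierarchy drops below 2M worth of pages.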