Message ID | 20241018105026.2521366-1-usamaarif642@gmail.com (mailing list archive) |
---|---|
State | New |
On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
> After large folio zswapout support added in [1], this patch adds
> support for zswapin of large folios to bring it on par with zram.
> This series makes sure that the benefits of large folios (fewer
> page faults, batched PTE and rmap manipulation, reduced lru list,
> TLB coalescing (for arm64 and amd)) are not lost at swap out when
> using zswap.
>
> It builds on top of [2] which added large folio swapin support for
> zram and provides the same level of large folio swapin support as
> zram, i.e. only supporting swap count == 1.
>
> Patch 1 skips swapcache for swapping in zswap pages, this should improve
> no readahead swapin performance [3], and also allows us to build on large
> folio swapin support added in [2], hence is a prerequisite for patch 3.
>
> Patch 3 adds support for large folio zswapin. This patch does not add
> support for hybrid backends (i.e. folios partly present swap and zswap).
>
> The main performance benefit comes from maintaining large folios *after*
> swapin, large folio performance improvements have been mentioned in previous
> series posted on it [2],[4], so have not added those. Below is a simple
> microbenchmark to measure the time needed *for* zswpin of 1G memory (along
> with memory integrity check).
>
>                                 | no mTHP (ms) | 1M mTHP enabled (ms)
> Base kernel                     | 1165         | 1163
> Kernel with mTHP zswpin series  | 1203         | 738

Hi Usama,
Do you know where this minor regression for non-mTHP comes from?
As you even have skipped swapcache for small folios in zswap in patch1,
that part should have some gain? Is it because of zswap_present_test()?

>
> The time measured was pretty consistent between runs (~1-2% variation).
> There is 36% improvement in zswapin time with 1M folios. The percentage
> improvement is likely to be more if the memcmp is removed.
>
> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
> index 40de679248b8..77068c577c86 100644
> --- a/tools/testing/selftests/cgroup/test_zswap.c
> +++ b/tools/testing/selftests/cgroup/test_zswap.c
> @@ -9,6 +9,8 @@
>  #include <string.h>
>  #include <sys/wait.h>
>  #include <sys/mman.h>
> +#include <sys/time.h>
> +#include <malloc.h>
>
>  #include "../kselftest.h"
>  #include "cgroup_util.h"
> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root)
>  	return test_zswap_writeback(root, false);
>  }
>
> +static int zswapin_perf(const char *cgroup, void *arg)
> +{
> +	long pagesize = sysconf(_SC_PAGESIZE);
> +	size_t memsize = MB(1*1024);
> +	char buf[pagesize];
> +	int ret = -1;
> +	char *mem;
> +	struct timeval start, end;
> +
> +	mem = (char *)memalign(2*1024*1024, memsize);
> +	if (!mem)
> +		return ret;
> +
> +	/*
> +	 * Fill half of each page with increasing data, and keep other
> +	 * half empty, this will result in data that is still compressible
> +	 * and ends up in zswap, with material zswap usage.
> +	 */
> +	for (int i = 0; i < pagesize; i++)
> +		buf[i] = i < pagesize/2 ? (char) i : 0;
> +
> +	for (int i = 0; i < memsize; i += pagesize)
> +		memcpy(&mem[i], buf, pagesize);
> +
> +	/* Try and reclaim allocated memory */
> +	if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) {
> +		ksft_print_msg("Failed to reclaim all of the requested memory\n");
> +		goto out;
> +	}
> +
> +	gettimeofday(&start, NULL);
> +	/* zswpin */
> +	for (int i = 0; i < memsize; i += pagesize) {
> +		if (memcmp(&mem[i], buf, pagesize)) {
> +			ksft_print_msg("invalid memory\n");
> +			goto out;
> +		}
> +	}
> +	gettimeofday(&end, NULL);
> +	printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000);
> +	ret = 0;
> +out:
> +	free(mem);
> +	return ret;
> +}
> +
> +static int test_zswapin_perf(const char *root)
> +{
> +	int ret = KSFT_FAIL;
> +	char *test_group;
> +
> +	test_group = cg_name(root, "zswapin_perf_test");
> +	if (!test_group)
> +		goto out;
> +	if (cg_create(test_group))
> +		goto out;
> +
> +	if (cg_run(test_group, zswapin_perf, NULL))
> +		goto out;
> +
> +	ret = KSFT_PASS;
> +out:
> +	cg_destroy(test_group);
> +	free(test_group);
> +	return ret;
> +}
> +
>  /*
>   * When trying to store a memcg page in zswap, if the memcg hits its memory
>   * limit in zswap, writeback should affect only the zswapped pages of that
> @@ -584,6 +654,7 @@ struct zswap_test {
>  	T(test_zswapin),
>  	T(test_zswap_writeback_enabled),
>  	T(test_zswap_writeback_disabled),
> +	T(test_zswapin_perf),
>  	T(test_no_kmem_bypass),
>  	T(test_no_invasive_cgroup_shrink),
>  };
>
> [1] https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@intel.com/
> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@oppo.com/
> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#u
> [4] https://lwn.net/Articles/955575/
>
> Usama Arif (4):
>   mm/zswap: skip swapcache for swapping in zswap pages
>   mm/zswap: modify zswap_decompress to accept page instead of folio
>   mm/zswap: add support for large folio zswapin
>   mm/zswap: count successful large folio zswap loads
>
>  Documentation/admin-guide/mm/transhuge.rst |   3 +
>  include/linux/huge_mm.h                    |   1 +
>  include/linux/zswap.h                      |   6 ++
>  mm/huge_memory.c                           |   3 +
>  mm/memory.c                                |  16 +--
>  mm/page_io.c                               |   2 +-
>  mm/zswap.c                                 | 120 ++++++++++++++-------
>  7 files changed, 99 insertions(+), 52 deletions(-)
>
> --
> 2.43.5
>

Thanks
barry
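A minimal sketch of building and running this selftest, assuming the standard kselftest layout, a cgroup v2 mount and zswap already enabled (not necessarily the exact invocation used for the numbers above):

  # from the top of a kernel tree
  make -C tools/testing/selftests TARGETS=cgroup
  cd tools/testing/selftests/cgroup && sudo ./test_zswap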
On 21/10/2024 06:09, Barry Song wrote:
> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
>> [...]
>>                                 | no mTHP (ms) | 1M mTHP enabled (ms)
>> Base kernel                     | 1165         | 1163
>> Kernel with mTHP zswpin series  | 1203         | 738
>
> Hi Usama,
> Do you know where this minor regression for non-mTHP comes from?
> As you even have skipped swapcache for small folios in zswap in patch1,
> that part should have some gain? Is it because of zswap_present_test()?

Hi Barry,

The microbenchmark does a sequential read of 1G of memory, so it probably
isn't very representative of real-world use cases. It also means that
swap_vma_readahead is able to read ahead all pages in its window accurately.
With this patch series, if doing 4K swapin, you get 1G/4K calls of fast
do_swap_page. Without this patch, you get 1G/(4K * readahead window) slower
do_swap_page calls. I had added some prints and was seeing 8 pages being
read ahead in one do_swap_page. The larger number of calls causes the slight
regression (even though each call is quite fast). I think in a realistic
scenario, where the readahead window won't be as large, there won't be a
regression. The cost of zswap_present_test in the whole call stack of
swapping in a page is very low and I think can be ignored.

I think the more interesting thing is what Kanchana pointed out in
https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/
I am curious, did you see this when testing large folio swapin and
compression at 4K granularity? It looks like swap thrashing, so I think it
would be common between zswap and zram. I don't have larger-granularity
zswap compression done, which is why I think there is a regression in time
taken. (It could also be because it was tested on Intel.)

Thanks,
Usama
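A rough back-of-the-envelope for the call counts described above, assuming a 1G region, 4K pages and the 8-page readahead window seen with the debug prints (actual windows vary):

  echo $(( (1 << 30) / 4096 ))        # ~262144 do_swap_page calls when each fault swaps in a single 4K page
  echo $(( (1 << 30) / (4096 * 8) ))  # ~32768 do_swap_page calls when readahead faults in 8 pages at a time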
On 21/10/2024 11:40, Usama Arif wrote:
> I think the more interesting thing is what Kanchana pointed out in
> https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/
> I am curious, did you see this when testing large folio swapin and
> compression at 4K granularity? It looks like swap thrashing, so I think it
> would be common between zswap and zram.

Hi,

So I have been doing some benchmarking after Kanchana pointed out a
performance regression in [1] when swapping in large folios. I would love to
get thoughts from the zram folks on this, as that is where large folio swapin
was first added [2]. As far as I can see, the current support in zram is
doing large folio swapin at 4K granularity.
The large-granularity compression in [3], which was posted in March, is not
merged, so I am currently comparing upstream zram with this series.

With the microbenchmark below of timing 1G swapin, there was a very large
improvement in performance by using this series. I think similar numbers
would be seen in zram.

But when doing a kernel build test, Kanchana saw a regression in [1]. I
believe it's because of swap thrashing (causing large zswap activity), due to
larger page swapin. The part of the code that decides large folio swapin is
the same between zswap and zram, so I believe this would be observed in zram
as well.

My initial thought was this might be because it's Intel, where you don't have
the advantage of TLB coalescing, so I tested on AMD and ARM as well, but the
regression is there on AMD and ARM too, though a bit less (numbers below).

The numbers show that zswap activity increases and page faults decrease.
Overall this results in sys time increasing and real time slightly
increasing, likely because the cost of the increased zswap activity is more
than the benefit of the lower page faults. I can see in [3] that page faults
reduced in zram as well.

Large folio swapin shows good numbers in microbenchmarks that only target
reduced page faults and sequential swapin, but not in the kernel build test.
Is a similar regression observed with zram when enabling large folio swapin
on the kernel build test? Maybe large folio swapin makes more sense on
workloads where mappings are kept for a longer time?

Kernel build numbers in a cgroup with memory.max=4G to trigger zswap
Command for AMD: make defconfig; time make -j$(nproc) bzImage
Command for ARM: make defconfig; time make -j$(nproc) Image

AMD 16K+32K THP=always
metric          mm-unstable     mm-unstable + large folio zswapin series
real            1m23.038s       1m23.050s
user            53m57.210s      53m53.437s
sys             7m24.592s       7m48.843s
zswpin          612070          999244
zswpout         2226403         2347979
pgfault         20667366        20481728
pgmajfault      385887          269117

AMD 16K+32K+64K THP=always
metric          mm-unstable     mm-unstable + large folio zswapin series
real            1m22.975s       1m23.266s
user            53m51.302s      53m51.069s
sys             7m40.168s       7m57.104s
zswpin          676492          1258573
zswpout         2449839         2714767
pgfault         17540746        17296555
pgmajfault      429629          307495
--------------------------
ARM 16K+32K THP=always
metric          mm-unstable     mm-unstable + large folio zswapin series
real            0m51.168s       0m52.086s
user            25m14.715s      25m15.765s
sys             17m18.856s      18m8.031s
zswpin          3904129         7339245
zswpout         11171295        13473461
pgfault         37313345        36011338
pgmajfault      2726253         1932642

ARM 16K+32K+64K THP=always
metric          mm-unstable     mm-unstable + large folio zswapin series
real            0m52.017s       0m53.828s
user            25m2.742s       25m0.046s
sys             18m24.525s      20m26.207s
zswpin          4853571         8908664
zswpout         12297199        15768764
pgfault         32158152        30425519
pgmajfault      3320717         2237015

Thanks!
Usama

[1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/
[2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/
[3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
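For anyone wanting to reproduce, a minimal sketch of this kind of cgroup setup (cgroup v2 assumed mounted at /sys/fs/cgroup, swap and zswap already enabled; the group name and counter selection are illustrative, not the exact commands used for the numbers above):

  mkdir /sys/fs/cgroup/buildtest
  echo 4G > /sys/fs/cgroup/buildtest/memory.max     # small enough that reclaim pushes pages into zswap

  # snapshot the per-cgroup counters, run the build inside the group, then diff
  grep -E 'zswpin|zswpout|pgfault|pgmajfault' /sys/fs/cgroup/buildtest/memory.stat > before.stat
  echo $$ > /sys/fs/cgroup/buildtest/cgroup.procs
  make defconfig && time make -j"$(nproc)" bzImage
  grep -E 'zswpin|zswpout|pgfault|pgmajfault' /sys/fs/cgroup/buildtest/memory.stat > after.stat
  diff before.stat after.stat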
On Wed, Oct 23, 2024 at 4:26 AM Usama Arif <usamaarif642@gmail.com> wrote:
> [...]
> With the microbenchmark below of timing 1G swapin, there was a very large
> improvement in performance by using this series. I think similar numbers
> would be seen in zram.

Imagine running several apps on a phone and switching
between them: A → B → C → D → E … → A → B … The app
currently on the screen retains its memory, while the ones
sent to the background are swapped out. When we bring
those apps back to the foreground, their memory is restored.
This behavior is quite similar to what you're seeing with
your microbenchmark.

> But when doing a kernel build test, Kanchana saw a regression in [1]. I
> believe it's because of swap thrashing (causing large zswap activity), due
> to larger page swapin. The part of the code that decides large folio swapin
> is the same between zswap and zram, so I believe this would be observed in
> zram as well.

Is this an extreme case where the workload's working set far
exceeds the memory available under the memcg limit? I doubt mTHP
would provide any real benefit from the start if the workload is bound to
experience swap thrashing. What if we disable mTHP entirely?

> Large folio swapin shows good numbers in microbenchmarks that only target
> reduced page faults and sequential swapin, but not in the kernel build test.
> Is a similar regression observed with zram when enabling large folio swapin
> on the kernel build test? Maybe large folio swapin makes more sense on
> workloads where mappings are kept for a longer time?

I suspect this is because mTHP doesn't always benefit workloads
when available memory is quite limited compared to the working set.
In that case, mTHP swap-in might introduce more factors that
exacerbate the problem. We used to have an extra control "swapin_enabled"
for swap-in, but it never gained much traction:
https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@gmail.com/
We can reconsider whether to include the knob, but if it's better
to disable mTHP entirely for these cases, we can still adhere to
the policy of "enabled".

Using large block compression and decompression in zRAM will
significantly reduce CPU usage, likely making the issue unnoticeable.
However, the default minimum size for large block support is currently
set to 64KB (ZSMALLOC_MULTI_PAGES_ORDER = 4).
Thanks
Barry
On 22/10/2024 21:46, Barry Song wrote:
> Imagine running several apps on a phone and switching
> between them: A → B → C → D → E … → A → B … The app
> currently on the screen retains its memory, while the ones
> sent to the background are swapped out. When we bring
> those apps back to the foreground, their memory is restored.
> This behavior is quite similar to what you're seeing with
> your microbenchmark.

Hi Barry,

Thanks for explaining this! Do you know if there is some open source
benchmark we could use to show an improvement in app switching with large
folios? Also, I guess swap thrashing can happen when apps are brought back to
the foreground?

> Is this an extreme case where the workload's working set far
> exceeds the memory available under the memcg limit? I doubt mTHP
> would provide any real benefit from the start if the workload is bound to
> experience swap thrashing. What if we disable mTHP entirely?

I would agree, this is an extreme case. I wanted (z)swap activity to
happen, so I limited memory.max to 4G.

mTHP is beneficial in the kernel build benchmark when going from no mTHP
to 16K:

ARM make defconfig; time make -j$(nproc) Image, cgroup memory.max=4G
metric          no mTHP         16K mTHP=always
real            1m0.613s        0m52.008s
user            25m23.028s      25m19.488s
sys             25m45.466s      18m11.640s
zswpin          1911194         3108438
zswpout         6880815         9374628
pgfault         120430166       48976658
pgmajfault      1580674         2327086

> I suspect this is because mTHP doesn't always benefit workloads
> when available memory is quite limited compared to the working set.
> In that case, mTHP swap-in might introduce more factors that
> exacerbate the problem. We used to have an extra control "swapin_enabled"
> for swap-in, but it never gained much traction:
> https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@gmail.com/
> We can reconsider whether to include the knob, but if it's better
> to disable mTHP entirely for these cases, we can still adhere to
> the policy of "enabled".

Yes, I think this makes sense to have. The only thing is, it's too many
knobs! I personally think it's already difficult to decide up to which mTHP
size we should enable (and I think this changes per workload). But if we add
swapin_enabled on top of that, it can make things more difficult; the
per-size knobs involved are sketched below.

> Using large block compression and decompression in zRAM will
> significantly reduce CPU usage, likely making the issue unnoticeable.
> However, the default minimum size for large block support is currently
> set to 64KB (ZSMALLOC_MULTI_PAGES_ORDER = 4).

I saw that the patch was sent in March, and there weren't any updates
after that? Maybe I can try to cherry-pick it and see if we can develop
large-granularity compression for zswap.
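For reference, a rough sketch of the per-size mTHP knobs being discussed (available sizes depend on the architecture and base page size; enabling only 16K and 32K here is just an example, per Documentation/admin-guide/mm/transhuge.rst):

  # list the mTHP sizes this kernel exposes
  ls -d /sys/kernel/mm/transparent_hugepage/hugepages-*kB

  # enable only 16K and 32K mTHP; sizes left at "never" stay disabled,
  # and "inherit" follows the top-level "enabled" policy instead
  for s in 16kB 32kB; do
          echo always > /sys/kernel/mm/transparent_hugepage/hugepages-$s/enabled
  done
  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled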
On Wed, Oct 23, 2024 at 10:17 AM Usama Arif <usamaarif642@gmail.com> wrote:
> Thanks for explaining this! Do you know if there is some open source
> benchmark we could use to show an improvement in app switching with large
> folios?

I'm fairly certain the Android team has this benchmark, but it's not
open source. A straightforward way to simulate this is to use a script
that cyclically launches multiple applications, such as Chrome, Firefox,
Office, PDF, and others, for example:

launch chrome; launch firefox; launch youtube;
....
launch chrome; launch firefox;
....

On Android, we have the Android activity manager "am" command to do that:
https://gist.github.com/tsohr/5711945
Not quite sure if other window managers have similar tools. (A rough example
of such a cycling script is sketched below.)

> Also, I guess swap thrashing can happen when apps are brought back to
> the foreground?

Typically, the foreground app doesn't experience much swapping, as it
is the most recently or frequently used. However, this may not hold
for very low-end phones, where memory is significantly less than
the app's working set. For instance, we can't expect a good user
experience when playing a large game that requires 8GB of memory on
a 4GB phone! :-) And for low-end phones, we never even enable mTHP.
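A rough sketch of such a cycling script on Android using the "am" command (the package/activity names are only examples and would need to match what is actually installed; the dwell time is arbitrary):

  #!/system/bin/sh
  # cycle a few apps to mimic the A -> B -> C -> ... -> A switching pattern
  APPS="com.android.chrome/com.google.android.apps.chrome.Main \
        org.mozilla.firefox/org.mozilla.gecko.BrowserApp"
  for round in $(seq 1 20); do
          for a in $APPS; do
                  am start -W -n "$a"    # bring the app to the foreground, backgrounding the previous one
                  sleep 5
          done
  done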
> I would agree, this is an extreme case. I wanted (z)swap activity to
> happen, so I limited memory.max to 4G.
>
> mTHP is beneficial in the kernel build benchmark when going from no mTHP
> to 16K:
>
> ARM make defconfig; time make -j$(nproc) Image, cgroup memory.max=4G
> metric          no mTHP         16K mTHP=always
> real            1m0.613s        0m52.008s
> sys             25m45.466s      18m11.640s

Interesting!
We never use a phone to build the Linux kernel, but let me see if I can
find some other machines to reproduce your data.

> I saw that the patch was sent in March, and there weren't any updates
> after that? Maybe I can try to cherry-pick it and see if we can develop
> large-granularity compression for zswap.

Will provide an updated version next week.
> > >> > >> Kernel build numbers in cgroup with memory.max=4G to trigger zswap > >> Command for AMD: make defconfig; time make -j$(nproc) bzImage > >> Command for ARM: make defconfig; time make -j$(nproc) Image > >> > >> > >> AMD 16K+32K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series > >> real 1m23.038s 1m23.050s > >> user 53m57.210s 53m53.437s > >> sys 7m24.592s 7m48.843s > >> zswpin 612070 999244 > >> zswpout 2226403 2347979 > >> pgfault 20667366 20481728 > >> pgmajfault 385887 269117 > >> > >> AMD 16K+32K+64K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series > >> real 1m22.975s 1m23.266s > >> user 53m51.302s 53m51.069s > >> sys 7m40.168s 7m57.104s > >> zswpin 676492 1258573 > >> zswpout 2449839 2714767 > >> pgfault 17540746 17296555 > >> pgmajfault 429629 307495 > >> -------------------------- > >> ARM 16K+32K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series > >> real 0m51.168s 0m52.086s > >> user 25m14.715s 25m15.765s > >> sys 17m18.856s 18m8.031s > >> zswpin 3904129 7339245 > >> zswpout 11171295 13473461 > >> pgfault 37313345 36011338 > >> pgmajfault 2726253 1932642 > >> > >> > >> ARM 16K+32K+64K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series > >> real 0m52.017s 0m53.828s > >> user 25m2.742s 25m0.046s > >> sys 18m24.525s 20m26.207s > >> zswpin 4853571 8908664 > >> zswpout 12297199 15768764 > >> pgfault 32158152 30425519 > >> pgmajfault 3320717 2237015 > >> > >> > >> Thanks! > >> Usama > >> > >> > >> [1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/ > >> [2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/ > >> [3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ > >> > >>> > >>>>> > >>>>> The time measured was pretty consistent between runs (~1-2% variation). > >>>>> There is 36% improvement in zswapin time with 1M folios. The percentage > >>>>> improvement is likely to be more if the memcmp is removed. > >>>>> > >>>>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c > >>>>> index 40de679248b8..77068c577c86 100644 > >>>>> --- a/tools/testing/selftests/cgroup/test_zswap.c > >>>>> +++ b/tools/testing/selftests/cgroup/test_zswap.c > >>>>> @@ -9,6 +9,8 @@ > >>>>> #include <string.h> > >>>>> #include <sys/wait.h> > >>>>> #include <sys/mman.h> > >>>>> +#include <sys/time.h> > >>>>> +#include <malloc.h> > >>>>> > >>>>> #include "../kselftest.h" > >>>>> #include "cgroup_util.h" > >>>>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root) > >>>>> return test_zswap_writeback(root, false); > >>>>> } > >>>>> > >>>>> +static int zswapin_perf(const char *cgroup, void *arg) > >>>>> +{ > >>>>> + long pagesize = sysconf(_SC_PAGESIZE); > >>>>> + size_t memsize = MB(1*1024); > >>>>> + char buf[pagesize]; > >>>>> + int ret = -1; > >>>>> + char *mem; > >>>>> + struct timeval start, end; > >>>>> + > >>>>> + mem = (char *)memalign(2*1024*1024, memsize); > >>>>> + if (!mem) > >>>>> + return ret; > >>>>> + > >>>>> + /* > >>>>> + * Fill half of each page with increasing data, and keep other > >>>>> + * half empty, this will result in data that is still compressible > >>>>> + * and ends up in zswap, with material zswap usage. > >>>>> + */ > >>>>> + for (int i = 0; i < pagesize; i++) > >>>>> + buf[i] = i < pagesize/2 ? 
(char) i : 0; > >>>>> + > >>>>> + for (int i = 0; i < memsize; i += pagesize) > >>>>> + memcpy(&mem[i], buf, pagesize); > >>>>> + > >>>>> + /* Try and reclaim allocated memory */ > >>>>> + if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) { > >>>>> + ksft_print_msg("Failed to reclaim all of the requested memory\n"); > >>>>> + goto out; > >>>>> + } > >>>>> + > >>>>> + gettimeofday(&start, NULL); > >>>>> + /* zswpin */ > >>>>> + for (int i = 0; i < memsize; i += pagesize) { > >>>>> + if (memcmp(&mem[i], buf, pagesize)) { > >>>>> + ksft_print_msg("invalid memory\n"); > >>>>> + goto out; > >>>>> + } > >>>>> + } > >>>>> + gettimeofday(&end, NULL); > >>>>> + printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000); > >>>>> + ret = 0; > >>>>> +out: > >>>>> + free(mem); > >>>>> + return ret; > >>>>> +} > >>>>> + > >>>>> +static int test_zswapin_perf(const char *root) > >>>>> +{ > >>>>> + int ret = KSFT_FAIL; > >>>>> + char *test_group; > >>>>> + > >>>>> + test_group = cg_name(root, "zswapin_perf_test"); > >>>>> + if (!test_group) > >>>>> + goto out; > >>>>> + if (cg_create(test_group)) > >>>>> + goto out; > >>>>> + > >>>>> + if (cg_run(test_group, zswapin_perf, NULL)) > >>>>> + goto out; > >>>>> + > >>>>> + ret = KSFT_PASS; > >>>>> +out: > >>>>> + cg_destroy(test_group); > >>>>> + free(test_group); > >>>>> + return ret; > >>>>> +} > >>>>> + > >>>>> /* > >>>>> * When trying to store a memcg page in zswap, if the memcg hits its memory > >>>>> * limit in zswap, writeback should affect only the zswapped pages of that > >>>>> @@ -584,6 +654,7 @@ struct zswap_test { > >>>>> T(test_zswapin), > >>>>> T(test_zswap_writeback_enabled), > >>>>> T(test_zswap_writeback_disabled), > >>>>> + T(test_zswapin_perf), > >>>>> T(test_no_kmem_bypass), > >>>>> T(test_no_invasive_cgroup_shrink), > >>>>> }; > >>>>> > >>>>> [1] https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@intel.com/ > >>>>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@oppo.com/ > >>>>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#u > >>>>> [4] https://lwn.net/Articles/955575/ > >>>>> > >>>>> Usama Arif (4): > >>>>> mm/zswap: skip swapcache for swapping in zswap pages > >>>>> mm/zswap: modify zswap_decompress to accept page instead of folio > >>>>> mm/zswap: add support for large folio zswapin > >>>>> mm/zswap: count successful large folio zswap loads > >>>>> > >>>>> Documentation/admin-guide/mm/transhuge.rst | 3 + > >>>>> include/linux/huge_mm.h | 1 + > >>>>> include/linux/zswap.h | 6 ++ > >>>>> mm/huge_memory.c | 3 + > >>>>> mm/memory.c | 16 +-- > >>>>> mm/page_io.c | 2 +- > >>>>> mm/zswap.c | 120 ++++++++++++++------- > >>>>> 7 files changed, 99 insertions(+), 52 deletions(-) > >>>>> > >>>>> -- > >>>>> 2.43.5 > >>>>> > >>>> > > Thanks Barry
On Wed, Oct 23, 2024 at 11:07 AM Barry Song <21cnbao@gmail.com> wrote: > > On Wed, Oct 23, 2024 at 10:17 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > > > > On 22/10/2024 21:46, Barry Song wrote: > > > On Wed, Oct 23, 2024 at 4:26 AM Usama Arif <usamaarif642@gmail.com> wrote: > > >> > > >> > > >> > > >> On 21/10/2024 11:40, Usama Arif wrote: > > >>> > > >>> > > >>> On 21/10/2024 06:09, Barry Song wrote: > > >>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote: > > >>>>> > > >>>>> After large folio zswapout support added in [1], this patch adds > > >>>>> support for zswapin of large folios to bring it on par with zram. > > >>>>> This series makes sure that the benefits of large folios (fewer > > >>>>> page faults, batched PTE and rmap manipulation, reduced lru list, > > >>>>> TLB coalescing (for arm64 and amd)) are not lost at swap out when > > >>>>> using zswap. > > >>>>> > > >>>>> It builds on top of [2] which added large folio swapin support for > > >>>>> zram and provides the same level of large folio swapin support as > > >>>>> zram, i.e. only supporting swap count == 1. > > >>>>> > > >>>>> Patch 1 skips swapcache for swapping in zswap pages, this should improve > > >>>>> no readahead swapin performance [3], and also allows us to build on large > > >>>>> folio swapin support added in [2], hence is a prerequisite for patch 3. > > >>>>> > > >>>>> Patch 3 adds support for large folio zswapin. This patch does not add > > >>>>> support for hybrid backends (i.e. folios partly present swap and zswap). > > >>>>> > > >>>>> The main performance benefit comes from maintaining large folios *after* > > >>>>> swapin, large folio performance improvements have been mentioned in previous > > >>>>> series posted on it [2],[4], so have not added those. Below is a simple > > >>>>> microbenchmark to measure the time needed *for* zswpin of 1G memory (along > > >>>>> with memory integrity check). > > >>>>> > > >>>>> | no mTHP (ms) | 1M mTHP enabled (ms) > > >>>>> Base kernel | 1165 | 1163 > > >>>>> Kernel with mTHP zswpin series | 1203 | 738 > > >>>> > > >>>> Hi Usama, > > >>>> Do you know where this minor regression for non-mTHP comes from? > > >>>> As you even have skipped swapcache for small folios in zswap in patch1, > > >>>> that part should have some gain? is it because of zswap_present_test()? > > >>>> > > >>> > > >>> Hi Barry, > > >>> > > >>> The microbenchmark does a sequential read of 1G of memory, so it probably > > >>> isnt very representative of real world usecases. This also means that > > >>> swap_vma_readahead is able to readahead accurately all pages in its window. > > >>> With this patch series, if doing 4K swapin, you get 1G/4K calls of fast > > >>> do_swap_page. Without this patch, you get 1G/(4K*readahead window) of slow > > >>> do_swap_page calls. I had added some prints and I was seeing 8 pages being > > >>> readahead in 1 do_swap_page. The larger number of calls causes the slight > > >>> regression (eventhough they are quite fast). I think in a realistic scenario, > > >>> where readahead window wont be as large, there wont be a regression. > > >>> The cost of zswap_present_test in the whole call stack of swapping page is > > >>> very low and I think can be ignored. > > >>> > > >>> I think the more interesting thing is what Kanchana pointed out in > > >>> https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/ > > >>> I am curious, did you see this when testing large folio swapin and compression > > >>> at 4K granuality? 
Its looks like swap thrashing so I think it would be common > > >>> between zswap and zram. I dont have larger granuality zswap compression done, > > >>> which is why I think there is a regression in time taken. (It could be because > > >>> its tested on intel as well). > > >>> > > >>> Thanks, > > >>> Usama > > >>> > > >> > > >> Hi, > > >> > > >> So I have been doing some benchmarking after Kanchana pointed out a performance > > >> regression in [1] of swapping in large folio. I would love to get thoughts from > > >> zram folks on this, as thats where large folio swapin was first added [2]. > > >> As far as I can see, the current support in zram is doing large folio swapin > > >> at 4K granuality. The large granuality compression in [3] which was posted > > >> in March is not merged, so I am currently comparing upstream zram with this series. > > >> > > >> With the microbenchmark below of timing 1G swapin, there was a very large improvement > > >> in performance by using this series. I think similar numbers would be seen in zram. > > > > > > Imagine running several apps on a phone and switching > > > between them: A → B → C → D → E … → A → B … The app > > > currently on the screen retains its memory, while the ones > > > sent to the background are swapped out. When we bring > > > those apps back to the foreground, their memory is restored. > > > This behavior is quite similar to what you're seeing with > > > your microbenchmark. > > > > > > > Hi Barry, > > > > Thanks for explaining this! Do you know if there is some open source benchmark > > we could use to show an improvement in app switching with large folios? > > > > I’m fairly certain the Android team has this benchmark, but it’s not > open source. > > A straightforward way to simulate this is to use a script that > cyclically launches multiple applications, such as Chrome, Firefox, > Office, PDF, and others. > > for example: > > launch chrome; > launch firefox; > launch youtube; > .... > launch chrome; > launch firefox; > .... > > On Android, we have "Android activity manager 'am' command" to do that. > https://gist.github.com/tsohr/5711945 > > Not quite sure if other windows managers have similar tools. > > > Also I guess swap thrashing can happen when apps are brought back to foreground? > > > > Typically, the foreground app doesn't experience much swapping, > as it is the most recently or frequently used. However, this may > not hold for very low-end phones, where memory is significantly > less than the app's working set. For instance, we can't expect a > good user experience when playing a large game that requires 8GB > of memory on a 4GB phone! :-) > And for low-end phones, we never even enable mTHP. > > > >> > > >> But when doing kernel build test, Kanchana saw a regression in [1]. I believe > > >> its because of swap thrashing (causing large zswap activity), due to larger page swapin. > > >> The part of the code that decides large folio swapin is the same between zswap and zram, > > >> so I believe this would be observed in zram as well. > > > > > > Is this an extreme case where the workload's working set far > > > exceeds the available memory by memcg limitation? I doubt mTHP > > > would provide any real benefit from the start if the workload is bound to > > > experience swap thrashing. What if we disable mTHP entirely? > > > > > > > I would agree, this is an extreme case. I wanted (z)swap activity to happen so limited > > memory.max to 4G. 
> > > > mTHP is beneficial in kernel test benchmarking going from no mTHP to 16K: > > > > ARM make defconfig; time make -j$(nproc) Image, cgroup memory.max=4G > > metric no mTHP 16K mTHP=always > > real 1m0.613s 0m52.008s > > user 25m23.028s 25m19.488s > > sys 25m45.466s 18m11.640s > > zswpin 1911194 3108438 > > zswpout 6880815 9374628 > > pgfault 120430166 48976658 > > pgmajfault 1580674 2327086 > > > > > > Interesting! We never use a phone to build the Linux kernel, but > let me see if I can find some other machines to reproduce your data. Hi Usama, I suspect the regression occurs because you're running an edge case where the memory cgroup stays nearly full most of the time (this isn't an inherent issue with large folio swap-in). As a result, swapping in mTHP quickly triggers a memcg overflow, causing a swap-out. The next swap-in then recreates the overflow, leading to a repeating cycle. We need a way to stop the cup from repeatedly filling to the brim and overflowing. While not a definitive fix, the following change might help improve the situation: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 17af08367c68..f2fa0eeb2d9a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, memcg = get_mem_cgroup_from_mm(mm); rcu_read_unlock(); - ret = charge_memcg(folio, memcg, gfp); + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH) + ret = -ENOMEM; + else + ret = charge_memcg(folio, memcg, gfp); css_put(&memcg->css); return ret; } Please confirm if it makes the kernel build with memcg limitation faster. If so, let's work together to figure out an official patch :-) The above code hasn't consider the parent memcg's overflow, so not an ideal fix. > > > > > > > >> > > >> My initial thought was this might be because its intel, where you dont have the advantage > > >> of TLB coalescing, so tested on AMD and ARM, but the regression is there on AMD > > >> and ARM as well, though a bit less (have added the numbers below). > > >> > > >> The numbers show that the zswap activity increases and page faults decrease. > > >> Overall this does result in sys time increasing and real time slightly increases, > > >> likely because the cost of increased zswap activity is more than the benefit of > > >> lower page faults. > > >> I can see in [3] that pagefaults reduced in zram as well. > > >> > > >> Large folio swapin shows good numbers in microbenchmarks that just target reduce page > > >> faults and sequential swapin only, but not in kernel build test. Is a similar regression > > >> observed with zram when enabling large folio swapin on kernel build test? Maybe large > > >> folio swapin makes more sense on workloads where mappings are kept for a longer time? > > >> > > > > > > I suspect this is because mTHP doesn't always benefit workloads > > > when available memory is quite limited compared to the working set. > > > In that case, mTHP swap-in might introduce more features that > > > exacerbate the problem. We used to have an extra control "swapin_enabled" > > > for swap-in, but it never gained much traction: > > > https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@gmail.com/ > > > We can reconsider whether to include the knob, but if it's better > > > to disable mTHP entirely for these cases, we can still adhere to > > > the policy of "enabled". > > > > > Yes I think this makes sense to have. The only thing is, its too many knobs! 
> > I personally think its already difficult to decide upto which mTHP size we > > should enable (and I think this changes per workload). But if we add swapin_enabled > > on top of that it can make things more difficult. > > > > > Using large block compression and decompression in zRAM will > > > significantly reduce CPU usage, likely making the issue unnoticeable. > > > However, the default minimum size for large block support is currently > > > set to 64KB(ZSMALLOC_MULTI_PAGES_ORDER = 4). > > > > > > > I saw that the patch was sent in March, and there werent any updates after? > > Maybe I can try and cherry-pick that and see if we can develop large > > granularity compression for zswap. > > will provide an updated version next week. > > > > > >> > > >> Kernel build numbers in cgroup with memory.max=4G to trigger zswap > > >> Command for AMD: make defconfig; time make -j$(nproc) bzImage > > >> Command for ARM: make defconfig; time make -j$(nproc) Image > > >> > > >> > > >> AMD 16K+32K THP=always > > >> metric mm-unstable mm-unstable + large folio zswapin series > > >> real 1m23.038s 1m23.050s > > >> user 53m57.210s 53m53.437s > > >> sys 7m24.592s 7m48.843s > > >> zswpin 612070 999244 > > >> zswpout 2226403 2347979 > > >> pgfault 20667366 20481728 > > >> pgmajfault 385887 269117 > > >> > > >> AMD 16K+32K+64K THP=always > > >> metric mm-unstable mm-unstable + large folio zswapin series > > >> real 1m22.975s 1m23.266s > > >> user 53m51.302s 53m51.069s > > >> sys 7m40.168s 7m57.104s > > >> zswpin 676492 1258573 > > >> zswpout 2449839 2714767 > > >> pgfault 17540746 17296555 > > >> pgmajfault 429629 307495 > > >> -------------------------- > > >> ARM 16K+32K THP=always > > >> metric mm-unstable mm-unstable + large folio zswapin series > > >> real 0m51.168s 0m52.086s > > >> user 25m14.715s 25m15.765s > > >> sys 17m18.856s 18m8.031s > > >> zswpin 3904129 7339245 > > >> zswpout 11171295 13473461 > > >> pgfault 37313345 36011338 > > >> pgmajfault 2726253 1932642 > > >> > > >> > > >> ARM 16K+32K+64K THP=always > > >> metric mm-unstable mm-unstable + large folio zswapin series > > >> real 0m52.017s 0m53.828s > > >> user 25m2.742s 25m0.046s > > >> sys 18m24.525s 20m26.207s > > >> zswpin 4853571 8908664 > > >> zswpout 12297199 15768764 > > >> pgfault 32158152 30425519 > > >> pgmajfault 3320717 2237015 > > >> > > >> > > >> Thanks! > > >> Usama > > >> > > >> > > >> [1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/ > > >> [2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/ > > >> [3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ > > >> > > >>> > > >>>>> > > >>>>> The time measured was pretty consistent between runs (~1-2% variation). > > >>>>> There is 36% improvement in zswapin time with 1M folios. The percentage > > >>>>> improvement is likely to be more if the memcmp is removed. 
> > >>>>> > > >>>>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c > > >>>>> index 40de679248b8..77068c577c86 100644 > > >>>>> --- a/tools/testing/selftests/cgroup/test_zswap.c > > >>>>> +++ b/tools/testing/selftests/cgroup/test_zswap.c > > >>>>> @@ -9,6 +9,8 @@ > > >>>>> #include <string.h> > > >>>>> #include <sys/wait.h> > > >>>>> #include <sys/mman.h> > > >>>>> +#include <sys/time.h> > > >>>>> +#include <malloc.h> > > >>>>> > > >>>>> #include "../kselftest.h" > > >>>>> #include "cgroup_util.h" > > >>>>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root) > > >>>>> return test_zswap_writeback(root, false); > > >>>>> } > > >>>>> > > >>>>> +static int zswapin_perf(const char *cgroup, void *arg) > > >>>>> +{ > > >>>>> + long pagesize = sysconf(_SC_PAGESIZE); > > >>>>> + size_t memsize = MB(1*1024); > > >>>>> + char buf[pagesize]; > > >>>>> + int ret = -1; > > >>>>> + char *mem; > > >>>>> + struct timeval start, end; > > >>>>> + > > >>>>> + mem = (char *)memalign(2*1024*1024, memsize); > > >>>>> + if (!mem) > > >>>>> + return ret; > > >>>>> + > > >>>>> + /* > > >>>>> + * Fill half of each page with increasing data, and keep other > > >>>>> + * half empty, this will result in data that is still compressible > > >>>>> + * and ends up in zswap, with material zswap usage. > > >>>>> + */ > > >>>>> + for (int i = 0; i < pagesize; i++) > > >>>>> + buf[i] = i < pagesize/2 ? (char) i : 0; > > >>>>> + > > >>>>> + for (int i = 0; i < memsize; i += pagesize) > > >>>>> + memcpy(&mem[i], buf, pagesize); > > >>>>> + > > >>>>> + /* Try and reclaim allocated memory */ > > >>>>> + if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) { > > >>>>> + ksft_print_msg("Failed to reclaim all of the requested memory\n"); > > >>>>> + goto out; > > >>>>> + } > > >>>>> + > > >>>>> + gettimeofday(&start, NULL); > > >>>>> + /* zswpin */ > > >>>>> + for (int i = 0; i < memsize; i += pagesize) { > > >>>>> + if (memcmp(&mem[i], buf, pagesize)) { > > >>>>> + ksft_print_msg("invalid memory\n"); > > >>>>> + goto out; > > >>>>> + } > > >>>>> + } > > >>>>> + gettimeofday(&end, NULL); > > >>>>> + printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000); > > >>>>> + ret = 0; > > >>>>> +out: > > >>>>> + free(mem); > > >>>>> + return ret; > > >>>>> +} > > >>>>> + > > >>>>> +static int test_zswapin_perf(const char *root) > > >>>>> +{ > > >>>>> + int ret = KSFT_FAIL; > > >>>>> + char *test_group; > > >>>>> + > > >>>>> + test_group = cg_name(root, "zswapin_perf_test"); > > >>>>> + if (!test_group) > > >>>>> + goto out; > > >>>>> + if (cg_create(test_group)) > > >>>>> + goto out; > > >>>>> + > > >>>>> + if (cg_run(test_group, zswapin_perf, NULL)) > > >>>>> + goto out; > > >>>>> + > > >>>>> + ret = KSFT_PASS; > > >>>>> +out: > > >>>>> + cg_destroy(test_group); > > >>>>> + free(test_group); > > >>>>> + return ret; > > >>>>> +} > > >>>>> + > > >>>>> /* > > >>>>> * When trying to store a memcg page in zswap, if the memcg hits its memory > > >>>>> * limit in zswap, writeback should affect only the zswapped pages of that > > >>>>> @@ -584,6 +654,7 @@ struct zswap_test { > > >>>>> T(test_zswapin), > > >>>>> T(test_zswap_writeback_enabled), > > >>>>> T(test_zswap_writeback_disabled), > > >>>>> + T(test_zswapin_perf), > > >>>>> T(test_no_kmem_bypass), > > >>>>> T(test_no_invasive_cgroup_shrink), > > >>>>> }; > > >>>>> > > >>>>> [1] 
https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@intel.com/ > > >>>>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@oppo.com/ > > >>>>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#u > > >>>>> [4] https://lwn.net/Articles/955575/ > > >>>>> > > >>>>> Usama Arif (4): > > >>>>> mm/zswap: skip swapcache for swapping in zswap pages > > >>>>> mm/zswap: modify zswap_decompress to accept page instead of folio > > >>>>> mm/zswap: add support for large folio zswapin > > >>>>> mm/zswap: count successful large folio zswap loads > > >>>>> > > >>>>> Documentation/admin-guide/mm/transhuge.rst | 3 + > > >>>>> include/linux/huge_mm.h | 1 + > > >>>>> include/linux/zswap.h | 6 ++ > > >>>>> mm/huge_memory.c | 3 + > > >>>>> mm/memory.c | 16 +-- > > >>>>> mm/page_io.c | 2 +- > > >>>>> mm/zswap.c | 120 ++++++++++++++------- > > >>>>> 7 files changed, 99 insertions(+), 52 deletions(-) > > >>>>> > > >>>>> -- > > >>>>> 2.43.5 > > >>>>> > > >>>> > > > > Thanks Barry
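Since the diff above is wrapped by the mail client, here is the same idea as a minimal, untested C sketch: fail the charge for a large folio when the memcg has less than MEMCG_CHARGE_BATCH pages of headroom, so a swapped-in mTHP cannot immediately fill the cgroup to its limit and force another round of swap-out. Identifiers (mem_cgroup_swapin_charge_folio, mem_cgroup_margin, charge_memcg, folio_test_large) are the ones used in the diff; the earlier part of the function is abridged, and, as noted above, only the charged memcg's own margin is checked, not its ancestors'.

/*
 * Untested sketch of the proposed check in mem_cgroup_swapin_charge_folio(),
 * restated without the mail-client line wrapping.
 */
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
				   gfp_t gfp, swp_entry_t entry)
{
	struct mem_cgroup *memcg;
	int ret;

	rcu_read_lock();
	/* swap-cgroup id lookup elided; falls back to the mm's memcg */
	memcg = get_mem_cgroup_from_mm(mm);
	rcu_read_unlock();

	/*
	 * A large folio headed into a nearly full memcg (less than
	 * MEMCG_CHARGE_BATCH pages of margin): return -ENOMEM so the
	 * caller falls back to smaller folios instead of overflowing the
	 * limit and immediately triggering more reclaim/swap-out.
	 */
	if (folio_test_large(folio) &&
	    mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH)
		ret = -ENOMEM;
	else
		ret = charge_memcg(folio, memcg, gfp);

	css_put(&memcg->css);
	return ret;
}

On recent kernels MEMCG_CHARGE_BATCH is 64 pages, so with 4K pages this asks for roughly 256KB of free headroom before a large folio is charged on this path.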
On 23/10/2024 11:26, Barry Song wrote: > On Wed, Oct 23, 2024 at 11:07 AM Barry Song <21cnbao@gmail.com> wrote: >> >> On Wed, Oct 23, 2024 at 10:17 AM Usama Arif <usamaarif642@gmail.com> wrote: >>> >>> >>> >>> On 22/10/2024 21:46, Barry Song wrote: >>>> On Wed, Oct 23, 2024 at 4:26 AM Usama Arif <usamaarif642@gmail.com> wrote: >>>>> >>>>> >>>>> >>>>> On 21/10/2024 11:40, Usama Arif wrote: >>>>>> >>>>>> >>>>>> On 21/10/2024 06:09, Barry Song wrote: >>>>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>>>> >>>>>>>> After large folio zswapout support added in [1], this patch adds >>>>>>>> support for zswapin of large folios to bring it on par with zram. >>>>>>>> This series makes sure that the benefits of large folios (fewer >>>>>>>> page faults, batched PTE and rmap manipulation, reduced lru list, >>>>>>>> TLB coalescing (for arm64 and amd)) are not lost at swap out when >>>>>>>> using zswap. >>>>>>>> >>>>>>>> It builds on top of [2] which added large folio swapin support for >>>>>>>> zram and provides the same level of large folio swapin support as >>>>>>>> zram, i.e. only supporting swap count == 1. >>>>>>>> >>>>>>>> Patch 1 skips swapcache for swapping in zswap pages, this should improve >>>>>>>> no readahead swapin performance [3], and also allows us to build on large >>>>>>>> folio swapin support added in [2], hence is a prerequisite for patch 3. >>>>>>>> >>>>>>>> Patch 3 adds support for large folio zswapin. This patch does not add >>>>>>>> support for hybrid backends (i.e. folios partly present swap and zswap). >>>>>>>> >>>>>>>> The main performance benefit comes from maintaining large folios *after* >>>>>>>> swapin, large folio performance improvements have been mentioned in previous >>>>>>>> series posted on it [2],[4], so have not added those. Below is a simple >>>>>>>> microbenchmark to measure the time needed *for* zswpin of 1G memory (along >>>>>>>> with memory integrity check). >>>>>>>> >>>>>>>> | no mTHP (ms) | 1M mTHP enabled (ms) >>>>>>>> Base kernel | 1165 | 1163 >>>>>>>> Kernel with mTHP zswpin series | 1203 | 738 >>>>>>> >>>>>>> Hi Usama, >>>>>>> Do you know where this minor regression for non-mTHP comes from? >>>>>>> As you even have skipped swapcache for small folios in zswap in patch1, >>>>>>> that part should have some gain? is it because of zswap_present_test()? >>>>>>> >>>>>> >>>>>> Hi Barry, >>>>>> >>>>>> The microbenchmark does a sequential read of 1G of memory, so it probably >>>>>> isnt very representative of real world usecases. This also means that >>>>>> swap_vma_readahead is able to readahead accurately all pages in its window. >>>>>> With this patch series, if doing 4K swapin, you get 1G/4K calls of fast >>>>>> do_swap_page. Without this patch, you get 1G/(4K*readahead window) of slow >>>>>> do_swap_page calls. I had added some prints and I was seeing 8 pages being >>>>>> readahead in 1 do_swap_page. The larger number of calls causes the slight >>>>>> regression (eventhough they are quite fast). I think in a realistic scenario, >>>>>> where readahead window wont be as large, there wont be a regression. >>>>>> The cost of zswap_present_test in the whole call stack of swapping page is >>>>>> very low and I think can be ignored. >>>>>> >>>>>> I think the more interesting thing is what Kanchana pointed out in >>>>>> https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/ >>>>>> I am curious, did you see this when testing large folio swapin and compression >>>>>> at 4K granuality? 
Its looks like swap thrashing so I think it would be common >>>>>> between zswap and zram. I dont have larger granuality zswap compression done, >>>>>> which is why I think there is a regression in time taken. (It could be because >>>>>> its tested on intel as well). >>>>>> >>>>>> Thanks, >>>>>> Usama >>>>>> >>>>> >>>>> Hi, >>>>> >>>>> So I have been doing some benchmarking after Kanchana pointed out a performance >>>>> regression in [1] of swapping in large folio. I would love to get thoughts from >>>>> zram folks on this, as thats where large folio swapin was first added [2]. >>>>> As far as I can see, the current support in zram is doing large folio swapin >>>>> at 4K granuality. The large granuality compression in [3] which was posted >>>>> in March is not merged, so I am currently comparing upstream zram with this series. >>>>> >>>>> With the microbenchmark below of timing 1G swapin, there was a very large improvement >>>>> in performance by using this series. I think similar numbers would be seen in zram. >>>> >>>> Imagine running several apps on a phone and switching >>>> between them: A → B → C → D → E … → A → B … The app >>>> currently on the screen retains its memory, while the ones >>>> sent to the background are swapped out. When we bring >>>> those apps back to the foreground, their memory is restored. >>>> This behavior is quite similar to what you're seeing with >>>> your microbenchmark. >>>> >>> >>> Hi Barry, >>> >>> Thanks for explaining this! Do you know if there is some open source benchmark >>> we could use to show an improvement in app switching with large folios? >>> >> >> I’m fairly certain the Android team has this benchmark, but it’s not >> open source. >> >> A straightforward way to simulate this is to use a script that >> cyclically launches multiple applications, such as Chrome, Firefox, >> Office, PDF, and others. >> >> for example: >> >> launch chrome; >> launch firefox; >> launch youtube; >> .... >> launch chrome; >> launch firefox; >> .... >> >> On Android, we have "Android activity manager 'am' command" to do that. >> https://gist.github.com/tsohr/5711945 >> >> Not quite sure if other windows managers have similar tools. >> >>> Also I guess swap thrashing can happen when apps are brought back to foreground? >>> >> >> Typically, the foreground app doesn't experience much swapping, >> as it is the most recently or frequently used. However, this may >> not hold for very low-end phones, where memory is significantly >> less than the app's working set. For instance, we can't expect a >> good user experience when playing a large game that requires 8GB >> of memory on a 4GB phone! :-) >> And for low-end phones, we never even enable mTHP. >> >>>>> >>>>> But when doing kernel build test, Kanchana saw a regression in [1]. I believe >>>>> its because of swap thrashing (causing large zswap activity), due to larger page swapin. >>>>> The part of the code that decides large folio swapin is the same between zswap and zram, >>>>> so I believe this would be observed in zram as well. >>>> >>>> Is this an extreme case where the workload's working set far >>>> exceeds the available memory by memcg limitation? I doubt mTHP >>>> would provide any real benefit from the start if the workload is bound to >>>> experience swap thrashing. What if we disable mTHP entirely? >>>> >>> >>> I would agree, this is an extreme case. I wanted (z)swap activity to happen so limited >>> memory.max to 4G. 
>>> >>> mTHP is beneficial in kernel test benchmarking going from no mTHP to 16K: >>> >>> ARM make defconfig; time make -j$(nproc) Image, cgroup memory.max=4G >>> metric no mTHP 16K mTHP=always >>> real 1m0.613s 0m52.008s >>> user 25m23.028s 25m19.488s >>> sys 25m45.466s 18m11.640s >>> zswpin 1911194 3108438 >>> zswpout 6880815 9374628 >>> pgfault 120430166 48976658 >>> pgmajfault 1580674 2327086 >>> >>> >> >> Interesting! We never use a phone to build the Linux kernel, but >> let me see if I can find some other machines to reproduce your data. > > Hi Usama, > > I suspect the regression occurs because you're running an edge case > where the memory cgroup stays nearly full most of the time (this isn't > an inherent issue with large folio swap-in). As a result, swapping in > mTHP quickly triggers a memcg overflow, causing a swap-out. The > next swap-in then recreates the overflow, leading to a repeating > cycle. > Yes, agreed! Looking at the swap counters, I think this is what is going on as well. > We need a way to stop the cup from repeatedly filling to the brim and > overflowing. While not a definitive fix, the following change might help > improve the situation: > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 17af08367c68..f2fa0eeb2d9a 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > > @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio > *folio, struct mm_struct *mm, > memcg = get_mem_cgroup_from_mm(mm); > rcu_read_unlock(); > > - ret = charge_memcg(folio, memcg, gfp); > + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < > MEMCG_CHARGE_BATCH) > + ret = -ENOMEM; > + else > + ret = charge_memcg(folio, memcg, gfp); > > css_put(&memcg->css); > return ret; > } > The diff makes sense to me. Let me test later today and get back to you. Thanks! > Please confirm if it makes the kernel build with memcg limitation > faster. If so, let's > work together to figure out an official patch :-) The above code hasn't consider > the parent memcg's overflow, so not an ideal fix. > >> >>> >>> >>>>> >>>>> My initial thought was this might be because its intel, where you dont have the advantage >>>>> of TLB coalescing, so tested on AMD and ARM, but the regression is there on AMD >>>>> and ARM as well, though a bit less (have added the numbers below). >>>>> >>>>> The numbers show that the zswap activity increases and page faults decrease. >>>>> Overall this does result in sys time increasing and real time slightly increases, >>>>> likely because the cost of increased zswap activity is more than the benefit of >>>>> lower page faults. >>>>> I can see in [3] that pagefaults reduced in zram as well. >>>>> >>>>> Large folio swapin shows good numbers in microbenchmarks that just target reduce page >>>>> faults and sequential swapin only, but not in kernel build test. Is a similar regression >>>>> observed with zram when enabling large folio swapin on kernel build test? Maybe large >>>>> folio swapin makes more sense on workloads where mappings are kept for a longer time? >>>>> >>>> >>>> I suspect this is because mTHP doesn't always benefit workloads >>>> when available memory is quite limited compared to the working set. >>>> In that case, mTHP swap-in might introduce more features that >>>> exacerbate the problem. 
We used to have an extra control "swapin_enabled" >>>> for swap-in, but it never gained much traction: >>>> https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@gmail.com/ >>>> We can reconsider whether to include the knob, but if it's better >>>> to disable mTHP entirely for these cases, we can still adhere to >>>> the policy of "enabled". >>>> >>> Yes I think this makes sense to have. The only thing is, its too many knobs! >>> I personally think its already difficult to decide upto which mTHP size we >>> should enable (and I think this changes per workload). But if we add swapin_enabled >>> on top of that it can make things more difficult. >>> >>>> Using large block compression and decompression in zRAM will >>>> significantly reduce CPU usage, likely making the issue unnoticeable. >>>> However, the default minimum size for large block support is currently >>>> set to 64KB(ZSMALLOC_MULTI_PAGES_ORDER = 4). >>>> >>> >>> I saw that the patch was sent in March, and there werent any updates after? >>> Maybe I can try and cherry-pick that and see if we can develop large >>> granularity compression for zswap. >> >> will provide an updated version next week. >> >>> >>>>> >>>>> Kernel build numbers in cgroup with memory.max=4G to trigger zswap >>>>> Command for AMD: make defconfig; time make -j$(nproc) bzImage >>>>> Command for ARM: make defconfig; time make -j$(nproc) Image >>>>> >>>>> >>>>> AMD 16K+32K THP=always >>>>> metric mm-unstable mm-unstable + large folio zswapin series >>>>> real 1m23.038s 1m23.050s >>>>> user 53m57.210s 53m53.437s >>>>> sys 7m24.592s 7m48.843s >>>>> zswpin 612070 999244 >>>>> zswpout 2226403 2347979 >>>>> pgfault 20667366 20481728 >>>>> pgmajfault 385887 269117 >>>>> >>>>> AMD 16K+32K+64K THP=always >>>>> metric mm-unstable mm-unstable + large folio zswapin series >>>>> real 1m22.975s 1m23.266s >>>>> user 53m51.302s 53m51.069s >>>>> sys 7m40.168s 7m57.104s >>>>> zswpin 676492 1258573 >>>>> zswpout 2449839 2714767 >>>>> pgfault 17540746 17296555 >>>>> pgmajfault 429629 307495 >>>>> -------------------------- >>>>> ARM 16K+32K THP=always >>>>> metric mm-unstable mm-unstable + large folio zswapin series >>>>> real 0m51.168s 0m52.086s >>>>> user 25m14.715s 25m15.765s >>>>> sys 17m18.856s 18m8.031s >>>>> zswpin 3904129 7339245 >>>>> zswpout 11171295 13473461 >>>>> pgfault 37313345 36011338 >>>>> pgmajfault 2726253 1932642 >>>>> >>>>> >>>>> ARM 16K+32K+64K THP=always >>>>> metric mm-unstable mm-unstable + large folio zswapin series >>>>> real 0m52.017s 0m53.828s >>>>> user 25m2.742s 25m0.046s >>>>> sys 18m24.525s 20m26.207s >>>>> zswpin 4853571 8908664 >>>>> zswpout 12297199 15768764 >>>>> pgfault 32158152 30425519 >>>>> pgmajfault 3320717 2237015 >>>>> >>>>> >>>>> Thanks! >>>>> Usama >>>>> >>>>> >>>>> [1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/ >>>>> [2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/ >>>>> [3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ >>>>> >>>>>> >>>>>>>> >>>>>>>> The time measured was pretty consistent between runs (~1-2% variation). >>>>>>>> There is 36% improvement in zswapin time with 1M folios. The percentage >>>>>>>> improvement is likely to be more if the memcmp is removed. 
>>>>>>>> >>>>>>>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c >>>>>>>> index 40de679248b8..77068c577c86 100644 >>>>>>>> --- a/tools/testing/selftests/cgroup/test_zswap.c >>>>>>>> +++ b/tools/testing/selftests/cgroup/test_zswap.c >>>>>>>> @@ -9,6 +9,8 @@ >>>>>>>> #include <string.h> >>>>>>>> #include <sys/wait.h> >>>>>>>> #include <sys/mman.h> >>>>>>>> +#include <sys/time.h> >>>>>>>> +#include <malloc.h> >>>>>>>> >>>>>>>> #include "../kselftest.h" >>>>>>>> #include "cgroup_util.h" >>>>>>>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root) >>>>>>>> return test_zswap_writeback(root, false); >>>>>>>> } >>>>>>>> >>>>>>>> +static int zswapin_perf(const char *cgroup, void *arg) >>>>>>>> +{ >>>>>>>> + long pagesize = sysconf(_SC_PAGESIZE); >>>>>>>> + size_t memsize = MB(1*1024); >>>>>>>> + char buf[pagesize]; >>>>>>>> + int ret = -1; >>>>>>>> + char *mem; >>>>>>>> + struct timeval start, end; >>>>>>>> + >>>>>>>> + mem = (char *)memalign(2*1024*1024, memsize); >>>>>>>> + if (!mem) >>>>>>>> + return ret; >>>>>>>> + >>>>>>>> + /* >>>>>>>> + * Fill half of each page with increasing data, and keep other >>>>>>>> + * half empty, this will result in data that is still compressible >>>>>>>> + * and ends up in zswap, with material zswap usage. >>>>>>>> + */ >>>>>>>> + for (int i = 0; i < pagesize; i++) >>>>>>>> + buf[i] = i < pagesize/2 ? (char) i : 0; >>>>>>>> + >>>>>>>> + for (int i = 0; i < memsize; i += pagesize) >>>>>>>> + memcpy(&mem[i], buf, pagesize); >>>>>>>> + >>>>>>>> + /* Try and reclaim allocated memory */ >>>>>>>> + if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) { >>>>>>>> + ksft_print_msg("Failed to reclaim all of the requested memory\n"); >>>>>>>> + goto out; >>>>>>>> + } >>>>>>>> + >>>>>>>> + gettimeofday(&start, NULL); >>>>>>>> + /* zswpin */ >>>>>>>> + for (int i = 0; i < memsize; i += pagesize) { >>>>>>>> + if (memcmp(&mem[i], buf, pagesize)) { >>>>>>>> + ksft_print_msg("invalid memory\n"); >>>>>>>> + goto out; >>>>>>>> + } >>>>>>>> + } >>>>>>>> + gettimeofday(&end, NULL); >>>>>>>> + printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000); >>>>>>>> + ret = 0; >>>>>>>> +out: >>>>>>>> + free(mem); >>>>>>>> + return ret; >>>>>>>> +} >>>>>>>> + >>>>>>>> +static int test_zswapin_perf(const char *root) >>>>>>>> +{ >>>>>>>> + int ret = KSFT_FAIL; >>>>>>>> + char *test_group; >>>>>>>> + >>>>>>>> + test_group = cg_name(root, "zswapin_perf_test"); >>>>>>>> + if (!test_group) >>>>>>>> + goto out; >>>>>>>> + if (cg_create(test_group)) >>>>>>>> + goto out; >>>>>>>> + >>>>>>>> + if (cg_run(test_group, zswapin_perf, NULL)) >>>>>>>> + goto out; >>>>>>>> + >>>>>>>> + ret = KSFT_PASS; >>>>>>>> +out: >>>>>>>> + cg_destroy(test_group); >>>>>>>> + free(test_group); >>>>>>>> + return ret; >>>>>>>> +} >>>>>>>> + >>>>>>>> /* >>>>>>>> * When trying to store a memcg page in zswap, if the memcg hits its memory >>>>>>>> * limit in zswap, writeback should affect only the zswapped pages of that >>>>>>>> @@ -584,6 +654,7 @@ struct zswap_test { >>>>>>>> T(test_zswapin), >>>>>>>> T(test_zswap_writeback_enabled), >>>>>>>> T(test_zswap_writeback_disabled), >>>>>>>> + T(test_zswapin_perf), >>>>>>>> T(test_no_kmem_bypass), >>>>>>>> T(test_no_invasive_cgroup_shrink), >>>>>>>> }; >>>>>>>> >>>>>>>> [1] https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@intel.com/ >>>>>>>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@oppo.com/ 
>>>>>>>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#u >>>>>>>> [4] https://lwn.net/Articles/955575/ >>>>>>>> >>>>>>>> Usama Arif (4): >>>>>>>> mm/zswap: skip swapcache for swapping in zswap pages >>>>>>>> mm/zswap: modify zswap_decompress to accept page instead of folio >>>>>>>> mm/zswap: add support for large folio zswapin >>>>>>>> mm/zswap: count successful large folio zswap loads >>>>>>>> >>>>>>>> Documentation/admin-guide/mm/transhuge.rst | 3 + >>>>>>>> include/linux/huge_mm.h | 1 + >>>>>>>> include/linux/zswap.h | 6 ++ >>>>>>>> mm/huge_memory.c | 3 + >>>>>>>> mm/memory.c | 16 +-- >>>>>>>> mm/page_io.c | 2 +- >>>>>>>> mm/zswap.c | 120 ++++++++++++++------- >>>>>>>> 7 files changed, 99 insertions(+), 52 deletions(-) >>>>>>>> >>>>>>>> -- >>>>>>>> 2.43.5 >>>>>>>> >>>>>>> >>>> >> > > Thanks > Barry
On 23/10/2024 11:48, Usama Arif wrote: > > > On 23/10/2024 11:26, Barry Song wrote: >> On Wed, Oct 23, 2024 at 11:07 AM Barry Song <21cnbao@gmail.com> wrote: >>> >>> On Wed, Oct 23, 2024 at 10:17 AM Usama Arif <usamaarif642@gmail.com> wrote: >>>> >>>> >>>> >>>> On 22/10/2024 21:46, Barry Song wrote: >>>>> On Wed, Oct 23, 2024 at 4:26 AM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 21/10/2024 11:40, Usama Arif wrote: >>>>>>> >>>>>>> >>>>>>> On 21/10/2024 06:09, Barry Song wrote: >>>>>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote: >>>>>>>>> >>>>>>>>> After large folio zswapout support added in [1], this patch adds >>>>>>>>> support for zswapin of large folios to bring it on par with zram. >>>>>>>>> This series makes sure that the benefits of large folios (fewer >>>>>>>>> page faults, batched PTE and rmap manipulation, reduced lru list, >>>>>>>>> TLB coalescing (for arm64 and amd)) are not lost at swap out when >>>>>>>>> using zswap. >>>>>>>>> >>>>>>>>> It builds on top of [2] which added large folio swapin support for >>>>>>>>> zram and provides the same level of large folio swapin support as >>>>>>>>> zram, i.e. only supporting swap count == 1. >>>>>>>>> >>>>>>>>> Patch 1 skips swapcache for swapping in zswap pages, this should improve >>>>>>>>> no readahead swapin performance [3], and also allows us to build on large >>>>>>>>> folio swapin support added in [2], hence is a prerequisite for patch 3. >>>>>>>>> >>>>>>>>> Patch 3 adds support for large folio zswapin. This patch does not add >>>>>>>>> support for hybrid backends (i.e. folios partly present swap and zswap). >>>>>>>>> >>>>>>>>> The main performance benefit comes from maintaining large folios *after* >>>>>>>>> swapin, large folio performance improvements have been mentioned in previous >>>>>>>>> series posted on it [2],[4], so have not added those. Below is a simple >>>>>>>>> microbenchmark to measure the time needed *for* zswpin of 1G memory (along >>>>>>>>> with memory integrity check). >>>>>>>>> >>>>>>>>> | no mTHP (ms) | 1M mTHP enabled (ms) >>>>>>>>> Base kernel | 1165 | 1163 >>>>>>>>> Kernel with mTHP zswpin series | 1203 | 738 >>>>>>>> >>>>>>>> Hi Usama, >>>>>>>> Do you know where this minor regression for non-mTHP comes from? >>>>>>>> As you even have skipped swapcache for small folios in zswap in patch1, >>>>>>>> that part should have some gain? is it because of zswap_present_test()? >>>>>>>> >>>>>>> >>>>>>> Hi Barry, >>>>>>> >>>>>>> The microbenchmark does a sequential read of 1G of memory, so it probably >>>>>>> isnt very representative of real world usecases. This also means that >>>>>>> swap_vma_readahead is able to readahead accurately all pages in its window. >>>>>>> With this patch series, if doing 4K swapin, you get 1G/4K calls of fast >>>>>>> do_swap_page. Without this patch, you get 1G/(4K*readahead window) of slow >>>>>>> do_swap_page calls. I had added some prints and I was seeing 8 pages being >>>>>>> readahead in 1 do_swap_page. The larger number of calls causes the slight >>>>>>> regression (eventhough they are quite fast). I think in a realistic scenario, >>>>>>> where readahead window wont be as large, there wont be a regression. >>>>>>> The cost of zswap_present_test in the whole call stack of swapping page is >>>>>>> very low and I think can be ignored. 
>>>>>>> >>>>>>> I think the more interesting thing is what Kanchana pointed out in >>>>>>> https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/ >>>>>>> I am curious, did you see this when testing large folio swapin and compression >>>>>>> at 4K granuality? Its looks like swap thrashing so I think it would be common >>>>>>> between zswap and zram. I dont have larger granuality zswap compression done, >>>>>>> which is why I think there is a regression in time taken. (It could be because >>>>>>> its tested on intel as well). >>>>>>> >>>>>>> Thanks, >>>>>>> Usama >>>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> So I have been doing some benchmarking after Kanchana pointed out a performance >>>>>> regression in [1] of swapping in large folio. I would love to get thoughts from >>>>>> zram folks on this, as thats where large folio swapin was first added [2]. >>>>>> As far as I can see, the current support in zram is doing large folio swapin >>>>>> at 4K granuality. The large granuality compression in [3] which was posted >>>>>> in March is not merged, so I am currently comparing upstream zram with this series. >>>>>> >>>>>> With the microbenchmark below of timing 1G swapin, there was a very large improvement >>>>>> in performance by using this series. I think similar numbers would be seen in zram. >>>>> >>>>> Imagine running several apps on a phone and switching >>>>> between them: A → B → C → D → E … → A → B … The app >>>>> currently on the screen retains its memory, while the ones >>>>> sent to the background are swapped out. When we bring >>>>> those apps back to the foreground, their memory is restored. >>>>> This behavior is quite similar to what you're seeing with >>>>> your microbenchmark. >>>>> >>>> >>>> Hi Barry, >>>> >>>> Thanks for explaining this! Do you know if there is some open source benchmark >>>> we could use to show an improvement in app switching with large folios? >>>> >>> >>> I’m fairly certain the Android team has this benchmark, but it’s not >>> open source. >>> >>> A straightforward way to simulate this is to use a script that >>> cyclically launches multiple applications, such as Chrome, Firefox, >>> Office, PDF, and others. >>> >>> for example: >>> >>> launch chrome; >>> launch firefox; >>> launch youtube; >>> .... >>> launch chrome; >>> launch firefox; >>> .... >>> >>> On Android, we have "Android activity manager 'am' command" to do that. >>> https://gist.github.com/tsohr/5711945 >>> >>> Not quite sure if other windows managers have similar tools. >>> >>>> Also I guess swap thrashing can happen when apps are brought back to foreground? >>>> >>> >>> Typically, the foreground app doesn't experience much swapping, >>> as it is the most recently or frequently used. However, this may >>> not hold for very low-end phones, where memory is significantly >>> less than the app's working set. For instance, we can't expect a >>> good user experience when playing a large game that requires 8GB >>> of memory on a 4GB phone! :-) >>> And for low-end phones, we never even enable mTHP. >>> >>>>>> >>>>>> But when doing kernel build test, Kanchana saw a regression in [1]. I believe >>>>>> its because of swap thrashing (causing large zswap activity), due to larger page swapin. >>>>>> The part of the code that decides large folio swapin is the same between zswap and zram, >>>>>> so I believe this would be observed in zram as well. >>>>> >>>>> Is this an extreme case where the workload's working set far >>>>> exceeds the available memory by memcg limitation? 
I doubt mTHP >>>>> would provide any real benefit from the start if the workload is bound to >>>>> experience swap thrashing. What if we disable mTHP entirely? >>>>> >>>> >>>> I would agree, this is an extreme case. I wanted (z)swap activity to happen so limited >>>> memory.max to 4G. >>>> >>>> mTHP is beneficial in kernel test benchmarking going from no mTHP to 16K: >>>> >>>> ARM make defconfig; time make -j$(nproc) Image, cgroup memory.max=4G >>>> metric no mTHP 16K mTHP=always >>>> real 1m0.613s 0m52.008s >>>> user 25m23.028s 25m19.488s >>>> sys 25m45.466s 18m11.640s >>>> zswpin 1911194 3108438 >>>> zswpout 6880815 9374628 >>>> pgfault 120430166 48976658 >>>> pgmajfault 1580674 2327086 >>>> >>>> >>> >>> Interesting! We never use a phone to build the Linux kernel, but >>> let me see if I can find some other machines to reproduce your data. >> >> Hi Usama, >> >> I suspect the regression occurs because you're running an edge case >> where the memory cgroup stays nearly full most of the time (this isn't >> an inherent issue with large folio swap-in). As a result, swapping in >> mTHP quickly triggers a memcg overflow, causing a swap-out. The >> next swap-in then recreates the overflow, leading to a repeating >> cycle. >> > > Yes, agreed! Looking at the swap counters, I think this is what is going > on as well. > >> We need a way to stop the cup from repeatedly filling to the brim and >> overflowing. While not a definitive fix, the following change might help >> improve the situation: >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> >> index 17af08367c68..f2fa0eeb2d9a 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> >> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio >> *folio, struct mm_struct *mm, >> memcg = get_mem_cgroup_from_mm(mm); >> rcu_read_unlock(); >> >> - ret = charge_memcg(folio, memcg, gfp); >> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < >> MEMCG_CHARGE_BATCH) >> + ret = -ENOMEM; >> + else >> + ret = charge_memcg(folio, memcg, gfp); >> >> css_put(&memcg->css); >> return ret; >> } >> > > The diff makes sense to me. Let me test later today and get back to you. > > Thanks! > >> Please confirm if it makes the kernel build with memcg limitation >> faster. If so, let's >> work together to figure out an official patch :-) The above code hasn't consider >> the parent memcg's overflow, so not an ideal fix. >> Thanks Barry, I think this fixes the regression, and even gives an improvement! 
I think the below might be better to do: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c098fd7f5c5e..0a1ec55cc079 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, memcg = get_mem_cgroup_from_mm(mm); rcu_read_unlock(); - ret = charge_memcg(folio, memcg, gfp); + if (folio_test_large(folio) && + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) + ret = -ENOMEM; + else + ret = charge_memcg(folio, memcg, gfp); css_put(&memcg->css); return ret; AMD 16K+32K THP=always metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix real 1m23.038s 1m23.050s 1m22.704s user 53m57.210s 53m53.437s 53m52.577s sys 7m24.592s 7m48.843s 7m22.519s zswpin 612070 999244 815934 zswpout 2226403 2347979 2054980 pgfault 20667366 20481728 20478690 pgmajfault 385887 269117 309702 AMD 16K+32K+64K THP=always metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix real 1m22.975s 1m23.266s 1m22.549s user 53m51.302s 53m51.069s 53m46.471s sys 7m40.168s 7m57.104s 7m25.012s zswpin 676492 1258573 1225703 zswpout 2449839 2714767 2899178 pgfault 17540746 17296555 17234663 pgmajfault 429629 307495 287859 >>> >>>> >>>> >>>>>> >>>>>> My initial thought was this might be because its intel, where you dont have the advantage >>>>>> of TLB coalescing, so tested on AMD and ARM, but the regression is there on AMD >>>>>> and ARM as well, though a bit less (have added the numbers below). >>>>>> >>>>>> The numbers show that the zswap activity increases and page faults decrease. >>>>>> Overall this does result in sys time increasing and real time slightly increases, >>>>>> likely because the cost of increased zswap activity is more than the benefit of >>>>>> lower page faults. >>>>>> I can see in [3] that pagefaults reduced in zram as well. >>>>>> >>>>>> Large folio swapin shows good numbers in microbenchmarks that just target reduce page >>>>>> faults and sequential swapin only, but not in kernel build test. Is a similar regression >>>>>> observed with zram when enabling large folio swapin on kernel build test? Maybe large >>>>>> folio swapin makes more sense on workloads where mappings are kept for a longer time? >>>>>> >>>>> >>>>> I suspect this is because mTHP doesn't always benefit workloads >>>>> when available memory is quite limited compared to the working set. >>>>> In that case, mTHP swap-in might introduce more features that >>>>> exacerbate the problem. We used to have an extra control "swapin_enabled" >>>>> for swap-in, but it never gained much traction: >>>>> https://lore.kernel.org/linux-mm/20240726094618.401593-5-21cnbao@gmail.com/ >>>>> We can reconsider whether to include the knob, but if it's better >>>>> to disable mTHP entirely for these cases, we can still adhere to >>>>> the policy of "enabled". >>>>> >>>> Yes I think this makes sense to have. The only thing is, its too many knobs! >>>> I personally think its already difficult to decide upto which mTHP size we >>>> should enable (and I think this changes per workload). But if we add swapin_enabled >>>> on top of that it can make things more difficult. >>>> >>>>> Using large block compression and decompression in zRAM will >>>>> significantly reduce CPU usage, likely making the issue unnoticeable. 
>>>>> However, the default minimum size for large block support is currently >>>>> set to 64KB(ZSMALLOC_MULTI_PAGES_ORDER = 4). >>>>> >>>> >>>> I saw that the patch was sent in March, and there werent any updates after? >>>> Maybe I can try and cherry-pick that and see if we can develop large >>>> granularity compression for zswap. >>> >>> will provide an updated version next week. >>> >>>> >>>>>> >>>>>> Kernel build numbers in cgroup with memory.max=4G to trigger zswap >>>>>> Command for AMD: make defconfig; time make -j$(nproc) bzImage >>>>>> Command for ARM: make defconfig; time make -j$(nproc) Image >>>>>> >>>>>> >>>>>> AMD 16K+32K THP=always >>>>>> metric mm-unstable mm-unstable + large folio zswapin series >>>>>> real 1m23.038s 1m23.050s >>>>>> user 53m57.210s 53m53.437s >>>>>> sys 7m24.592s 7m48.843s >>>>>> zswpin 612070 999244 >>>>>> zswpout 2226403 2347979 >>>>>> pgfault 20667366 20481728 >>>>>> pgmajfault 385887 269117 >>>>>> >>>>>> AMD 16K+32K+64K THP=always >>>>>> metric mm-unstable mm-unstable + large folio zswapin series >>>>>> real 1m22.975s 1m23.266s >>>>>> user 53m51.302s 53m51.069s >>>>>> sys 7m40.168s 7m57.104s >>>>>> zswpin 676492 1258573 >>>>>> zswpout 2449839 2714767 >>>>>> pgfault 17540746 17296555 >>>>>> pgmajfault 429629 307495 >>>>>> -------------------------- >>>>>> ARM 16K+32K THP=always >>>>>> metric mm-unstable mm-unstable + large folio zswapin series >>>>>> real 0m51.168s 0m52.086s >>>>>> user 25m14.715s 25m15.765s >>>>>> sys 17m18.856s 18m8.031s >>>>>> zswpin 3904129 7339245 >>>>>> zswpout 11171295 13473461 >>>>>> pgfault 37313345 36011338 >>>>>> pgmajfault 2726253 1932642 >>>>>> >>>>>> >>>>>> ARM 16K+32K+64K THP=always >>>>>> metric mm-unstable mm-unstable + large folio zswapin series >>>>>> real 0m52.017s 0m53.828s >>>>>> user 25m2.742s 25m0.046s >>>>>> sys 18m24.525s 20m26.207s >>>>>> zswpin 4853571 8908664 >>>>>> zswpout 12297199 15768764 >>>>>> pgfault 32158152 30425519 >>>>>> pgmajfault 3320717 2237015 >>>>>> >>>>>> >>>>>> Thanks! >>>>>> Usama >>>>>> >>>>>> >>>>>> [1] https://lore.kernel.org/all/f2f2053f-ec5f-46a4-800d-50a3d2e61bff@gmail.com/ >>>>>> [2] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/ >>>>>> [3] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ >>>>>> >>>>>>> >>>>>>>>> >>>>>>>>> The time measured was pretty consistent between runs (~1-2% variation). >>>>>>>>> There is 36% improvement in zswapin time with 1M folios. The percentage >>>>>>>>> improvement is likely to be more if the memcmp is removed. 
>>>>>>>>> >>>>>>>>> diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c >>>>>>>>> index 40de679248b8..77068c577c86 100644 >>>>>>>>> --- a/tools/testing/selftests/cgroup/test_zswap.c >>>>>>>>> +++ b/tools/testing/selftests/cgroup/test_zswap.c >>>>>>>>> @@ -9,6 +9,8 @@ >>>>>>>>> #include <string.h> >>>>>>>>> #include <sys/wait.h> >>>>>>>>> #include <sys/mman.h> >>>>>>>>> +#include <sys/time.h> >>>>>>>>> +#include <malloc.h> >>>>>>>>> >>>>>>>>> #include "../kselftest.h" >>>>>>>>> #include "cgroup_util.h" >>>>>>>>> @@ -407,6 +409,74 @@ static int test_zswap_writeback_disabled(const char *root) >>>>>>>>> return test_zswap_writeback(root, false); >>>>>>>>> } >>>>>>>>> >>>>>>>>> +static int zswapin_perf(const char *cgroup, void *arg) >>>>>>>>> +{ >>>>>>>>> + long pagesize = sysconf(_SC_PAGESIZE); >>>>>>>>> + size_t memsize = MB(1*1024); >>>>>>>>> + char buf[pagesize]; >>>>>>>>> + int ret = -1; >>>>>>>>> + char *mem; >>>>>>>>> + struct timeval start, end; >>>>>>>>> + >>>>>>>>> + mem = (char *)memalign(2*1024*1024, memsize); >>>>>>>>> + if (!mem) >>>>>>>>> + return ret; >>>>>>>>> + >>>>>>>>> + /* >>>>>>>>> + * Fill half of each page with increasing data, and keep other >>>>>>>>> + * half empty, this will result in data that is still compressible >>>>>>>>> + * and ends up in zswap, with material zswap usage. >>>>>>>>> + */ >>>>>>>>> + for (int i = 0; i < pagesize; i++) >>>>>>>>> + buf[i] = i < pagesize/2 ? (char) i : 0; >>>>>>>>> + >>>>>>>>> + for (int i = 0; i < memsize; i += pagesize) >>>>>>>>> + memcpy(&mem[i], buf, pagesize); >>>>>>>>> + >>>>>>>>> + /* Try and reclaim allocated memory */ >>>>>>>>> + if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) { >>>>>>>>> + ksft_print_msg("Failed to reclaim all of the requested memory\n"); >>>>>>>>> + goto out; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + gettimeofday(&start, NULL); >>>>>>>>> + /* zswpin */ >>>>>>>>> + for (int i = 0; i < memsize; i += pagesize) { >>>>>>>>> + if (memcmp(&mem[i], buf, pagesize)) { >>>>>>>>> + ksft_print_msg("invalid memory\n"); >>>>>>>>> + goto out; >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> + gettimeofday(&end, NULL); >>>>>>>>> + printf ("zswapin took %fms to run.\n", (end.tv_sec - start.tv_sec)*1000 + (double)(end.tv_usec - start.tv_usec) / 1000); >>>>>>>>> + ret = 0; >>>>>>>>> +out: >>>>>>>>> + free(mem); >>>>>>>>> + return ret; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> +static int test_zswapin_perf(const char *root) >>>>>>>>> +{ >>>>>>>>> + int ret = KSFT_FAIL; >>>>>>>>> + char *test_group; >>>>>>>>> + >>>>>>>>> + test_group = cg_name(root, "zswapin_perf_test"); >>>>>>>>> + if (!test_group) >>>>>>>>> + goto out; >>>>>>>>> + if (cg_create(test_group)) >>>>>>>>> + goto out; >>>>>>>>> + >>>>>>>>> + if (cg_run(test_group, zswapin_perf, NULL)) >>>>>>>>> + goto out; >>>>>>>>> + >>>>>>>>> + ret = KSFT_PASS; >>>>>>>>> +out: >>>>>>>>> + cg_destroy(test_group); >>>>>>>>> + free(test_group); >>>>>>>>> + return ret; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> /* >>>>>>>>> * When trying to store a memcg page in zswap, if the memcg hits its memory >>>>>>>>> * limit in zswap, writeback should affect only the zswapped pages of that >>>>>>>>> @@ -584,6 +654,7 @@ struct zswap_test { >>>>>>>>> T(test_zswapin), >>>>>>>>> T(test_zswap_writeback_enabled), >>>>>>>>> T(test_zswap_writeback_disabled), >>>>>>>>> + T(test_zswapin_perf), >>>>>>>>> T(test_no_kmem_bypass), >>>>>>>>> T(test_no_invasive_cgroup_shrink), >>>>>>>>> }; >>>>>>>>> >>>>>>>>> [1] 
https://lore.kernel.org/all/20241001053222.6944-1-kanchana.p.sridhar@intel.com/ >>>>>>>>> [2] https://lore.kernel.org/all/20240821074541.516249-1-hanchuanhua@oppo.com/ >>>>>>>>> [3] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#u >>>>>>>>> [4] https://lwn.net/Articles/955575/ >>>>>>>>> >>>>>>>>> Usama Arif (4): >>>>>>>>> mm/zswap: skip swapcache for swapping in zswap pages >>>>>>>>> mm/zswap: modify zswap_decompress to accept page instead of folio >>>>>>>>> mm/zswap: add support for large folio zswapin >>>>>>>>> mm/zswap: count successful large folio zswap loads >>>>>>>>> >>>>>>>>> Documentation/admin-guide/mm/transhuge.rst | 3 + >>>>>>>>> include/linux/huge_mm.h | 1 + >>>>>>>>> include/linux/zswap.h | 6 ++ >>>>>>>>> mm/huge_memory.c | 3 + >>>>>>>>> mm/memory.c | 16 +-- >>>>>>>>> mm/page_io.c | 2 +- >>>>>>>>> mm/zswap.c | 120 ++++++++++++++------- >>>>>>>>> 7 files changed, 99 insertions(+), 52 deletions(-) >>>>>>>>> >>>>>>>>> -- >>>>>>>>> 2.43.5 >>>>>>>>> >>>>>>>> >>>>> >>> >> >> Thanks >> Barry >
[..] > >> I suspect the regression occurs because you're running an edge case > >> where the memory cgroup stays nearly full most of the time (this isn't > >> an inherent issue with large folio swap-in). As a result, swapping in > >> mTHP quickly triggers a memcg overflow, causing a swap-out. The > >> next swap-in then recreates the overflow, leading to a repeating > >> cycle. > >> > > > > Yes, agreed! Looking at the swap counters, I think this is what is going > > on as well. > > > >> We need a way to stop the cup from repeatedly filling to the brim and > >> overflowing. While not a definitive fix, the following change might help > >> improve the situation: > >> > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >> > >> index 17af08367c68..f2fa0eeb2d9a 100644 > >> --- a/mm/memcontrol.c > >> +++ b/mm/memcontrol.c > >> > >> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio > >> *folio, struct mm_struct *mm, > >> memcg = get_mem_cgroup_from_mm(mm); > >> rcu_read_unlock(); > >> > >> - ret = charge_memcg(folio, memcg, gfp); > >> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < > >> MEMCG_CHARGE_BATCH) > >> + ret = -ENOMEM; > >> + else > >> + ret = charge_memcg(folio, memcg, gfp); > >> > >> css_put(&memcg->css); > >> return ret; > >> } > >> > > > > The diff makes sense to me. Let me test later today and get back to you. > > > > Thanks! > > > >> Please confirm if it makes the kernel build with memcg limitation > >> faster. If so, let's > >> work together to figure out an official patch :-) The above code hasn't consider > >> the parent memcg's overflow, so not an ideal fix. > >> > > Thanks Barry, I think this fixes the regression, and even gives an improvement! > I think the below might be better to do: > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c098fd7f5c5e..0a1ec55cc079 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > memcg = get_mem_cgroup_from_mm(mm); > rcu_read_unlock(); > > - ret = charge_memcg(folio, memcg, gfp); > + if (folio_test_large(folio) && > + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > + ret = -ENOMEM; > + else > + ret = charge_memcg(folio, memcg, gfp); > > css_put(&memcg->css); > return ret; > > > AMD 16K+32K THP=always > metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > real 1m23.038s 1m23.050s 1m22.704s > user 53m57.210s 53m53.437s 53m52.577s > sys 7m24.592s 7m48.843s 7m22.519s > zswpin 612070 999244 815934 > zswpout 2226403 2347979 2054980 > pgfault 20667366 20481728 20478690 > pgmajfault 385887 269117 309702 > > AMD 16K+32K+64K THP=always > metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > real 1m22.975s 1m23.266s 1m22.549s > user 53m51.302s 53m51.069s 53m46.471s > sys 7m40.168s 7m57.104s 7m25.012s > zswpin 676492 1258573 1225703 > zswpout 2449839 2714767 2899178 > pgfault 17540746 17296555 17234663 > pgmajfault 429629 307495 287859 > Thanks Usama and Barry for looking into this. It seems like this would fix a regression with large folio swapin regardless of zswap. Can the same result be reproduced on zram without this series?
On 23/10/2024 19:02, Yosry Ahmed wrote: > [..] >>>> I suspect the regression occurs because you're running an edge case >>>> where the memory cgroup stays nearly full most of the time (this isn't >>>> an inherent issue with large folio swap-in). As a result, swapping in >>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The >>>> next swap-in then recreates the overflow, leading to a repeating >>>> cycle. >>>> >>> >>> Yes, agreed! Looking at the swap counters, I think this is what is going >>> on as well. >>> >>>> We need a way to stop the cup from repeatedly filling to the brim and >>>> overflowing. While not a definitive fix, the following change might help >>>> improve the situation: >>>> >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>>> >>>> index 17af08367c68..f2fa0eeb2d9a 100644 >>>> --- a/mm/memcontrol.c >>>> +++ b/mm/memcontrol.c >>>> >>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio >>>> *folio, struct mm_struct *mm, >>>> memcg = get_mem_cgroup_from_mm(mm); >>>> rcu_read_unlock(); >>>> >>>> - ret = charge_memcg(folio, memcg, gfp); >>>> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < >>>> MEMCG_CHARGE_BATCH) >>>> + ret = -ENOMEM; >>>> + else >>>> + ret = charge_memcg(folio, memcg, gfp); >>>> >>>> css_put(&memcg->css); >>>> return ret; >>>> } >>>> >>> >>> The diff makes sense to me. Let me test later today and get back to you. >>> >>> Thanks! >>> >>>> Please confirm if it makes the kernel build with memcg limitation >>>> faster. If so, let's >>>> work together to figure out an official patch :-) The above code hasn't consider >>>> the parent memcg's overflow, so not an ideal fix. >>>> >> >> Thanks Barry, I think this fixes the regression, and even gives an improvement! >> I think the below might be better to do: >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index c098fd7f5c5e..0a1ec55cc079 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, >> memcg = get_mem_cgroup_from_mm(mm); >> rcu_read_unlock(); >> >> - ret = charge_memcg(folio, memcg, gfp); >> + if (folio_test_large(folio) && >> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) >> + ret = -ENOMEM; >> + else >> + ret = charge_memcg(folio, memcg, gfp); >> >> css_put(&memcg->css); >> return ret; >> >> >> AMD 16K+32K THP=always >> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix >> real 1m23.038s 1m23.050s 1m22.704s >> user 53m57.210s 53m53.437s 53m52.577s >> sys 7m24.592s 7m48.843s 7m22.519s >> zswpin 612070 999244 815934 >> zswpout 2226403 2347979 2054980 >> pgfault 20667366 20481728 20478690 >> pgmajfault 385887 269117 309702 >> >> AMD 16K+32K+64K THP=always >> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix >> real 1m22.975s 1m23.266s 1m22.549s >> user 53m51.302s 53m51.069s 53m46.471s >> sys 7m40.168s 7m57.104s 7m25.012s >> zswpin 676492 1258573 1225703 >> zswpout 2449839 2714767 2899178 >> pgfault 17540746 17296555 17234663 >> pgmajfault 429629 307495 287859 >> > > Thanks Usama and Barry for looking into this. It seems like this would > fix a regression with large folio swapin regardless of zswap. Can the > same result be reproduced on zram without this series? Yes, its a regression in large folio swapin support regardless of zswap/zram. 
Need to do 3 tests, one with probably the below diff to remove large folio support,
one with current upstream and one with upstream + swap thrashing fix.

We only use zswap and don't have a zram setup (and I am a bit lazy to create one :)).
Any zram volunteers to try this?

diff --git a/mm/memory.c b/mm/memory.c
index fecdd044bc0b..62f6b087beb3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4124,6 +4124,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	gfp_t gfp;
 	int order;
 
+	goto fallback;
+
 	/*
 	 * If uffd is active for the vma we need per-page fault fidelity to
 	 * maintain the uffd semantics.
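(A minimal zram-backed swap setup for repeating the three builds could look like the sketch below; the device size and compressor chosen here are assumptions, not something specified in this thread.)

  # create a zram device and use it as the only swap backend
  modprobe zram num_devices=1
  echo lz4 > /sys/block/zram0/comp_algorithm   # optional, pick any built-in compressor
  echo 8G > /sys/block/zram0/disksize
  mkswap /dev/zram0
  swapon -p 100 /dev/zram0
  # keep zswap disabled for the whole boot; as noted later in the thread,
  # enabling it even once turns off mTHP swap-in for the rest of the boot
  cat /sys/module/zswap/parameters/enabled     # should report N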
On Thu, Oct 24, 2024 at 7:31 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 23/10/2024 19:02, Yosry Ahmed wrote: > > [..] > >>>> I suspect the regression occurs because you're running an edge case > >>>> where the memory cgroup stays nearly full most of the time (this isn't > >>>> an inherent issue with large folio swap-in). As a result, swapping in > >>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The > >>>> next swap-in then recreates the overflow, leading to a repeating > >>>> cycle. > >>>> > >>> > >>> Yes, agreed! Looking at the swap counters, I think this is what is going > >>> on as well. > >>> > >>>> We need a way to stop the cup from repeatedly filling to the brim and > >>>> overflowing. While not a definitive fix, the following change might help > >>>> improve the situation: > >>>> > >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >>>> > >>>> index 17af08367c68..f2fa0eeb2d9a 100644 > >>>> --- a/mm/memcontrol.c > >>>> +++ b/mm/memcontrol.c > >>>> > >>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio > >>>> *folio, struct mm_struct *mm, > >>>> memcg = get_mem_cgroup_from_mm(mm); > >>>> rcu_read_unlock(); > >>>> > >>>> - ret = charge_memcg(folio, memcg, gfp); > >>>> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < > >>>> MEMCG_CHARGE_BATCH) > >>>> + ret = -ENOMEM; > >>>> + else > >>>> + ret = charge_memcg(folio, memcg, gfp); > >>>> > >>>> css_put(&memcg->css); > >>>> return ret; > >>>> } > >>>> > >>> > >>> The diff makes sense to me. Let me test later today and get back to you. > >>> > >>> Thanks! > >>> > >>>> Please confirm if it makes the kernel build with memcg limitation > >>>> faster. If so, let's > >>>> work together to figure out an official patch :-) The above code hasn't consider > >>>> the parent memcg's overflow, so not an ideal fix. > >>>> > >> > >> Thanks Barry, I think this fixes the regression, and even gives an improvement! > >> I think the below might be better to do: > >> > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >> index c098fd7f5c5e..0a1ec55cc079 100644 > >> --- a/mm/memcontrol.c > >> +++ b/mm/memcontrol.c > >> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > >> memcg = get_mem_cgroup_from_mm(mm); > >> rcu_read_unlock(); > >> > >> - ret = charge_memcg(folio, memcg, gfp); > >> + if (folio_test_large(folio) && > >> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > >> + ret = -ENOMEM; > >> + else > >> + ret = charge_memcg(folio, memcg, gfp); > >> > >> css_put(&memcg->css); > >> return ret; > >> > >> > >> AMD 16K+32K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > >> real 1m23.038s 1m23.050s 1m22.704s > >> user 53m57.210s 53m53.437s 53m52.577s > >> sys 7m24.592s 7m48.843s 7m22.519s > >> zswpin 612070 999244 815934 > >> zswpout 2226403 2347979 2054980 > >> pgfault 20667366 20481728 20478690 > >> pgmajfault 385887 269117 309702 > >> > >> AMD 16K+32K+64K THP=always > >> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > >> real 1m22.975s 1m23.266s 1m22.549s > >> user 53m51.302s 53m51.069s 53m46.471s > >> sys 7m40.168s 7m57.104s 7m25.012s > >> zswpin 676492 1258573 1225703 > >> zswpout 2449839 2714767 2899178 > >> pgfault 17540746 17296555 17234663 > >> pgmajfault 429629 307495 287859 > >> > > > > Thanks Usama and Barry for looking into this. 
It seems like this would > > fix a regression with large folio swapin regardless of zswap. Can the > > same result be reproduced on zram without this series? > > > Yes, its a regression in large folio swapin support regardless of zswap/zram. > > Need to do 3 tests, one with probably the below diff to remove large folio support, > one with current upstream and one with upstream + swap thrashing fix. > > We only use zswap and dont have a zram setup (and I am a bit lazy to create one :)). > Any zram volunteers to try this? Hi Usama, I tried a quick experiment: echo 1 > /sys/module/zswap/parameters/enabled echo 0 > /sys/module/zswap/parameters/enabled This was to test the zRAM scenario. Enabling zswap even once disables mTHP swap-in. :) I noticed a similar regression with zRAM alone, but the change resolved the issue and even sped up the kernel build compared to the setup without mTHP swap-in. However, I’m still working on a proper patch to address this. The current approach: mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)) isn’t sufficient, as it doesn’t cover cases where group A contains group B, and we’re operating within group B. The problem occurs not at the boundary of group B but at the boundary of group A. I believe there’s still room for improvement. For example, if a 64KB charge attempt fails, there’s no need to waste time trying 32KB or 16KB. We can directly fall back to 4KB, as 32KB and 16KB will also fail based on our margin detection logic. > > diff --git a/mm/memory.c b/mm/memory.c > index fecdd044bc0b..62f6b087beb3 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4124,6 +4124,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > gfp_t gfp; > int order; > > + goto fallback; > + > /* > * If uffd is active for the vma we need per-page fault fidelity to > * maintain the uffd semantics. Thanks Barry
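(For reference, the workload behind all of the tables in this thread is the kernel build quoted earlier, run inside a cgroup capped with memory.max=4G; a minimal cgroup v2 reproduction might look like this sketch, with the cgroup name being arbitrary.)

  mkdir /sys/fs/cgroup/buildtest
  echo 4G > /sys/fs/cgroup/buildtest/memory.max
  echo $$ > /sys/fs/cgroup/buildtest/cgroup.procs   # move the current shell into it
  make defconfig
  time make -j$(nproc) bzImage                      # Image on arm64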
On 23/10/2024 19:52, Barry Song wrote: > On Thu, Oct 24, 2024 at 7:31 AM Usama Arif <usamaarif642@gmail.com> wrote: >> >> >> >> On 23/10/2024 19:02, Yosry Ahmed wrote: >>> [..] >>>>>> I suspect the regression occurs because you're running an edge case >>>>>> where the memory cgroup stays nearly full most of the time (this isn't >>>>>> an inherent issue with large folio swap-in). As a result, swapping in >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The >>>>>> next swap-in then recreates the overflow, leading to a repeating >>>>>> cycle. >>>>>> >>>>> >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going >>>>> on as well. >>>>> >>>>>> We need a way to stop the cup from repeatedly filling to the brim and >>>>>> overflowing. While not a definitive fix, the following change might help >>>>>> improve the situation: >>>>>> >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>>>>> >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644 >>>>>> --- a/mm/memcontrol.c >>>>>> +++ b/mm/memcontrol.c >>>>>> >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio >>>>>> *folio, struct mm_struct *mm, >>>>>> memcg = get_mem_cgroup_from_mm(mm); >>>>>> rcu_read_unlock(); >>>>>> >>>>>> - ret = charge_memcg(folio, memcg, gfp); >>>>>> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < >>>>>> MEMCG_CHARGE_BATCH) >>>>>> + ret = -ENOMEM; >>>>>> + else >>>>>> + ret = charge_memcg(folio, memcg, gfp); >>>>>> >>>>>> css_put(&memcg->css); >>>>>> return ret; >>>>>> } >>>>>> >>>>> >>>>> The diff makes sense to me. Let me test later today and get back to you. >>>>> >>>>> Thanks! >>>>> >>>>>> Please confirm if it makes the kernel build with memcg limitation >>>>>> faster. If so, let's >>>>>> work together to figure out an official patch :-) The above code hasn't consider >>>>>> the parent memcg's overflow, so not an ideal fix. >>>>>> >>>> >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement! 
>>>> I think the below might be better to do: >>>> >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>>> index c098fd7f5c5e..0a1ec55cc079 100644 >>>> --- a/mm/memcontrol.c >>>> +++ b/mm/memcontrol.c >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, >>>> memcg = get_mem_cgroup_from_mm(mm); >>>> rcu_read_unlock(); >>>> >>>> - ret = charge_memcg(folio, memcg, gfp); >>>> + if (folio_test_large(folio) && >>>> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) >>>> + ret = -ENOMEM; >>>> + else >>>> + ret = charge_memcg(folio, memcg, gfp); >>>> >>>> css_put(&memcg->css); >>>> return ret; >>>> >>>> >>>> AMD 16K+32K THP=always >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix >>>> real 1m23.038s 1m23.050s 1m22.704s >>>> user 53m57.210s 53m53.437s 53m52.577s >>>> sys 7m24.592s 7m48.843s 7m22.519s >>>> zswpin 612070 999244 815934 >>>> zswpout 2226403 2347979 2054980 >>>> pgfault 20667366 20481728 20478690 >>>> pgmajfault 385887 269117 309702 >>>> >>>> AMD 16K+32K+64K THP=always >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix >>>> real 1m22.975s 1m23.266s 1m22.549s >>>> user 53m51.302s 53m51.069s 53m46.471s >>>> sys 7m40.168s 7m57.104s 7m25.012s >>>> zswpin 676492 1258573 1225703 >>>> zswpout 2449839 2714767 2899178 >>>> pgfault 17540746 17296555 17234663 >>>> pgmajfault 429629 307495 287859 >>>> >>> >>> Thanks Usama and Barry for looking into this. It seems like this would >>> fix a regression with large folio swapin regardless of zswap. Can the >>> same result be reproduced on zram without this series? >> >> >> Yes, its a regression in large folio swapin support regardless of zswap/zram. >> >> Need to do 3 tests, one with probably the below diff to remove large folio support, >> one with current upstream and one with upstream + swap thrashing fix. >> >> We only use zswap and dont have a zram setup (and I am a bit lazy to create one :)). >> Any zram volunteers to try this? > > Hi Usama, > > I tried a quick experiment: > > echo 1 > /sys/module/zswap/parameters/enabled > echo 0 > /sys/module/zswap/parameters/enabled > > This was to test the zRAM scenario. Enabling zswap even > once disables mTHP swap-in. :) > > I noticed a similar regression with zRAM alone, but the change resolved > the issue and even sped up the kernel build compared to the setup without > mTHP swap-in. Thanks for trying, this is amazing! > > However, I’m still working on a proper patch to address this. The current > approach: > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)) > > isn’t sufficient, as it doesn’t cover cases where group A contains group B, and > we’re operating within group B. The problem occurs not at the boundary of > group B but at the boundary of group A. I am not sure I completely followed this. As MEMCG_CHARGE_BATCH=64, if we are trying to swapin a 16kB page, we basically check if atleast 64/4 = 16 folios can be charged to cgroup, which is reasonable. If we try to swapin a 1M folio, we just check if we can charge atleast 1 folio. Are you saying that checking just 1 folio is not enough in this case and can still cause thrashing, i.e we should check more? 
If we want to maintain consistency for all folios another option is
mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH * folio_nr_pages(folio)
but I think this is too extreme, we would be checking if 64M can be charged to
cgroup just to swapin 1M.

>
> I believe there's still room for improvement. For example, if a 64KB charge
> attempt fails, there's no need to waste time trying 32KB or 16KB. We can
> directly fall back to 4KB, as 32KB and 16KB will also fail based on our
> margin detection logic.
>

Yes that makes sense. Would something like below work to fix that:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c098fd7f5c5e..0a1ec55cc079 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 	memcg = get_mem_cgroup_from_mm(mm);
 	rcu_read_unlock();
 
-	ret = charge_memcg(folio, memcg, gfp);
+	if (folio_test_large(folio) &&
+	    mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
+		ret = -ENOMEM;
+	else
+		ret = charge_memcg(folio, memcg, gfp);
 
 	css_put(&memcg->css);
 	return ret;

diff --git a/mm/memory.c b/mm/memory.c
index fecdd044bc0b..b6ce6605dc63 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4123,6 +4123,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	pte_t *pte;
 	gfp_t gfp;
 	int order;
+	int ret;
 
 	/*
 	 * If uffd is active for the vma we need per-page fault fidelity to
@@ -4170,9 +4171,13 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr, true);
 		if (folio) {
-			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
-							    gfp, entry))
+			ret = mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry);
+			if (!ret) {
 				return folio;
+			} else if (ret == -ENOMEM) {
+				folio_put(folio);
+				goto fallback;
+			}
 			folio_put(folio);
 		}
 		order = next_order(&orders, order);
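(The zswpin/zswpout/pgfault/pgmajfault columns in the tables above are standard /proc/vmstat counters; one way to capture per-run deltas when re-testing a fix like this is a small sketch along these lines.)

  grep -E '^(zswpin|zswpout|pgfault|pgmajfault) ' /proc/vmstat > before
  make defconfig && time make -j$(nproc) bzImage
  grep -E '^(zswpin|zswpout|pgfault|pgmajfault) ' /proc/vmstat > after
  # per-run numbers are the differences between the two snapshots
  paste before after | awk '{print $1, $4 - $2}'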
On Thu, Oct 24, 2024 at 8:47 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 23/10/2024 19:52, Barry Song wrote: > > On Thu, Oct 24, 2024 at 7:31 AM Usama Arif <usamaarif642@gmail.com> wrote: > >> > >> > >> > >> On 23/10/2024 19:02, Yosry Ahmed wrote: > >>> [..] > >>>>>> I suspect the regression occurs because you're running an edge case > >>>>>> where the memory cgroup stays nearly full most of the time (this isn't > >>>>>> an inherent issue with large folio swap-in). As a result, swapping in > >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The > >>>>>> next swap-in then recreates the overflow, leading to a repeating > >>>>>> cycle. > >>>>>> > >>>>> > >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going > >>>>> on as well. > >>>>> > >>>>>> We need a way to stop the cup from repeatedly filling to the brim and > >>>>>> overflowing. While not a definitive fix, the following change might help > >>>>>> improve the situation: > >>>>>> > >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >>>>>> > >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644 > >>>>>> --- a/mm/memcontrol.c > >>>>>> +++ b/mm/memcontrol.c > >>>>>> > >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio > >>>>>> *folio, struct mm_struct *mm, > >>>>>> memcg = get_mem_cgroup_from_mm(mm); > >>>>>> rcu_read_unlock(); > >>>>>> > >>>>>> - ret = charge_memcg(folio, memcg, gfp); > >>>>>> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < > >>>>>> MEMCG_CHARGE_BATCH) > >>>>>> + ret = -ENOMEM; > >>>>>> + else > >>>>>> + ret = charge_memcg(folio, memcg, gfp); > >>>>>> > >>>>>> css_put(&memcg->css); > >>>>>> return ret; > >>>>>> } > >>>>>> > >>>>> > >>>>> The diff makes sense to me. Let me test later today and get back to you. > >>>>> > >>>>> Thanks! > >>>>> > >>>>>> Please confirm if it makes the kernel build with memcg limitation > >>>>>> faster. If so, let's > >>>>>> work together to figure out an official patch :-) The above code hasn't consider > >>>>>> the parent memcg's overflow, so not an ideal fix. > >>>>>> > >>>> > >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement! 
> >>>> I think the below might be better to do: > >>>> > >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >>>> index c098fd7f5c5e..0a1ec55cc079 100644 > >>>> --- a/mm/memcontrol.c > >>>> +++ b/mm/memcontrol.c > >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > >>>> memcg = get_mem_cgroup_from_mm(mm); > >>>> rcu_read_unlock(); > >>>> > >>>> - ret = charge_memcg(folio, memcg, gfp); > >>>> + if (folio_test_large(folio) && > >>>> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > >>>> + ret = -ENOMEM; > >>>> + else > >>>> + ret = charge_memcg(folio, memcg, gfp); > >>>> > >>>> css_put(&memcg->css); > >>>> return ret; > >>>> > >>>> > >>>> AMD 16K+32K THP=always > >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > >>>> real 1m23.038s 1m23.050s 1m22.704s > >>>> user 53m57.210s 53m53.437s 53m52.577s > >>>> sys 7m24.592s 7m48.843s 7m22.519s > >>>> zswpin 612070 999244 815934 > >>>> zswpout 2226403 2347979 2054980 > >>>> pgfault 20667366 20481728 20478690 > >>>> pgmajfault 385887 269117 309702 > >>>> > >>>> AMD 16K+32K+64K THP=always > >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > >>>> real 1m22.975s 1m23.266s 1m22.549s > >>>> user 53m51.302s 53m51.069s 53m46.471s > >>>> sys 7m40.168s 7m57.104s 7m25.012s > >>>> zswpin 676492 1258573 1225703 > >>>> zswpout 2449839 2714767 2899178 > >>>> pgfault 17540746 17296555 17234663 > >>>> pgmajfault 429629 307495 287859 > >>>> > >>> > >>> Thanks Usama and Barry for looking into this. It seems like this would > >>> fix a regression with large folio swapin regardless of zswap. Can the > >>> same result be reproduced on zram without this series? > >> > >> > >> Yes, its a regression in large folio swapin support regardless of zswap/zram. > >> > >> Need to do 3 tests, one with probably the below diff to remove large folio support, > >> one with current upstream and one with upstream + swap thrashing fix. > >> > >> We only use zswap and dont have a zram setup (and I am a bit lazy to create one :)). > >> Any zram volunteers to try this? > > > > Hi Usama, > > > > I tried a quick experiment: > > > > echo 1 > /sys/module/zswap/parameters/enabled > > echo 0 > /sys/module/zswap/parameters/enabled > > > > This was to test the zRAM scenario. Enabling zswap even > > once disables mTHP swap-in. :) > > > > I noticed a similar regression with zRAM alone, but the change resolved > > the issue and even sped up the kernel build compared to the setup without > > mTHP swap-in. > > Thanks for trying, this is amazing! > > > > However, I’m still working on a proper patch to address this. The current > > approach: > > > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)) > > > > isn’t sufficient, as it doesn’t cover cases where group A contains group B, and > > we’re operating within group B. The problem occurs not at the boundary of > > group B but at the boundary of group A. > > I am not sure I completely followed this. As MEMCG_CHARGE_BATCH=64, if we are > trying to swapin a 16kB page, we basically check if atleast 64/4 = 16 folios can be > charged to cgroup, which is reasonable. If we try to swapin a 1M folio, we just > check if we can charge atleast 1 folio. Are you saying that checking just 1 folio > is not enough in this case and can still cause thrashing, i.e we should check more? 
My understanding is that cgroups are hierarchical. Even if we don’t hit the memory limit of the folio’s direct memcg, we could still reach the limit of one of its parent memcgs. Imagine a structure like: /sys/fs/cgroup/a/b/c/d If we’re compiling the kernel in d, there’s a chance that while d isn’t at its limit, its parents (c, b, or a) could be. Currently, the check only applies to d. > > If we want to maintain consitency for all folios another option is > mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH * folio_nr_pages(folio) > but I think this is too extreme, we would be checking if 64M can be charged to > cgroup just to swapin 1M. > > > > > I believe there’s still room for improvement. For example, if a 64KB charge > > attempt fails, there’s no need to waste time trying 32KB or 16KB. We can > > directly fall back to 4KB, as 32KB and 16KB will also fail based on our > > margin detection logic. > > > > Yes that makes sense. Would something like below work to fix that: > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c098fd7f5c5e..0a1ec55cc079 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > memcg = get_mem_cgroup_from_mm(mm); > rcu_read_unlock(); > > - ret = charge_memcg(folio, memcg, gfp); > + if (folio_test_large(folio) && > + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > + ret = -ENOMEM; > + else > + ret = charge_memcg(folio, memcg, gfp); > > css_put(&memcg->css); > return ret; > diff --git a/mm/memory.c b/mm/memory.c > index fecdd044bc0b..b6ce6605dc63 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4123,6 +4123,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > pte_t *pte; > gfp_t gfp; > int order; > + int ret; > > /* > * If uffd is active for the vma we need per-page fault fidelity to > @@ -4170,9 +4171,13 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > folio = vma_alloc_folio(gfp, order, vma, addr, true); > if (folio) { > - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > - gfp, entry)) > + ret = mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry); > + if (!ret) { > return folio; > + } else if (ret == -ENOMEM) { > + folio_put(folio); > + goto fallback; > + } > folio_put(folio); > } > order = next_order(&orders, order); > Yes, does it make your kernel build even faster? Thanks Barry
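(A concrete version of the hierarchy described above, with the effective limit on an ancestor rather than on the leaf the build runs in, could be set up roughly like this; the paths and the 4G figure are illustrative assumptions.)

  mkdir -p /sys/fs/cgroup/a/b/c/d
  # enable the memory controller down the path
  for g in /sys/fs/cgroup /sys/fs/cgroup/a /sys/fs/cgroup/a/b /sys/fs/cgroup/a/b/c; do
      echo "+memory" > $g/cgroup.subtree_control
  done
  echo 4G  > /sys/fs/cgroup/a/memory.max          # limit sits on the ancestor
  echo max > /sys/fs/cgroup/a/b/c/d/memory.max    # leaf itself looks unlimited
  echo $$  > /sys/fs/cgroup/a/b/c/d/cgroup.procs
  # a build running in d can now hit a's limit while d's own margin looks fine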
On Thu, Oct 24, 2024 at 9:36 AM Barry Song <21cnbao@gmail.com> wrote: > > On Thu, Oct 24, 2024 at 8:47 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > > > > On 23/10/2024 19:52, Barry Song wrote: > > > On Thu, Oct 24, 2024 at 7:31 AM Usama Arif <usamaarif642@gmail.com> wrote: > > >> > > >> > > >> > > >> On 23/10/2024 19:02, Yosry Ahmed wrote: > > >>> [..] > > >>>>>> I suspect the regression occurs because you're running an edge case > > >>>>>> where the memory cgroup stays nearly full most of the time (this isn't > > >>>>>> an inherent issue with large folio swap-in). As a result, swapping in > > >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The > > >>>>>> next swap-in then recreates the overflow, leading to a repeating > > >>>>>> cycle. > > >>>>>> > > >>>>> > > >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going > > >>>>> on as well. > > >>>>> > > >>>>>> We need a way to stop the cup from repeatedly filling to the brim and > > >>>>>> overflowing. While not a definitive fix, the following change might help > > >>>>>> improve the situation: > > >>>>>> > > >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > >>>>>> > > >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644 > > >>>>>> --- a/mm/memcontrol.c > > >>>>>> +++ b/mm/memcontrol.c > > >>>>>> > > >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio > > >>>>>> *folio, struct mm_struct *mm, > > >>>>>> memcg = get_mem_cgroup_from_mm(mm); > > >>>>>> rcu_read_unlock(); > > >>>>>> > > >>>>>> - ret = charge_memcg(folio, memcg, gfp); > > >>>>>> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < > > >>>>>> MEMCG_CHARGE_BATCH) > > >>>>>> + ret = -ENOMEM; > > >>>>>> + else > > >>>>>> + ret = charge_memcg(folio, memcg, gfp); > > >>>>>> > > >>>>>> css_put(&memcg->css); > > >>>>>> return ret; > > >>>>>> } > > >>>>>> > > >>>>> > > >>>>> The diff makes sense to me. Let me test later today and get back to you. > > >>>>> > > >>>>> Thanks! > > >>>>> > > >>>>>> Please confirm if it makes the kernel build with memcg limitation > > >>>>>> faster. If so, let's > > >>>>>> work together to figure out an official patch :-) The above code hasn't consider > > >>>>>> the parent memcg's overflow, so not an ideal fix. > > >>>>>> > > >>>> > > >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement! 
> > >>>> I think the below might be better to do: > > >>>> > > >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > >>>> index c098fd7f5c5e..0a1ec55cc079 100644 > > >>>> --- a/mm/memcontrol.c > > >>>> +++ b/mm/memcontrol.c > > >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > >>>> memcg = get_mem_cgroup_from_mm(mm); > > >>>> rcu_read_unlock(); > > >>>> > > >>>> - ret = charge_memcg(folio, memcg, gfp); > > >>>> + if (folio_test_large(folio) && > > >>>> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > > >>>> + ret = -ENOMEM; > > >>>> + else > > >>>> + ret = charge_memcg(folio, memcg, gfp); > > >>>> > > >>>> css_put(&memcg->css); > > >>>> return ret; > > >>>> > > >>>> > > >>>> AMD 16K+32K THP=always > > >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > > >>>> real 1m23.038s 1m23.050s 1m22.704s > > >>>> user 53m57.210s 53m53.437s 53m52.577s > > >>>> sys 7m24.592s 7m48.843s 7m22.519s > > >>>> zswpin 612070 999244 815934 > > >>>> zswpout 2226403 2347979 2054980 > > >>>> pgfault 20667366 20481728 20478690 > > >>>> pgmajfault 385887 269117 309702 > > >>>> > > >>>> AMD 16K+32K+64K THP=always > > >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > > >>>> real 1m22.975s 1m23.266s 1m22.549s > > >>>> user 53m51.302s 53m51.069s 53m46.471s > > >>>> sys 7m40.168s 7m57.104s 7m25.012s > > >>>> zswpin 676492 1258573 1225703 > > >>>> zswpout 2449839 2714767 2899178 > > >>>> pgfault 17540746 17296555 17234663 > > >>>> pgmajfault 429629 307495 287859 > > >>>> > > >>> > > >>> Thanks Usama and Barry for looking into this. It seems like this would > > >>> fix a regression with large folio swapin regardless of zswap. Can the > > >>> same result be reproduced on zram without this series? > > >> > > >> > > >> Yes, its a regression in large folio swapin support regardless of zswap/zram. > > >> > > >> Need to do 3 tests, one with probably the below diff to remove large folio support, > > >> one with current upstream and one with upstream + swap thrashing fix. > > >> > > >> We only use zswap and dont have a zram setup (and I am a bit lazy to create one :)). > > >> Any zram volunteers to try this? > > > > > > Hi Usama, > > > > > > I tried a quick experiment: > > > > > > echo 1 > /sys/module/zswap/parameters/enabled > > > echo 0 > /sys/module/zswap/parameters/enabled > > > > > > This was to test the zRAM scenario. Enabling zswap even > > > once disables mTHP swap-in. :) > > > > > > I noticed a similar regression with zRAM alone, but the change resolved > > > the issue and even sped up the kernel build compared to the setup without > > > mTHP swap-in. > > > > Thanks for trying, this is amazing! > > > > > > However, I’m still working on a proper patch to address this. The current > > > approach: > > > > > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)) > > > > > > isn’t sufficient, as it doesn’t cover cases where group A contains group B, and > > > we’re operating within group B. The problem occurs not at the boundary of > > > group B but at the boundary of group A. > > > > I am not sure I completely followed this. As MEMCG_CHARGE_BATCH=64, if we are > > trying to swapin a 16kB page, we basically check if atleast 64/4 = 16 folios can be > > charged to cgroup, which is reasonable. 
If we try to swapin a 1M folio, we just > > check if we can charge atleast 1 folio. Are you saying that checking just 1 folio > > is not enough in this case and can still cause thrashing, i.e we should check more? > > My understanding is that cgroups are hierarchical. Even if we don’t > hit the memory > limit of the folio’s direct memcg, we could still reach the limit of > one of its parent > memcgs. Imagine a structure like: > > /sys/fs/cgroup/a/b/c/d > > If we’re compiling the kernel in d, there’s a chance that while d > isn’t at its limit, its > parents (c, b, or a) could be. Currently, the check only applies to d. To clarify, I mean something like this: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 17af08367c68..cc6d21848ee8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4530,6 +4530,29 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, return 0; } +/* + * When the memory cgroup is nearly full, swapping in large folios can + * easily lead to swap thrashing, as the memcg operates on the edge of + * being full. We maintain a margin to allow for quick fallback to + * smaller folios during the swap-in process. + */ +static inline bool mem_cgroup_swapin_margin_protected(struct mem_cgroup *memcg, + struct folio *folio) +{ + unsigned int nr; + + if (!folio_test_large(folio)) + return false; + + nr = max_t(unsigned int, folio_nr_pages(folio), MEMCG_CHARGE_BATCH); + for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { + if (mem_cgroup_margin(memcg) < nr) + return true; + } + + return false; +} + /** * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. * @folio: folio to charge. @@ -4547,7 +4570,8 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, { struct mem_cgroup *memcg; unsigned short id; - int ret; + int ret = -ENOMEM; + bool margin_prot; if (mem_cgroup_disabled()) return 0; @@ -4557,9 +4581,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, memcg = mem_cgroup_from_id(id); if (!memcg || !css_tryget_online(&memcg->css)) memcg = get_mem_cgroup_from_mm(mm); + margin_prot = mem_cgroup_swapin_margin_protected(memcg, folio); rcu_read_unlock(); - ret = charge_memcg(folio, memcg, gfp); + if (!margin_prot) + ret = charge_memcg(folio, memcg, gfp); css_put(&memcg->css); return ret; > > > > > If we want to maintain consitency for all folios another option is > > mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH * folio_nr_pages(folio) > > but I think this is too extreme, we would be checking if 64M can be charged to > > cgroup just to swapin 1M. > > > > > > > > I believe there’s still room for improvement. For example, if a 64KB charge > > > attempt fails, there’s no need to waste time trying 32KB or 16KB. We can > > > directly fall back to 4KB, as 32KB and 16KB will also fail based on our > > > margin detection logic. > > > > > > > Yes that makes sense. 
Would something like below work to fix that: > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index c098fd7f5c5e..0a1ec55cc079 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > memcg = get_mem_cgroup_from_mm(mm); > > rcu_read_unlock(); > > > > - ret = charge_memcg(folio, memcg, gfp); > > + if (folio_test_large(folio) && > > + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > > + ret = -ENOMEM; > > + else > > + ret = charge_memcg(folio, memcg, gfp); > > > > css_put(&memcg->css); > > return ret; > > diff --git a/mm/memory.c b/mm/memory.c > > index fecdd044bc0b..b6ce6605dc63 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -4123,6 +4123,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > pte_t *pte; > > gfp_t gfp; > > int order; > > + int ret; > > > > /* > > * If uffd is active for the vma we need per-page fault fidelity to > > @@ -4170,9 +4171,13 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > > folio = vma_alloc_folio(gfp, order, vma, addr, true); > > if (folio) { > > - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > > - gfp, entry)) > > + ret = mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry); > > + if (!ret) { > > return folio; > > + } else if (ret == -ENOMEM) { > > + folio_put(folio); > > + goto fallback; > > + } > > folio_put(folio); > > } > > order = next_order(&orders, order); > > > > Yes, does it make your kernel build even faster? Thanks Barry
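(The parent walk in mem_cgroup_swapin_margin_protected() above is asking, at every level, how much headroom remains before the limit; the same thing can be eyeballed from userspace by walking the path and comparing memory.current with memory.max. This is a rough illustration only, using the example hierarchy from earlier.)

  g=/sys/fs/cgroup/a/b/c/d
  while [ "$g" != /sys/fs/cgroup ]; do
      echo "$g: current=$(cat $g/memory.current) max=$(cat $g/memory.max)"
      g=$(dirname "$g")
  done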
On Thu, Oct 24, 2024 at 12:35:48PM +1300, Barry Song wrote: > On Thu, Oct 24, 2024 at 9:36 AM Barry Song <21cnbao@gmail.com> wrote: > > > > On Thu, Oct 24, 2024 at 8:47 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > > > > > > > > On 23/10/2024 19:52, Barry Song wrote: > > > > On Thu, Oct 24, 2024 at 7:31 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > >> > > > >> > > > >> > > > >> On 23/10/2024 19:02, Yosry Ahmed wrote: > > > >>> [..] > > > >>>>>> I suspect the regression occurs because you're running an edge case > > > >>>>>> where the memory cgroup stays nearly full most of the time (this isn't > > > >>>>>> an inherent issue with large folio swap-in). As a result, swapping in > > > >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The > > > >>>>>> next swap-in then recreates the overflow, leading to a repeating > > > >>>>>> cycle. > > > >>>>>> > > > >>>>> > > > >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going > > > >>>>> on as well. > > > >>>>> > > > >>>>>> We need a way to stop the cup from repeatedly filling to the brim and > > > >>>>>> overflowing. While not a definitive fix, the following change might help > > > >>>>>> improve the situation: > > > >>>>>> > > > >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > >>>>>> > > > >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644 > > > >>>>>> --- a/mm/memcontrol.c > > > >>>>>> +++ b/mm/memcontrol.c > > > >>>>>> > > > >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio > > > >>>>>> *folio, struct mm_struct *mm, > > > >>>>>> memcg = get_mem_cgroup_from_mm(mm); > > > >>>>>> rcu_read_unlock(); > > > >>>>>> > > > >>>>>> - ret = charge_memcg(folio, memcg, gfp); > > > >>>>>> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < > > > >>>>>> MEMCG_CHARGE_BATCH) > > > >>>>>> + ret = -ENOMEM; > > > >>>>>> + else > > > >>>>>> + ret = charge_memcg(folio, memcg, gfp); > > > >>>>>> > > > >>>>>> css_put(&memcg->css); > > > >>>>>> return ret; > > > >>>>>> } > > > >>>>>> > > > >>>>> > > > >>>>> The diff makes sense to me. Let me test later today and get back to you. > > > >>>>> > > > >>>>> Thanks! > > > >>>>> > > > >>>>>> Please confirm if it makes the kernel build with memcg limitation > > > >>>>>> faster. If so, let's > > > >>>>>> work together to figure out an official patch :-) The above code hasn't consider > > > >>>>>> the parent memcg's overflow, so not an ideal fix. > > > >>>>>> > > > >>>> > > > >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement! 
> > > >>>> I think the below might be better to do: > > > >>>> > > > >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > >>>> index c098fd7f5c5e..0a1ec55cc079 100644 > > > >>>> --- a/mm/memcontrol.c > > > >>>> +++ b/mm/memcontrol.c > > > >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > > >>>> memcg = get_mem_cgroup_from_mm(mm); > > > >>>> rcu_read_unlock(); > > > >>>> > > > >>>> - ret = charge_memcg(folio, memcg, gfp); > > > >>>> + if (folio_test_large(folio) && > > > >>>> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > > > >>>> + ret = -ENOMEM; > > > >>>> + else > > > >>>> + ret = charge_memcg(folio, memcg, gfp); > > > >>>> > > > >>>> css_put(&memcg->css); > > > >>>> return ret; > > > >>>> > > > >>>> > > > >>>> AMD 16K+32K THP=always > > > >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > > > >>>> real 1m23.038s 1m23.050s 1m22.704s > > > >>>> user 53m57.210s 53m53.437s 53m52.577s > > > >>>> sys 7m24.592s 7m48.843s 7m22.519s > > > >>>> zswpin 612070 999244 815934 > > > >>>> zswpout 2226403 2347979 2054980 > > > >>>> pgfault 20667366 20481728 20478690 > > > >>>> pgmajfault 385887 269117 309702 > > > >>>> > > > >>>> AMD 16K+32K+64K THP=always > > > >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > > > >>>> real 1m22.975s 1m23.266s 1m22.549s > > > >>>> user 53m51.302s 53m51.069s 53m46.471s > > > >>>> sys 7m40.168s 7m57.104s 7m25.012s > > > >>>> zswpin 676492 1258573 1225703 > > > >>>> zswpout 2449839 2714767 2899178 > > > >>>> pgfault 17540746 17296555 17234663 > > > >>>> pgmajfault 429629 307495 287859 > > > >>>> > > > >>> > > > >>> Thanks Usama and Barry for looking into this. It seems like this would > > > >>> fix a regression with large folio swapin regardless of zswap. Can the > > > >>> same result be reproduced on zram without this series? > > > >> > > > >> > > > >> Yes, its a regression in large folio swapin support regardless of zswap/zram. > > > >> > > > >> Need to do 3 tests, one with probably the below diff to remove large folio support, > > > >> one with current upstream and one with upstream + swap thrashing fix. > > > >> > > > >> We only use zswap and dont have a zram setup (and I am a bit lazy to create one :)). > > > >> Any zram volunteers to try this? > > > > > > > > Hi Usama, > > > > > > > > I tried a quick experiment: > > > > > > > > echo 1 > /sys/module/zswap/parameters/enabled > > > > echo 0 > /sys/module/zswap/parameters/enabled > > > > > > > > This was to test the zRAM scenario. Enabling zswap even > > > > once disables mTHP swap-in. :) > > > > > > > > I noticed a similar regression with zRAM alone, but the change resolved > > > > the issue and even sped up the kernel build compared to the setup without > > > > mTHP swap-in. > > > > > > Thanks for trying, this is amazing! > > > > > > > > However, I’m still working on a proper patch to address this. The current > > > > approach: > > > > > > > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)) > > > > > > > > isn’t sufficient, as it doesn’t cover cases where group A contains group B, and > > > > we’re operating within group B. The problem occurs not at the boundary of > > > > group B but at the boundary of group A. > > > > > > I am not sure I completely followed this. 
As MEMCG_CHARGE_BATCH=64, if we are > > > trying to swapin a 16kB page, we basically check if atleast 64/4 = 16 folios can be > > > charged to cgroup, which is reasonable. If we try to swapin a 1M folio, we just > > > check if we can charge atleast 1 folio. Are you saying that checking just 1 folio > > > is not enough in this case and can still cause thrashing, i.e we should check more? > > > > My understanding is that cgroups are hierarchical. Even if we don’t > > hit the memory > > limit of the folio’s direct memcg, we could still reach the limit of > > one of its parent > > memcgs. Imagine a structure like: > > > > /sys/fs/cgroup/a/b/c/d > > > > If we’re compiling the kernel in d, there’s a chance that while d > > isn’t at its limit, its > > parents (c, b, or a) could be. Currently, the check only applies to d. > > To clarify, I mean something like this: > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 17af08367c68..cc6d21848ee8 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4530,6 +4530,29 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > return 0; > } > > +/* > + * When the memory cgroup is nearly full, swapping in large folios can > + * easily lead to swap thrashing, as the memcg operates on the edge of > + * being full. We maintain a margin to allow for quick fallback to > + * smaller folios during the swap-in process. > + */ > +static inline bool mem_cgroup_swapin_margin_protected(struct mem_cgroup *memcg, > + struct folio *folio) > +{ > + unsigned int nr; > + > + if (!folio_test_large(folio)) > + return false; > + > + nr = max_t(unsigned int, folio_nr_pages(folio), MEMCG_CHARGE_BATCH); > + for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) { > + if (mem_cgroup_margin(memcg) < nr) > + return true; > + } > + > + return false; > +} > + > /** > * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. > * @folio: folio to charge. > @@ -4547,7 +4570,8 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > { > struct mem_cgroup *memcg; > unsigned short id; > - int ret; > + int ret = -ENOMEM; > + bool margin_prot; > > if (mem_cgroup_disabled()) > return 0; > @@ -4557,9 +4581,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > memcg = mem_cgroup_from_id(id); > if (!memcg || !css_tryget_online(&memcg->css)) > memcg = get_mem_cgroup_from_mm(mm); > + margin_prot = mem_cgroup_swapin_margin_protected(memcg, folio); > rcu_read_unlock(); > > - ret = charge_memcg(folio, memcg, gfp); > + if (!margin_prot) > + ret = charge_memcg(folio, memcg, gfp); > > css_put(&memcg->css); > return ret; I'm not quite following. The charging code DOES the margin check. If you just want to avoid reclaim, pass gfp without __GFP_DIRECT_RECLAIM, and it will return -ENOMEM if there is no margin. alloc_swap_folio() passes the THP mask, which should not include the reclaim flag per default (GFP_TRANSHUGE_LIGHT). Unless you run with defrag=always. Is that what's going on?
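(Whether __GFP_DIRECT_RECLAIM ends up in that THP gfp mask is governed by the global defrag knob, so it is worth confirming what the test systems are actually running with; this is just the standard sysfs interface, shown for reference.)

  cat /sys/kernel/mm/transparent_hugepage/defrag
  # e.g.: always defer defer+madvise [madvise] never
  # only settings that put __GFP_DIRECT_RECLAIM into the mask (notably
  # "always") let the memcg charge reclaim; otherwise, per the point above,
  # mem_cgroup_swapin_charge_folio() returns -ENOMEM when there is no margin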
On Fri, Oct 25, 2024 at 3:29 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Oct 24, 2024 at 12:35:48PM +1300, Barry Song wrote: > > On Thu, Oct 24, 2024 at 9:36 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Thu, Oct 24, 2024 at 8:47 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > > > > > > > > > > > > On 23/10/2024 19:52, Barry Song wrote: > > > > > On Thu, Oct 24, 2024 at 7:31 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > >> > > > > >> > > > > >> > > > > >> On 23/10/2024 19:02, Yosry Ahmed wrote: > > > > >>> [..] > > > > >>>>>> I suspect the regression occurs because you're running an edge case > > > > >>>>>> where the memory cgroup stays nearly full most of the time (this isn't > > > > >>>>>> an inherent issue with large folio swap-in). As a result, swapping in > > > > >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The > > > > >>>>>> next swap-in then recreates the overflow, leading to a repeating > > > > >>>>>> cycle. > > > > >>>>>> > > > > >>>>> > > > > >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going > > > > >>>>> on as well. > > > > >>>>> > > > > >>>>>> We need a way to stop the cup from repeatedly filling to the brim and > > > > >>>>>> overflowing. While not a definitive fix, the following change might help > > > > >>>>>> improve the situation: > > > > >>>>>> > > > > >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > >>>>>> > > > > >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644 > > > > >>>>>> --- a/mm/memcontrol.c > > > > >>>>>> +++ b/mm/memcontrol.c > > > > >>>>>> > > > > >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio > > > > >>>>>> *folio, struct mm_struct *mm, > > > > >>>>>> memcg = get_mem_cgroup_from_mm(mm); > > > > >>>>>> rcu_read_unlock(); > > > > >>>>>> > > > > >>>>>> - ret = charge_memcg(folio, memcg, gfp); > > > > >>>>>> + if (folio_test_large(folio) && mem_cgroup_margin(memcg) < > > > > >>>>>> MEMCG_CHARGE_BATCH) > > > > >>>>>> + ret = -ENOMEM; > > > > >>>>>> + else > > > > >>>>>> + ret = charge_memcg(folio, memcg, gfp); > > > > >>>>>> > > > > >>>>>> css_put(&memcg->css); > > > > >>>>>> return ret; > > > > >>>>>> } > > > > >>>>>> > > > > >>>>> > > > > >>>>> The diff makes sense to me. Let me test later today and get back to you. > > > > >>>>> > > > > >>>>> Thanks! > > > > >>>>> > > > > >>>>>> Please confirm if it makes the kernel build with memcg limitation > > > > >>>>>> faster. If so, let's > > > > >>>>>> work together to figure out an official patch :-) The above code hasn't consider > > > > >>>>>> the parent memcg's overflow, so not an ideal fix. > > > > >>>>>> > > > > >>>> > > > > >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement! 
> > > > >>>> I think the below might be better to do: > > > > >>>> > > > > >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > >>>> index c098fd7f5c5e..0a1ec55cc079 100644 > > > > >>>> --- a/mm/memcontrol.c > > > > >>>> +++ b/mm/memcontrol.c > > > > >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > > > > >>>> memcg = get_mem_cgroup_from_mm(mm); > > > > >>>> rcu_read_unlock(); > > > > >>>> > > > > >>>> - ret = charge_memcg(folio, memcg, gfp); > > > > >>>> + if (folio_test_large(folio) && > > > > >>>> + mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))) > > > > >>>> + ret = -ENOMEM; > > > > >>>> + else > > > > >>>> + ret = charge_memcg(folio, memcg, gfp); > > > > >>>> > > > > >>>> css_put(&memcg->css); > > > > >>>> return ret; > > > > >>>> > > > > >>>> > > > > >>>> AMD 16K+32K THP=always > > > > >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > > > > >>>> real 1m23.038s 1m23.050s 1m22.704s > > > > >>>> user 53m57.210s 53m53.437s 53m52.577s > > > > >>>> sys 7m24.592s 7m48.843s 7m22.519s > > > > >>>> zswpin 612070 999244 815934 > > > > >>>> zswpout 2226403 2347979 2054980 > > > > >>>> pgfault 20667366 20481728 20478690 > > > > >>>> pgmajfault 385887 269117 309702 > > > > >>>> > > > > >>>> AMD 16K+32K+64K THP=always > > > > >>>> metric mm-unstable mm-unstable + large folio zswapin series mm-unstable + large folio zswapin + no swap thrashing fix > > > > >>>> real 1m22.975s 1m23.266s 1m22.549s > > > > >>>> user 53m51.302s 53m51.069s 53m46.471s > > > > >>>> sys 7m40.168s 7m57.104s 7m25.012s > > > > >>>> zswpin 676492 1258573 1225703 > > > > >>>> zswpout 2449839 2714767 2899178 > > > > >>>> pgfault 17540746 17296555 17234663 > > > > >>>> pgmajfault 429629 307495 287859 > > > > >>>> > > > > >>> > > > > >>> Thanks Usama and Barry for looking into this. It seems like this would > > > > >>> fix a regression with large folio swapin regardless of zswap. Can the > > > > >>> same result be reproduced on zram without this series? > > > > >> > > > > >> > > > > >> Yes, its a regression in large folio swapin support regardless of zswap/zram. > > > > >> > > > > >> Need to do 3 tests, one with probably the below diff to remove large folio support, > > > > >> one with current upstream and one with upstream + swap thrashing fix. > > > > >> > > > > >> We only use zswap and dont have a zram setup (and I am a bit lazy to create one :)). > > > > >> Any zram volunteers to try this? > > > > > > > > > > Hi Usama, > > > > > > > > > > I tried a quick experiment: > > > > > > > > > > echo 1 > /sys/module/zswap/parameters/enabled > > > > > echo 0 > /sys/module/zswap/parameters/enabled > > > > > > > > > > This was to test the zRAM scenario. Enabling zswap even > > > > > once disables mTHP swap-in. :) > > > > > > > > > > I noticed a similar regression with zRAM alone, but the change resolved > > > > > the issue and even sped up the kernel build compared to the setup without > > > > > mTHP swap-in. > > > > > > > > Thanks for trying, this is amazing! > > > > > > > > > > However, I’m still working on a proper patch to address this. The current > > > > > approach: > > > > > > > > > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)) > > > > > > > > > > isn’t sufficient, as it doesn’t cover cases where group A contains group B, and > > > > > we’re operating within group B. 
> > > > > The problem occurs not at the boundary of group B but at the boundary
> > > > > of group A.
> > > >
> > > > I am not sure I completely followed this. As MEMCG_CHARGE_BATCH=64, if we are
> > > > trying to swapin a 16kB page, we basically check if at least 64/4 = 16 folios can be
> > > > charged to the cgroup, which is reasonable. If we try to swapin a 1M folio, we just
> > > > check if we can charge at least 1 folio. Are you saying that checking just 1 folio
> > > > is not enough in this case and can still cause thrashing, i.e. we should check more?
> > >
> > > My understanding is that cgroups are hierarchical. Even if we don’t hit the
> > > memory limit of the folio’s direct memcg, we could still reach the limit of
> > > one of its parent memcgs. Imagine a structure like:
> > >
> > > /sys/fs/cgroup/a/b/c/d
> > >
> > > If we’re compiling the kernel in d, there’s a chance that while d isn’t at
> > > its limit, its parents (c, b, or a) could be. Currently, the check only
> > > applies to d.
> >
> > To clarify, I mean something like this:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 17af08367c68..cc6d21848ee8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4530,6 +4530,29 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >          return 0;
> >  }
> >
> > +/*
> > + * When the memory cgroup is nearly full, swapping in large folios can
> > + * easily lead to swap thrashing, as the memcg operates on the edge of
> > + * being full. We maintain a margin to allow for quick fallback to
> > + * smaller folios during the swap-in process.
> > + */
> > +static inline bool mem_cgroup_swapin_margin_protected(struct mem_cgroup *memcg,
> > +                                                      struct folio *folio)
> > +{
> > +        unsigned int nr;
> > +
> > +        if (!folio_test_large(folio))
> > +                return false;
> > +
> > +        nr = max_t(unsigned int, folio_nr_pages(folio), MEMCG_CHARGE_BATCH);
> > +        for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> > +                if (mem_cgroup_margin(memcg) < nr)
> > +                        return true;
> > +        }
> > +
> > +        return false;
> > +}
> > +
> >  /**
> >   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
> >   * @folio: folio to charge.
> > @@ -4547,7 +4570,8 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >  {
> >          struct mem_cgroup *memcg;
> >          unsigned short id;
> > -        int ret;
> > +        int ret = -ENOMEM;
> > +        bool margin_prot;
> >
> >          if (mem_cgroup_disabled())
> >                  return 0;
> > @@ -4557,9 +4581,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >          memcg = mem_cgroup_from_id(id);
> >          if (!memcg || !css_tryget_online(&memcg->css))
> >                  memcg = get_mem_cgroup_from_mm(mm);
> > +        margin_prot = mem_cgroup_swapin_margin_protected(memcg, folio);
> >          rcu_read_unlock();
> >
> > -        ret = charge_memcg(folio, memcg, gfp);
> > +        if (!margin_prot)
> > +                ret = charge_memcg(folio, memcg, gfp);
> >
> >          css_put(&memcg->css);
> >          return ret;
>
> I'm not quite following.
>
> The charging code DOES the margin check. If you just want to avoid
> reclaim, pass gfp without __GFP_DIRECT_RECLAIM, and it will return
> -ENOMEM if there is no margin.
>
> alloc_swap_folio() passes the THP mask, which should not include the
> reclaim flag per default (GFP_TRANSHUGE_LIGHT). Unless you run with
> defrag=always. Is that what's going on?

No, quite sure "defrag=never" can just achieve the same result.

Imagine we only have small folios: each time reclamation occurs, we have
at least a SWAP_CLUSTER_MAX buffer before the next reclamation is
triggered.

	.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),

However, with large folios, we can quickly exhaust the SWAP_CLUSTER_MAX
buffer and reach the next reclamation point. Once we consume
SWAP_CLUSTER_MAX - 1, the mem_cgroup_swapin_charge_folio() call for the
final small folio with GFP_KERNEL will trigger reclamation.

	if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
					GFP_KERNEL, entry)) {

Thanks
Barry
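
To make the buffer arithmetic above concrete, here is a minimal userspace
sketch (plain C, not kernel code). It assumes SWAP_CLUSTER_MAX = 32 pages and
MEMCG_CHARGE_BATCH = 64 pages, which match common kernel defaults, a 4K base
page, and a memcg sitting right at its limit after a reclaim pass; the
constant and helper names below are illustrative only, not kernel APIs.

/*
 * Sketch only: estimates how many swap-in charges fit into one reclaim
 * pass worth of headroom for different folio sizes, and what margin the
 * proposed large-folio check would demand before charging.
 * Assumed values: 4K base page, SWAP_CLUSTER_MAX = 32, MEMCG_CHARGE_BATCH = 64.
 */
#include <stdio.h>

#define PAGE_SIZE_KB		4
#define SWAP_CLUSTER_MAX	32	/* pages reclaimed per pass, at minimum */
#define MEMCG_CHARGE_BATCH	64	/* margin the proposed check requires */

static unsigned int max_uint(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

int main(void)
{
	/* folio sizes discussed in the thread: 4K, 16K, 64K, 1M */
	unsigned int folio_kb[] = { 4, 16, 64, 1024 };

	for (unsigned int i = 0; i < sizeof(folio_kb) / sizeof(folio_kb[0]); i++) {
		unsigned int nr_pages = folio_kb[i] / PAGE_SIZE_KB;
		/*
		 * Swap-in faults absorbed by one reclaim pass of headroom;
		 * once nr_pages >= SWAP_CLUSTER_MAX, a single fault already
		 * blows through it and the next fault reclaims again.
		 */
		unsigned int faults_per_pass = max_uint(SWAP_CLUSTER_MAX / nr_pages, 1);
		/* margin the proposed large-folio check would require */
		unsigned int required_margin = nr_pages > 1 ?
				max_uint(MEMCG_CHARGE_BATCH, nr_pages) : 0;

		printf("%4uK folio: ~%2u fault(s) per SWAP_CLUSTER_MAX headroom, required margin %u pages\n",
		       folio_kb[i], faults_per_pass, required_margin);
	}
	return 0;
}

Under these assumptions a 4K swap-in gets roughly 32 faults of headroom per
reclaim pass, while a single 1M mTHP charge alone exceeds it, which is the
thrashing pattern described earlier in the thread.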
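
The hierarchy concern raised earlier in the thread (the /sys/fs/cgroup/a/b/c/d
example) can be sketched the same way. The toy model below walks a leaf-to-root
chain of (limit, usage) pairs, mirroring the shape of the proposed
mem_cgroup_swapin_margin_protected() walk; the struct and function names are
made up for illustration and are not kernel APIs, and the numbers are arbitrary.

/*
 * Illustrative model only: a large-folio swap-in should fall back to
 * small folios if *any* ancestor is too close to its limit, not just
 * the leaf cgroup the task runs in. Toy types, arbitrary values.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_memcg {
	const char *name;
	unsigned long limit;		/* pages */
	unsigned long usage;		/* pages */
	struct toy_memcg *parent;
};

static unsigned long margin(const struct toy_memcg *m)
{
	return m->limit > m->usage ? m->limit - m->usage : 0;
}

/* Mirrors the shape of the proposed check: walk leaf -> root. */
static bool swapin_margin_protected(struct toy_memcg *memcg, unsigned long nr)
{
	for (; memcg; memcg = memcg->parent)
		if (margin(memcg) < nr)
			return true;	/* some ancestor is nearly full */
	return false;
}

int main(void)
{
	struct toy_memcg a = { "a", 1 << 18, (1 << 18) - 16, NULL };	/* almost full */
	struct toy_memcg b = { "b", 1 << 19, 1 << 17, &a };
	struct toy_memcg c = { "c", 1 << 19, 1 << 16, &b };
	struct toy_memcg d = { "d", 1 << 19, 1 << 15, &c };		/* plenty of room */

	/* a 1M folio on 4K pages needs 256 pages of margin */
	printf("large swap-in should fall back: %s\n",
	       swapin_margin_protected(&d, 256) ? "yes" : "no");
	return 0;
}

Here the leaf d has plenty of margin, but the near-full ancestor a is what
forces the fallback to small folios, which is why checking only the folio's
own memcg is not enough.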