Message ID | 20241107101005.69121-1-21cnbao@gmail.com (mailing list archive)
---|---
Series | mTHP-friendly compression in zsmalloc and zram based on multi-pages
Hi, Barry, Barry Song <21cnbao@gmail.com> writes: > From: Barry Song <v-songbaohua@oppo.com> > > When large folios are compressed at a larger granularity, we observe > a notable reduction in CPU usage and a significant improvement in > compression ratios. > > mTHP's ability to be swapped out without splitting and swapped back in > as a whole allows compression and decompression at larger granularities. > > This patchset enhances zsmalloc and zram by adding support for dividing > large folios into multi-page blocks, typically configured with a > 2-order granularity. Without this patchset, a large folio is always > divided into `nr_pages` 4KiB blocks. > > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > setting, where the default of 2 allows all anonymous THP to benefit. > > Examples include: > * A 16KiB large folio will be compressed and stored as a single 16KiB > block. > * A 64KiB large folio will be compressed and stored as four 16KiB > blocks. > > For example, swapping out and swapping in 100MiB of typical anonymous > data 100 times (with 16KB mTHP enabled) using zstd yields the following > results: > > w/o patches w/ patches > swap-out time(ms) 68711 49908 > swap-in time(ms) 30687 20685 > compression ratio 20.49% 16.9% The data looks good. Thanks! Have you considered the situation that the large folio fails to be allocated during swap-in? It's possible because the memory may be very fragmented. > -v2: > While it is not mature yet, I know some people are waiting for > an update :-) > * Fixed some stability issues. > * rebase againest the latest mm-unstable. > * Set default order to 2 which benefits all anon mTHP. > * multipages ZsPageMovable is not supported yet. > > Tangquan Zheng (2): > mm: zsmalloc: support objects compressed based on multiple pages > zram: support compression at the granularity of multi-pages > > drivers/block/zram/Kconfig | 9 + > drivers/block/zram/zcomp.c | 17 +- > drivers/block/zram/zcomp.h | 12 +- > drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++--- > drivers/block/zram/zram_drv.h | 45 ++++ > include/linux/zsmalloc.h | 10 +- > mm/Kconfig | 18 ++ > mm/zsmalloc.c | 232 +++++++++++++----- > 8 files changed, 699 insertions(+), 94 deletions(-) -- Best Regards, Huang, Ying
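To make the block-splitting described in the quoted cover letter concrete: with the default ZSMALLOC_MULTI_PAGES_ORDER of 2, the compression unit is 4 pages (16 KiB), and a large folio is simply carved into folio_size / 16 KiB blocks. Below is a minimal standalone sketch of that arithmetic; every name other than ZSMALLOC_MULTI_PAGES_ORDER is illustrative and not taken from the patchset.

#include <stdio.h>

#define PAGE_SIZE_BYTES            4096UL
#define ZSMALLOC_MULTI_PAGES_ORDER 2
/* compression unit: 4 pages = 16 KiB with the default order of 2 */
#define MULTI_PAGES_SIZE (PAGE_SIZE_BYTES << ZSMALLOC_MULTI_PAGES_ORDER)

int main(void)
{
	unsigned long folio_kib[] = { 16, 64, 2048 };	/* 16K/64K mTHP, 2M THP */

	for (unsigned long i = 0; i < sizeof(folio_kib) / sizeof(folio_kib[0]); i++) {
		unsigned long bytes  = folio_kib[i] * 1024;
		unsigned long blocks = bytes / MULTI_PAGES_SIZE;

		printf("%4lu KiB folio -> %3lu block(s) of %lu KiB\n",
		       folio_kib[i], blocks, MULTI_PAGES_SIZE / 1024);
	}
	return 0;
}

With the cover letter's examples, this prints one 16 KiB block for a 16 KiB folio and four 16 KiB blocks for a 64 KiB folio.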
On Fri, Nov 8, 2024 at 6:23 PM Huang, Ying <ying.huang@intel.com> wrote: > > Hi, Barry, > > Barry Song <21cnbao@gmail.com> writes: > > > From: Barry Song <v-songbaohua@oppo.com> > > > > When large folios are compressed at a larger granularity, we observe > > a notable reduction in CPU usage and a significant improvement in > > compression ratios. > > > > mTHP's ability to be swapped out without splitting and swapped back in > > as a whole allows compression and decompression at larger granularities. > > > > This patchset enhances zsmalloc and zram by adding support for dividing > > large folios into multi-page blocks, typically configured with a > > 2-order granularity. Without this patchset, a large folio is always > > divided into `nr_pages` 4KiB blocks. > > > > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > > setting, where the default of 2 allows all anonymous THP to benefit. > > > > Examples include: > > * A 16KiB large folio will be compressed and stored as a single 16KiB > > block. > > * A 64KiB large folio will be compressed and stored as four 16KiB > > blocks. > > > > For example, swapping out and swapping in 100MiB of typical anonymous > > data 100 times (with 16KB mTHP enabled) using zstd yields the following > > results: > > > > w/o patches w/ patches > > swap-out time(ms) 68711 49908 > > swap-in time(ms) 30687 20685 > > compression ratio 20.49% 16.9% > > The data looks good. Thanks! > > Have you considered the situation that the large folio fails to be > allocated during swap-in? It's possible because the memory may be very > fragmented. That's correct, good question. On phones, we use a large folio pool to maintain a relatively high allocation success rate. When mTHP allocation fails, we have a workaround to allocate nr_pages of small folios and map them together to avoid partial reads. This ensures that the benefits of larger block compression and decompression are consistently maintained. That was the code running on production phones. We also previously experimented with maintaining multiple buffers for decompressed large blocks in zRAM, allowing upcoming do_swap_page() calls to use them when falling back to small folios. In this setup, the buffers achieved a high hit rate, though I don’t recall the exact number. I'm concerned that this fault-around-like fallback to nr_pages small folios may not gain traction upstream. Do you have any suggestions for improvement? > > > -v2: > > While it is not mature yet, I know some people are waiting for > > an update :-) > > * Fixed some stability issues. > > * rebase againest the latest mm-unstable. > > * Set default order to 2 which benefits all anon mTHP. > > * multipages ZsPageMovable is not supported yet. > > > > Tangquan Zheng (2): > > mm: zsmalloc: support objects compressed based on multiple pages > > zram: support compression at the granularity of multi-pages > > > > drivers/block/zram/Kconfig | 9 + > > drivers/block/zram/zcomp.c | 17 +- > > drivers/block/zram/zcomp.h | 12 +- > > drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++--- > > drivers/block/zram/zram_drv.h | 45 ++++ > > include/linux/zsmalloc.h | 10 +- > > mm/Kconfig | 18 ++ > > mm/zsmalloc.c | 232 +++++++++++++----- > > 8 files changed, 699 insertions(+), 94 deletions(-) > > -- > Best Regards, > Huang, Ying Thanks barry
On 08/11/2024 06:51, Barry Song wrote: > On Fri, Nov 8, 2024 at 6:23 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Hi, Barry, >> >> Barry Song <21cnbao@gmail.com> writes: >> >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> When large folios are compressed at a larger granularity, we observe >>> a notable reduction in CPU usage and a significant improvement in >>> compression ratios. >>> >>> mTHP's ability to be swapped out without splitting and swapped back in >>> as a whole allows compression and decompression at larger granularities. >>> >>> This patchset enhances zsmalloc and zram by adding support for dividing >>> large folios into multi-page blocks, typically configured with a >>> 2-order granularity. Without this patchset, a large folio is always >>> divided into `nr_pages` 4KiB blocks. >>> >>> The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` >>> setting, where the default of 2 allows all anonymous THP to benefit. >>> >>> Examples include: >>> * A 16KiB large folio will be compressed and stored as a single 16KiB >>> block. >>> * A 64KiB large folio will be compressed and stored as four 16KiB >>> blocks. >>> >>> For example, swapping out and swapping in 100MiB of typical anonymous >>> data 100 times (with 16KB mTHP enabled) using zstd yields the following >>> results: >>> >>> w/o patches w/ patches >>> swap-out time(ms) 68711 49908 >>> swap-in time(ms) 30687 20685 >>> compression ratio 20.49% 16.9% >> >> The data looks good. Thanks! >> >> Have you considered the situation that the large folio fails to be >> allocated during swap-in? It's possible because the memory may be very >> fragmented. > > That's correct, good question. On phones, we use a large folio pool to maintain > a relatively high allocation success rate. When mTHP allocation fails, we have > a workaround to allocate nr_pages of small folios and map them together to > avoid partial reads. This ensures that the benefits of larger block compression > and decompression are consistently maintained. That was the code running > on production phones. > Thanks for sending the v2! How is the large folio pool maintained. I dont think there is something in upstream kernel for this? The only thing that I saw on the mailing list is TAO for pmd-mappable THPs only? I think that was about 7-8 months ago and wasn't merged? The workaround to allocate nr_pages of small folios and map them together to avoid partial reads is also not upstream, right? Do you have any data how this would perform with the upstream kernel, i.e. without a large folio pool and the workaround and if large granularity compression is worth having without those patches? Thanks, Usama > We also previously experimented with maintaining multiple buffers for > decompressed > large blocks in zRAM, allowing upcoming do_swap_page() calls to use them when > falling back to small folios. In this setup, the buffers achieved a > high hit rate, though > I don’t recall the exact number. > > I'm concerned that this fault-around-like fallback to nr_pages small > folios may not > gain traction upstream. Do you have any suggestions for improvement? > >> >>> -v2: >>> While it is not mature yet, I know some people are waiting for >>> an update :-) >>> * Fixed some stability issues. >>> * rebase againest the latest mm-unstable. >>> * Set default order to 2 which benefits all anon mTHP. >>> * multipages ZsPageMovable is not supported yet. 
>>> >>> Tangquan Zheng (2): >>> mm: zsmalloc: support objects compressed based on multiple pages >>> zram: support compression at the granularity of multi-pages >>> >>> drivers/block/zram/Kconfig | 9 + >>> drivers/block/zram/zcomp.c | 17 +- >>> drivers/block/zram/zcomp.h | 12 +- >>> drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++--- >>> drivers/block/zram/zram_drv.h | 45 ++++ >>> include/linux/zsmalloc.h | 10 +- >>> mm/Kconfig | 18 ++ >>> mm/zsmalloc.c | 232 +++++++++++++----- >>> 8 files changed, 699 insertions(+), 94 deletions(-) >> >> -- >> Best Regards, >> Huang, Ying > > Thanks > barry
On Thu, Nov 7, 2024 at 2:10 AM Barry Song <21cnbao@gmail.com> wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > When large folios are compressed at a larger granularity, we observe > a notable reduction in CPU usage and a significant improvement in > compression ratios. > > mTHP's ability to be swapped out without splitting and swapped back in > as a whole allows compression and decompression at larger granularities. > > This patchset enhances zsmalloc and zram by adding support for dividing > large folios into multi-page blocks, typically configured with a > 2-order granularity. Without this patchset, a large folio is always > divided into `nr_pages` 4KiB blocks. > > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > setting, where the default of 2 allows all anonymous THP to benefit. > > Examples include: > * A 16KiB large folio will be compressed and stored as a single 16KiB > block. > * A 64KiB large folio will be compressed and stored as four 16KiB > blocks. > > For example, swapping out and swapping in 100MiB of typical anonymous > data 100 times (with 16KB mTHP enabled) using zstd yields the following > results: > > w/o patches w/ patches > swap-out time(ms) 68711 49908 > swap-in time(ms) 30687 20685 > compression ratio 20.49% 16.9% The data looks very promising :) My understanding is it also results in memory saving as well right? Since zstd operates better on bigger inputs. Is there any end-to-end benchmarking? My intuition is that this patch series overall will improve the situations, assuming we don't fallback to individual zero order page swapin too often, but it'd be nice if there is some data backing this intuition (especially with the upstream setup, i.e without any private patches). If the fallback scenario happens frequently, the patch series can make a page fault more expensive (since we have to decompress the entire chunk, and discard everything but the single page being loaded in), so it might make a difference. Not super qualified to comment on zram changes otherwise - just a casual observer to see if we can adopt this for zswap. zswap has the added complexity of not supporting THP zswap in (until Usama's patch series lands), and the presence of mixed backing states (due to zswap writeback), increasing the likelihood of fallback :)
On Tue, Nov 12, 2024 at 5:43 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 08/11/2024 06:51, Barry Song wrote: > > On Fri, Nov 8, 2024 at 6:23 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Hi, Barry, > >> > >> Barry Song <21cnbao@gmail.com> writes: > >> > >>> From: Barry Song <v-songbaohua@oppo.com> > >>> > >>> When large folios are compressed at a larger granularity, we observe > >>> a notable reduction in CPU usage and a significant improvement in > >>> compression ratios. > >>> > >>> mTHP's ability to be swapped out without splitting and swapped back in > >>> as a whole allows compression and decompression at larger granularities. > >>> > >>> This patchset enhances zsmalloc and zram by adding support for dividing > >>> large folios into multi-page blocks, typically configured with a > >>> 2-order granularity. Without this patchset, a large folio is always > >>> divided into `nr_pages` 4KiB blocks. > >>> > >>> The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > >>> setting, where the default of 2 allows all anonymous THP to benefit. > >>> > >>> Examples include: > >>> * A 16KiB large folio will be compressed and stored as a single 16KiB > >>> block. > >>> * A 64KiB large folio will be compressed and stored as four 16KiB > >>> blocks. > >>> > >>> For example, swapping out and swapping in 100MiB of typical anonymous > >>> data 100 times (with 16KB mTHP enabled) using zstd yields the following > >>> results: > >>> > >>> w/o patches w/ patches > >>> swap-out time(ms) 68711 49908 > >>> swap-in time(ms) 30687 20685 > >>> compression ratio 20.49% 16.9% > >> > >> The data looks good. Thanks! > >> > >> Have you considered the situation that the large folio fails to be > >> allocated during swap-in? It's possible because the memory may be very > >> fragmented. > > > > That's correct, good question. On phones, we use a large folio pool to maintain > > a relatively high allocation success rate. When mTHP allocation fails, we have > > a workaround to allocate nr_pages of small folios and map them together to > > avoid partial reads. This ensures that the benefits of larger block compression > > and decompression are consistently maintained. That was the code running > > on production phones. > > > > Thanks for sending the v2! > > How is the large folio pool maintained. I dont think there is something in upstream In production phones, we have extended the migration type for mTHP separately during Linux boot[1]. [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c#L2089 These pageblocks have their own migration type, resulting in a separate buddy free list. We prevent order-0 allocations from drawing memory from this pool, ensuring a relatively high success rate for mTHP allocations. In one instance, phones reported an mTHP allocation success rate of less than 5% after running for a few hours without this kind of reservation mechanism. Therefore, we need an upstream solution in the kernel to ensure sustainable mTHP support across all scenarios. > kernel for this? The only thing that I saw on the mailing list is TAO for pmd-mappable > THPs only? I think that was about 7-8 months ago and wasn't merged? TAO supports mTHP as long as it can be configured through the bootcmd: nomerge=25%,4 This means we are providing a 4-order mTHP pool with 25% of total memory reserved. 
Note that the Android common kernel has already integrated TAO[2][3], so we are trying to use TAO to replace our previous approach of extending the migration type. [2] https://android.googlesource.com/kernel/common/+/c1ff6dcf209e4abc23584d2cd117f725421bccac [3] https://android.googlesource.com/kernel/common/+/066872d13d0c0b076785f0b794b650de0941c1c9 > The workaround to allocate nr_pages of small folios and map them > together to avoid partial reads is also not upstream, right? Correct. It's running on the phones[4][5], but I still don't know how to handle it upstream properly. [4] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/memory.c#L4656 [5] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/memory.c#L5439 > > Do you have any data how this would perform with the upstream kernel, i.e. without > a large folio pool and the workaround and if large granularity compression is worth having > without those patches? I’d say large granularity compression isn’t a problem, but large granularity decompression could be. The worst case would be if we swap out a large block, such as 16KB, but end up swapping in 4 times due to allocation failures, falling back to smaller folios. In this scenario, we would need to perform three redundant decompressions. I will work with Tangquan to provide this data this week. But once we swap in small folios, they remain small (we can't collapse them for mTHP). As a result, the next time, they will be swapped out and swapped in as small folios. Therefore, this potential loss is one-time. > > Thanks, > Usama > > > We also previously experimented with maintaining multiple buffers for > > decompressed > > large blocks in zRAM, allowing upcoming do_swap_page() calls to use them when > > falling back to small folios. In this setup, the buffers achieved a > > high hit rate, though > > I don’t recall the exact number. > > > > I'm concerned that this fault-around-like fallback to nr_pages small > > folios may not > > gain traction upstream. Do you have any suggestions for improvement? > > > >> > >>> -v2: > >>> While it is not mature yet, I know some people are waiting for > >>> an update :-) > >>> * Fixed some stability issues. > >>> * rebase againest the latest mm-unstable. > >>> * Set default order to 2 which benefits all anon mTHP. > >>> * multipages ZsPageMovable is not supported yet. > >>> > >>> Tangquan Zheng (2): > >>> mm: zsmalloc: support objects compressed based on multiple pages > >>> zram: support compression at the granularity of multi-pages > >>> > >>> drivers/block/zram/Kconfig | 9 + > >>> drivers/block/zram/zcomp.c | 17 +- > >>> drivers/block/zram/zcomp.h | 12 +- > >>> drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++--- > >>> drivers/block/zram/zram_drv.h | 45 ++++ > >>> include/linux/zsmalloc.h | 10 +- > >>> mm/Kconfig | 18 ++ > >>> mm/zsmalloc.c | 232 +++++++++++++----- > >>> 8 files changed, 699 insertions(+), 94 deletions(-) > >> > >> -- > >> Best Regards, > >> Huang, Ying > > Thanks barry
On Tue, Nov 12, 2024 at 8:30 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Thu, Nov 7, 2024 at 2:10 AM Barry Song <21cnbao@gmail.com> wrote: > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > When large folios are compressed at a larger granularity, we observe > > a notable reduction in CPU usage and a significant improvement in > > compression ratios. > > > > mTHP's ability to be swapped out without splitting and swapped back in > > as a whole allows compression and decompression at larger granularities. > > > > This patchset enhances zsmalloc and zram by adding support for dividing > > large folios into multi-page blocks, typically configured with a > > 2-order granularity. Without this patchset, a large folio is always > > divided into `nr_pages` 4KiB blocks. > > > > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > > setting, where the default of 2 allows all anonymous THP to benefit. > > > > Examples include: > > * A 16KiB large folio will be compressed and stored as a single 16KiB > > block. > > * A 64KiB large folio will be compressed and stored as four 16KiB > > blocks. > > > > For example, swapping out and swapping in 100MiB of typical anonymous > > data 100 times (with 16KB mTHP enabled) using zstd yields the following > > results: > > > > w/o patches w/ patches > > swap-out time(ms) 68711 49908 > > swap-in time(ms) 30687 20685 > > compression ratio 20.49% 16.9% > > The data looks very promising :) My understanding is it also results > in memory saving as well right? Since zstd operates better on bigger > inputs. > > Is there any end-to-end benchmarking? My intuition is that this patch > series overall will improve the situations, assuming we don't fallback > to individual zero order page swapin too often, but it'd be nice if > there is some data backing this intuition (especially with the > upstream setup, i.e without any private patches). If the fallback > scenario happens frequently, the patch series can make a page fault > more expensive (since we have to decompress the entire chunk, and > discard everything but the single page being loaded in), so it might > make a difference. > > Not super qualified to comment on zram changes otherwise - just a > casual observer to see if we can adopt this for zswap. zswap has the > added complexity of not supporting THP zswap in (until Usama's patch > series lands), and the presence of mixed backing states (due to zswap > writeback), increasing the likelihood of fallback :) Correct. As I mentioned to Usama[1], this could be a problem, and we are collecting data. The simplest approach to work around the issue is to fall back to four small folios instead of just one, which would prevent the need for three extra decompressions. [1] https://lore.kernel.org/linux-mm/CAGsJ_4yuZLOE0_yMOZj=KkRTyTotHw4g5g-t91W=MvS5zA4rYw@mail.gmail.com/ Thanks Barry
Barry Song <21cnbao@gmail.com> writes: > On Fri, Nov 8, 2024 at 6:23 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Hi, Barry, >> >> Barry Song <21cnbao@gmail.com> writes: >> >> > From: Barry Song <v-songbaohua@oppo.com> >> > >> > When large folios are compressed at a larger granularity, we observe >> > a notable reduction in CPU usage and a significant improvement in >> > compression ratios. >> > >> > mTHP's ability to be swapped out without splitting and swapped back in >> > as a whole allows compression and decompression at larger granularities. >> > >> > This patchset enhances zsmalloc and zram by adding support for dividing >> > large folios into multi-page blocks, typically configured with a >> > 2-order granularity. Without this patchset, a large folio is always >> > divided into `nr_pages` 4KiB blocks. >> > >> > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` >> > setting, where the default of 2 allows all anonymous THP to benefit. >> > >> > Examples include: >> > * A 16KiB large folio will be compressed and stored as a single 16KiB >> > block. >> > * A 64KiB large folio will be compressed and stored as four 16KiB >> > blocks. >> > >> > For example, swapping out and swapping in 100MiB of typical anonymous >> > data 100 times (with 16KB mTHP enabled) using zstd yields the following >> > results: >> > >> > w/o patches w/ patches >> > swap-out time(ms) 68711 49908 >> > swap-in time(ms) 30687 20685 >> > compression ratio 20.49% 16.9% >> >> The data looks good. Thanks! >> >> Have you considered the situation that the large folio fails to be >> allocated during swap-in? It's possible because the memory may be very >> fragmented. > > That's correct, good question. On phones, we use a large folio pool to maintain > a relatively high allocation success rate. When mTHP allocation fails, we have > a workaround to allocate nr_pages of small folios and map them together to > avoid partial reads. This ensures that the benefits of larger block compression > and decompression are consistently maintained. That was the code running > on production phones. > > We also previously experimented with maintaining multiple buffers for > decompressed > large blocks in zRAM, allowing upcoming do_swap_page() calls to use them when > falling back to small folios. In this setup, the buffers achieved a > high hit rate, though > I don’t recall the exact number. > > I'm concerned that this fault-around-like fallback to nr_pages small > folios may not > gain traction upstream. Do you have any suggestions for improvement? It appears that we still haven't a solution to guarantee 100% mTHP allocation success rate. If so, we need a fallback solution for that. Another possible solution is, 1) If failed to allocate mTHP with nr_pages, allocate nr_pages normal (4k) folios instead 2) Revise the decompression interface to accept a set of folios (instead of one folio) as target. Then, we can decompress to the normal folios allocated in 1). 3) in do_swap_page(), we can either map all folios or just the fault folios. We can put non-fault folios into swap cache if necessary. Does this work? >> >> > -v2: >> > While it is not mature yet, I know some people are waiting for >> > an update :-) >> > * Fixed some stability issues. >> > * rebase againest the latest mm-unstable. >> > * Set default order to 2 which benefits all anon mTHP. >> > * multipages ZsPageMovable is not supported yet. 
>> > >> > Tangquan Zheng (2): >> > mm: zsmalloc: support objects compressed based on multiple pages >> > zram: support compression at the granularity of multi-pages >> > >> > drivers/block/zram/Kconfig | 9 + >> > drivers/block/zram/zcomp.c | 17 +- >> > drivers/block/zram/zcomp.h | 12 +- >> > drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++--- >> > drivers/block/zram/zram_drv.h | 45 ++++ >> > include/linux/zsmalloc.h | 10 +- >> > mm/Kconfig | 18 ++ >> > mm/zsmalloc.c | 232 +++++++++++++----- >> > 8 files changed, 699 insertions(+), 94 deletions(-) >> -- Best Regards, Huang, Ying
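A minimal userspace model of steps 1) and 2) proposed above, with hypothetical names throughout (the real change would extend the zcomp/zsmalloc decompression interface to accept a set of folios): decompress the multi-page block once, then scatter the result into nr_pages discontiguous 4 KiB targets that stand in for the order-0 folios allocated after an mTHP allocation failure.

#include <string.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define NR_PAGES  4	/* one 16 KiB multi-page block, order 2 */

/* stand-in for the real zcomp backend (zstd/lzo/...) */
static int decompress_block(const void *src, size_t src_len,
			    void *dst, size_t dst_len)
{
	if (src_len > dst_len)
		return -1;
	memcpy(dst, src, src_len);	/* pretend decompression is a copy */
	return 0;
}

/*
 * Decompress the whole block once, then copy each 4 KiB chunk into a
 * separate target page -- standing in for nr_pages order-0 folios.
 */
static int decompress_to_pages(const void *src, size_t src_len,
			       void *pages[NR_PAGES])
{
	char buf[NR_PAGES * PAGE_SIZE];
	int ret = decompress_block(src, src_len, buf, sizeof(buf));

	if (ret)
		return ret;
	for (int i = 0; i < NR_PAGES; i++)
		memcpy(pages[i], buf + i * PAGE_SIZE, PAGE_SIZE);
	return 0;
}

int main(void)
{
	static char src[NR_PAGES * PAGE_SIZE];
	static char p0[PAGE_SIZE], p1[PAGE_SIZE], p2[PAGE_SIZE], p3[PAGE_SIZE];
	void *pages[NR_PAGES] = { p0, p1, p2, p3 };

	memset(src, 0xab, sizeof(src));
	if (!decompress_to_pages(src, sizeof(src), pages))
		printf("one decompression filled %d discontiguous pages\n",
		       NR_PAGES);
	return 0;
}

Decompressing once and scattering avoids the worst case discussed elsewhere in the thread, where a 16 KiB block is decompressed up to four times because each fault swaps in only a single small folio.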
On Tue, Nov 12, 2024 at 2:11 PM Huang, Ying <ying.huang@intel.com> wrote: > > Barry Song <21cnbao@gmail.com> writes: > > > On Fri, Nov 8, 2024 at 6:23 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Hi, Barry, > >> > >> Barry Song <21cnbao@gmail.com> writes: > >> > >> > From: Barry Song <v-songbaohua@oppo.com> > >> > > >> > When large folios are compressed at a larger granularity, we observe > >> > a notable reduction in CPU usage and a significant improvement in > >> > compression ratios. > >> > > >> > mTHP's ability to be swapped out without splitting and swapped back in > >> > as a whole allows compression and decompression at larger granularities. > >> > > >> > This patchset enhances zsmalloc and zram by adding support for dividing > >> > large folios into multi-page blocks, typically configured with a > >> > 2-order granularity. Without this patchset, a large folio is always > >> > divided into `nr_pages` 4KiB blocks. > >> > > >> > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > >> > setting, where the default of 2 allows all anonymous THP to benefit. > >> > > >> > Examples include: > >> > * A 16KiB large folio will be compressed and stored as a single 16KiB > >> > block. > >> > * A 64KiB large folio will be compressed and stored as four 16KiB > >> > blocks. > >> > > >> > For example, swapping out and swapping in 100MiB of typical anonymous > >> > data 100 times (with 16KB mTHP enabled) using zstd yields the following > >> > results: > >> > > >> > w/o patches w/ patches > >> > swap-out time(ms) 68711 49908 > >> > swap-in time(ms) 30687 20685 > >> > compression ratio 20.49% 16.9% > >> > >> The data looks good. Thanks! > >> > >> Have you considered the situation that the large folio fails to be > >> allocated during swap-in? It's possible because the memory may be very > >> fragmented. > > > > That's correct, good question. On phones, we use a large folio pool to maintain > > a relatively high allocation success rate. When mTHP allocation fails, we have > > a workaround to allocate nr_pages of small folios and map them together to > > avoid partial reads. This ensures that the benefits of larger block compression > > and decompression are consistently maintained. That was the code running > > on production phones. > > > > We also previously experimented with maintaining multiple buffers for > > decompressed > > large blocks in zRAM, allowing upcoming do_swap_page() calls to use them when > > falling back to small folios. In this setup, the buffers achieved a > > high hit rate, though > > I don’t recall the exact number. > > > > I'm concerned that this fault-around-like fallback to nr_pages small > > folios may not > > gain traction upstream. Do you have any suggestions for improvement? > > It appears that we still haven't a solution to guarantee 100% mTHP > allocation success rate. If so, we need a fallback solution for that. > > Another possible solution is, > > 1) If failed to allocate mTHP with nr_pages, allocate nr_pages normal (4k) > folios instead > > 2) Revise the decompression interface to accept a set of folios (instead > of one folio) as target. Then, we can decompress to the normal > folios allocated in 1). > > 3) in do_swap_page(), we can either map all folios or just the fault > folios. We can put non-fault folios into swap cache if necessary. > > Does this work? 
this is exactly what we did in production phones: [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/memory.c#L4656 [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/memory.c#L5439 I feel that we don't need to fall back to nr_pages (though that's what we used on phones); using a dedicated 4 should be sufficient, as if zsmalloc is handling compression and decompression of 16KB. However, we are not adding them to the swapcache; instead, they are mapped immediately. > > >> > >> > -v2: > >> > While it is not mature yet, I know some people are waiting for > >> > an update :-) > >> > * Fixed some stability issues. > >> > * rebase againest the latest mm-unstable. > >> > * Set default order to 2 which benefits all anon mTHP. > >> > * multipages ZsPageMovable is not supported yet. > >> > > >> > Tangquan Zheng (2): > >> > mm: zsmalloc: support objects compressed based on multiple pages > >> > zram: support compression at the granularity of multi-pages > >> > > >> > drivers/block/zram/Kconfig | 9 + > >> > drivers/block/zram/zcomp.c | 17 +- > >> > drivers/block/zram/zcomp.h | 12 +- > >> > drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++--- > >> > drivers/block/zram/zram_drv.h | 45 ++++ > >> > include/linux/zsmalloc.h | 10 +- > >> > mm/Kconfig | 18 ++ > >> > mm/zsmalloc.c | 232 +++++++++++++----- > >> > 8 files changed, 699 insertions(+), 94 deletions(-) > >> > > -- > Best Regards, > Huang, Ying Thanks barry
Barry Song <21cnbao@gmail.com> writes: > On Tue, Nov 12, 2024 at 2:11 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Barry Song <21cnbao@gmail.com> writes: >> >> > On Fri, Nov 8, 2024 at 6:23 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Hi, Barry, >> >> >> >> Barry Song <21cnbao@gmail.com> writes: >> >> >> >> > From: Barry Song <v-songbaohua@oppo.com> >> >> > >> >> > When large folios are compressed at a larger granularity, we observe >> >> > a notable reduction in CPU usage and a significant improvement in >> >> > compression ratios. >> >> > >> >> > mTHP's ability to be swapped out without splitting and swapped back in >> >> > as a whole allows compression and decompression at larger granularities. >> >> > >> >> > This patchset enhances zsmalloc and zram by adding support for dividing >> >> > large folios into multi-page blocks, typically configured with a >> >> > 2-order granularity. Without this patchset, a large folio is always >> >> > divided into `nr_pages` 4KiB blocks. >> >> > >> >> > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` >> >> > setting, where the default of 2 allows all anonymous THP to benefit. >> >> > >> >> > Examples include: >> >> > * A 16KiB large folio will be compressed and stored as a single 16KiB >> >> > block. >> >> > * A 64KiB large folio will be compressed and stored as four 16KiB >> >> > blocks. >> >> > >> >> > For example, swapping out and swapping in 100MiB of typical anonymous >> >> > data 100 times (with 16KB mTHP enabled) using zstd yields the following >> >> > results: >> >> > >> >> > w/o patches w/ patches >> >> > swap-out time(ms) 68711 49908 >> >> > swap-in time(ms) 30687 20685 >> >> > compression ratio 20.49% 16.9% >> >> >> >> The data looks good. Thanks! >> >> >> >> Have you considered the situation that the large folio fails to be >> >> allocated during swap-in? It's possible because the memory may be very >> >> fragmented. >> > >> > That's correct, good question. On phones, we use a large folio pool to maintain >> > a relatively high allocation success rate. When mTHP allocation fails, we have >> > a workaround to allocate nr_pages of small folios and map them together to >> > avoid partial reads. This ensures that the benefits of larger block compression >> > and decompression are consistently maintained. That was the code running >> > on production phones. >> > >> > We also previously experimented with maintaining multiple buffers for >> > decompressed >> > large blocks in zRAM, allowing upcoming do_swap_page() calls to use them when >> > falling back to small folios. In this setup, the buffers achieved a >> > high hit rate, though >> > I don’t recall the exact number. >> > >> > I'm concerned that this fault-around-like fallback to nr_pages small >> > folios may not >> > gain traction upstream. Do you have any suggestions for improvement? >> >> It appears that we still haven't a solution to guarantee 100% mTHP >> allocation success rate. If so, we need a fallback solution for that. >> >> Another possible solution is, >> >> 1) If failed to allocate mTHP with nr_pages, allocate nr_pages normal (4k) >> folios instead >> >> 2) Revise the decompression interface to accept a set of folios (instead >> of one folio) as target. Then, we can decompress to the normal >> folios allocated in 1). >> >> 3) in do_swap_page(), we can either map all folios or just the fault >> folios. We can put non-fault folios into swap cache if necessary. >> >> Does this work? 
> > this is exactly what we did in production phones: I think that this is upstreamable. > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/memory.c#L4656 > [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/memory.c#L5439 > > I feel that we don't need to fall back to nr_pages (though that's what > we used on phones); > using a dedicated 4 should be sufficient, as if zsmalloc is handling > compression and > decompression of 16KB. Yes. We only need the number of normal folios to make decompress work. > However, we are not adding them to the > swapcache; instead, > they are mapped immediately. I think that works. >> >> >> >> >> > -v2: >> >> > While it is not mature yet, I know some people are waiting for >> >> > an update :-) >> >> > * Fixed some stability issues. >> >> > * rebase againest the latest mm-unstable. >> >> > * Set default order to 2 which benefits all anon mTHP. >> >> > * multipages ZsPageMovable is not supported yet. >> >> > >> >> > Tangquan Zheng (2): >> >> > mm: zsmalloc: support objects compressed based on multiple pages >> >> > zram: support compression at the granularity of multi-pages >> >> > >> >> > drivers/block/zram/Kconfig | 9 + >> >> > drivers/block/zram/zcomp.c | 17 +- >> >> > drivers/block/zram/zcomp.h | 12 +- >> >> > drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++--- >> >> > drivers/block/zram/zram_drv.h | 45 ++++ >> >> > include/linux/zsmalloc.h | 10 +- >> >> > mm/Kconfig | 18 ++ >> >> > mm/zsmalloc.c | 232 +++++++++++++----- >> >> > 8 files changed, 699 insertions(+), 94 deletions(-) >> >> -- Best Regards, Huang, Ying
On (24/11/12 09:31), Barry Song wrote: [..] > > Do you have any data how this would perform with the upstream kernel, i.e. without > > a large folio pool and the workaround and if large granularity compression is worth having > > without those patches? > > I’d say large granularity compression isn’t a problem, but large > granularity decompression > could be. > > The worst case would be if we swap out a large block, such as 16KB, > but end up swapping in > 4 times due to allocation failures, falling back to smaller folios. In > this scenario, we would need > to perform three redundant decompressions. I will work with Tangquan > to provide this data this > week. Well, apart from that... I sort of don't know. This seems to be exclusively for swap case (or do file-systems use mTHP too?) and zram/zsmalloc don't really focus on one particular usage scenario, pretty much all of our features can be used regardless of what zram is backing up - be it a swap partition or a mounted fs. Another thing is that I don't see how to integrate these large objects support with post-processig: recompression and writeback. Well, recompression is okay-ish, I guess, but writeback is not. Writeback works in PAGE_SIZE units; we get that worst case scenario here. So, yeah, there are many questions. p.s. Sorry for late reply. I just started looking at the series and don't have any solid opinions yet.
On Tue, Nov 12, 2024 at 10:37 AM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Nov 12, 2024 at 8:30 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Thu, Nov 7, 2024 at 2:10 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > > When large folios are compressed at a larger granularity, we observe > > > a notable reduction in CPU usage and a significant improvement in > > > compression ratios. > > > > > > mTHP's ability to be swapped out without splitting and swapped back in > > > as a whole allows compression and decompression at larger granularities. > > > > > > This patchset enhances zsmalloc and zram by adding support for dividing > > > large folios into multi-page blocks, typically configured with a > > > 2-order granularity. Without this patchset, a large folio is always > > > divided into `nr_pages` 4KiB blocks. > > > > > > The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > > > setting, where the default of 2 allows all anonymous THP to benefit. > > > > > > Examples include: > > > * A 16KiB large folio will be compressed and stored as a single 16KiB > > > block. > > > * A 64KiB large folio will be compressed and stored as four 16KiB > > > blocks. > > > > > > For example, swapping out and swapping in 100MiB of typical anonymous > > > data 100 times (with 16KB mTHP enabled) using zstd yields the following > > > results: > > > > > > w/o patches w/ patches > > > swap-out time(ms) 68711 49908 > > > swap-in time(ms) 30687 20685 > > > compression ratio 20.49% 16.9% > > > > The data looks very promising :) My understanding is it also results > > in memory saving as well right? Since zstd operates better on bigger > > inputs. > > > > Is there any end-to-end benchmarking? My intuition is that this patch > > series overall will improve the situations, assuming we don't fallback > > to individual zero order page swapin too often, but it'd be nice if > > there is some data backing this intuition (especially with the > > upstream setup, i.e without any private patches). If the fallback > > scenario happens frequently, the patch series can make a page fault > > more expensive (since we have to decompress the entire chunk, and > > discard everything but the single page being loaded in), so it might > > make a difference. > > > > Not super qualified to comment on zram changes otherwise - just a > > casual observer to see if we can adopt this for zswap. zswap has the > > added complexity of not supporting THP zswap in (until Usama's patch > > series lands), and the presence of mixed backing states (due to zswap > > writeback), increasing the likelihood of fallback :) > > Correct. As I mentioned to Usama[1], this could be a problem, and we are > collecting data. The simplest approach to work around the issue is to fall > back to four small folios instead of just one, which would prevent the need > for three extra decompressions. > > [1] https://lore.kernel.org/linux-mm/CAGsJ_4yuZLOE0_yMOZj=KkRTyTotHw4g5g-t91W=MvS5zA4rYw@mail.gmail.com/ > Hi Nhat, Usama, Ying, I committed to providing data for cases where large folio allocation fails and swap-in falls back to swapping in small folios. Here is the data that Tangquan helped collect: * zstd, 100MB typical anon memory swapout+swapin 100times 1. 16kb mTHP swapout + 16kb mTHP swapin + w/o zsmalloc large block (de)compression swap-out(ms) 63151 swap-in(ms) 31551 2. 16kb mTHP swapout + 16kb mTHP swapin + w/ zsmalloc large block (de)compression swap-out(ms) 43925 swap-in(ms) 21763 3. 
16kb mTHP swapout + 100% fallback to small folios swap-in + w/ zsmalloc large block (de)compression swap-out(ms) 43423 swap-in(ms) 68660 Thus, "swap-in(ms) 68660," where mTHP allocation always fails, is significantly slower than "swap-in(ms) 21763," where mTHP allocation succeeds. If there are no objections, I could send a v3 patch to fall back to 4 small folios instead of one. However, this would significantly increase the complexity of do_swap_page(). My gut feeling is that the added complexity might not be well-received :-) Thanks Barry
On Mon, Nov 18, 2024 at 2:27 AM Barry Song <21cnbao@gmail.com> wrote: > Thanks for the data, Barry and Tangquan! > On Tue, Nov 12, 2024 at 10:37 AM Barry Song <21cnbao@gmail.com> wrote: > > Thus, "swap-in(ms) 68660," where mTHP allocation always fails, is significantly > slower than "swap-in(ms) 21763," where mTHP allocation succeeds. As well as the first scenario (the status quo) :( I guess it depends on how often we are seeing this degenerate case (i.e how often do we see (m)THP allocation failure?) > > If there are no objections, I could send a v3 patch to fall back to 4 > small folios > instead of one. However, this would significantly increase the complexity of > do_swap_page(). My gut feeling is that the added complexity might not be > well-received :-) Yeah I'm curious too. I'll wait for your numbers - the dynamics are completely unpredictable to me. OTOH, we'll be less wasteful in terms of CPU work (no longer have to decompress the same chunk multiple times). OTOH, we're creating more memory pressure (having to load the whole chunk in), without the THP benefits. I think this is an OK workaround for now. Increasing (m)THP allocation success rate would be the true fix, but that is a hard problem :) > > Thanks > Barry
On Mon, Nov 18, 2024 at 10:56 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote: > > On (24/11/12 09:31), Barry Song wrote: > [..] > > > Do you have any data how this would perform with the upstream kernel, i.e. without > > > a large folio pool and the workaround and if large granularity compression is worth having > > > without those patches? > > > > I’d say large granularity compression isn’t a problem, but large > > granularity decompression > > could be. > > > > The worst case would be if we swap out a large block, such as 16KB, > > but end up swapping in > > 4 times due to allocation failures, falling back to smaller folios. In > > this scenario, we would need > > to perform three redundant decompressions. I will work with Tangquan > > to provide this data this > > week. > > Well, apart from that... I sort of don't know. > > This seems to be exclusively for swap case (or do file-systems use > mTHP too?) and zram/zsmalloc don't really focus on one particular > usage scenario, pretty much all of our features can be used regardless > of what zram is backing up - be it a swap partition or a mounted fs. > Yes, some filesystems also support mTHP. A simple grep command can list them all: fs % git grep mapping_set_large_folios afs/inode.c: mapping_set_large_folios(inode->i_mapping); afs/inode.c: mapping_set_large_folios(inode->i_mapping); bcachefs/fs.c: mapping_set_large_folios(inode->v.i_mapping); erofs/inode.c: mapping_set_large_folios(inode->i_mapping); nfs/inode.c: mapping_set_large_folios(inode->i_mapping); smb/client/inode.c: mapping_set_large_folios(inode->i_mapping); zonefs/super.c: mapping_set_large_folios(inode->i_mapping); more filesystems might begin to support large mapping. In the current implementation, only size is considered when determining whether to apply large block compression: static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio) { u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT; if (bio->bi_io_vec->bv_len >= ZCOMP_MULTI_PAGES_SIZE) return true; ... } If we encounter too many corner cases with filesystems (such as excessive recompression or partial reads), we could also verify if the folio is anonymous to return true. For swap, we are working to get things under control. The challenging scenario that could lead to many partial reads arises when mTHP allocation fails during swap-in. In such cases, do_swap_page() will swap in only a single small folio, even after decompressing the entire 16KB. > Another thing is that I don't see how to integrate these large > objects support with post-processig: recompression and writeback. > Well, recompression is okay-ish, I guess, but writeback is not. > Writeback works in PAGE_SIZE units; we get that worst case scenario > here. So, yeah, there are many questions. For ZRAM writeback, my intuition is that we should write back the entire large block (4 * PAGE_SIZE) at once. If the large block is idle or marked as huge in ZRAM, it generally applies to the entire block. This isn't currently implemented, likely because writeback hasn't been enabled on our phones yet. > > p.s. Sorry for late reply. I just started looking at the series and > don't have any solid opinions yet. Thank you for starting to review the series. Your suggestions are greatly appreciated. Best Regards Barry
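If the anonymous-only restriction floated above were pursued, one possible shape is sketched below. This is not part of the patchset: the folio_test_anon() check is speculative, and the elided conditions of the quoted helper are dropped for brevity.

/* sketch only -- not the patchset's implementation */
static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
{
	struct folio *folio = page_folio(bio->bi_io_vec->bv_page);

	if (bio->bi_io_vec->bv_len < ZCOMP_MULTI_PAGES_SIZE)
		return false;

	/* restrict large-block compression to anonymous (swap) data */
	return folio_test_anon(folio);
}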
On 18/11/2024 02:27, Barry Song wrote: > On Tue, Nov 12, 2024 at 10:37 AM Barry Song <21cnbao@gmail.com> wrote: >> >> On Tue, Nov 12, 2024 at 8:30 AM Nhat Pham <nphamcs@gmail.com> wrote: >>> >>> On Thu, Nov 7, 2024 at 2:10 AM Barry Song <21cnbao@gmail.com> wrote: >>>> >>>> From: Barry Song <v-songbaohua@oppo.com> >>>> >>>> When large folios are compressed at a larger granularity, we observe >>>> a notable reduction in CPU usage and a significant improvement in >>>> compression ratios. >>>> >>>> mTHP's ability to be swapped out without splitting and swapped back in >>>> as a whole allows compression and decompression at larger granularities. >>>> >>>> This patchset enhances zsmalloc and zram by adding support for dividing >>>> large folios into multi-page blocks, typically configured with a >>>> 2-order granularity. Without this patchset, a large folio is always >>>> divided into `nr_pages` 4KiB blocks. >>>> >>>> The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` >>>> setting, where the default of 2 allows all anonymous THP to benefit. >>>> >>>> Examples include: >>>> * A 16KiB large folio will be compressed and stored as a single 16KiB >>>> block. >>>> * A 64KiB large folio will be compressed and stored as four 16KiB >>>> blocks. >>>> >>>> For example, swapping out and swapping in 100MiB of typical anonymous >>>> data 100 times (with 16KB mTHP enabled) using zstd yields the following >>>> results: >>>> >>>> w/o patches w/ patches >>>> swap-out time(ms) 68711 49908 >>>> swap-in time(ms) 30687 20685 >>>> compression ratio 20.49% 16.9% >>> >>> The data looks very promising :) My understanding is it also results >>> in memory saving as well right? Since zstd operates better on bigger >>> inputs. >>> >>> Is there any end-to-end benchmarking? My intuition is that this patch >>> series overall will improve the situations, assuming we don't fallback >>> to individual zero order page swapin too often, but it'd be nice if >>> there is some data backing this intuition (especially with the >>> upstream setup, i.e without any private patches). If the fallback >>> scenario happens frequently, the patch series can make a page fault >>> more expensive (since we have to decompress the entire chunk, and >>> discard everything but the single page being loaded in), so it might >>> make a difference. >>> >>> Not super qualified to comment on zram changes otherwise - just a >>> casual observer to see if we can adopt this for zswap. zswap has the >>> added complexity of not supporting THP zswap in (until Usama's patch >>> series lands), and the presence of mixed backing states (due to zswap >>> writeback), increasing the likelihood of fallback :) >> >> Correct. As I mentioned to Usama[1], this could be a problem, and we are >> collecting data. The simplest approach to work around the issue is to fall >> back to four small folios instead of just one, which would prevent the need >> for three extra decompressions. >> >> [1] https://lore.kernel.org/linux-mm/CAGsJ_4yuZLOE0_yMOZj=KkRTyTotHw4g5g-t91W=MvS5zA4rYw@mail.gmail.com/ >> > > Hi Nhat, Usama, Ying, > > I committed to providing data for cases where large folio allocation fails and > swap-in falls back to swapping in small folios. Here is the data that Tangquan > helped collect: > > * zstd, 100MB typical anon memory swapout+swapin 100times > > 1. 16kb mTHP swapout + 16kb mTHP swapin + w/o zsmalloc large block > (de)compression > swap-out(ms) 63151 > swap-in(ms) 31551 > 2. 
16kb mTHP swapout + 16kb mTHP swapin + w/ zsmalloc large block > (de)compression > swap-out(ms) 43925 > swap-in(ms) 21763 > 3. 16kb mTHP swapout + 100% fallback to small folios swap-in + w/ > zsmalloc large block (de)compression > swap-out(ms) 43423 > swap-in(ms) 68660 > Hi Barry, Thanks for the numbers! In what condition was it falling back to small folios. Did you just added a hack in alloc_swap_folio to just jump to fallback? or was it due to cgroup limited memory pressure? Would it be good to test with something like kernel build test (or something else that causes swap thrashing) to see if the regression worsens with large granularity decompression? i.e. would be good to have numbers for real world applications. > Thus, "swap-in(ms) 68660," where mTHP allocation always fails, is significantly > slower than "swap-in(ms) 21763," where mTHP allocation succeeds. > > If there are no objections, I could send a v3 patch to fall back to 4 > small folios > instead of one. However, this would significantly increase the complexity of > do_swap_page(). My gut feeling is that the added complexity might not be > well-received :-) > If there is space for 4 small folios, then maybe it might be worth passing __GFP_DIRECT_RECLAIM? as that can trigger compaction and give a large folio. Thanks, Usama > Thanks > Barry
On Tue, Nov 19, 2024 at 9:29 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 18/11/2024 02:27, Barry Song wrote: > > On Tue, Nov 12, 2024 at 10:37 AM Barry Song <21cnbao@gmail.com> wrote: > >> > >> On Tue, Nov 12, 2024 at 8:30 AM Nhat Pham <nphamcs@gmail.com> wrote: > >>> > >>> On Thu, Nov 7, 2024 at 2:10 AM Barry Song <21cnbao@gmail.com> wrote: > >>>> > >>>> From: Barry Song <v-songbaohua@oppo.com> > >>>> > >>>> When large folios are compressed at a larger granularity, we observe > >>>> a notable reduction in CPU usage and a significant improvement in > >>>> compression ratios. > >>>> > >>>> mTHP's ability to be swapped out without splitting and swapped back in > >>>> as a whole allows compression and decompression at larger granularities. > >>>> > >>>> This patchset enhances zsmalloc and zram by adding support for dividing > >>>> large folios into multi-page blocks, typically configured with a > >>>> 2-order granularity. Without this patchset, a large folio is always > >>>> divided into `nr_pages` 4KiB blocks. > >>>> > >>>> The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > >>>> setting, where the default of 2 allows all anonymous THP to benefit. > >>>> > >>>> Examples include: > >>>> * A 16KiB large folio will be compressed and stored as a single 16KiB > >>>> block. > >>>> * A 64KiB large folio will be compressed and stored as four 16KiB > >>>> blocks. > >>>> > >>>> For example, swapping out and swapping in 100MiB of typical anonymous > >>>> data 100 times (with 16KB mTHP enabled) using zstd yields the following > >>>> results: > >>>> > >>>> w/o patches w/ patches > >>>> swap-out time(ms) 68711 49908 > >>>> swap-in time(ms) 30687 20685 > >>>> compression ratio 20.49% 16.9% > >>> > >>> The data looks very promising :) My understanding is it also results > >>> in memory saving as well right? Since zstd operates better on bigger > >>> inputs. > >>> > >>> Is there any end-to-end benchmarking? My intuition is that this patch > >>> series overall will improve the situations, assuming we don't fallback > >>> to individual zero order page swapin too often, but it'd be nice if > >>> there is some data backing this intuition (especially with the > >>> upstream setup, i.e without any private patches). If the fallback > >>> scenario happens frequently, the patch series can make a page fault > >>> more expensive (since we have to decompress the entire chunk, and > >>> discard everything but the single page being loaded in), so it might > >>> make a difference. > >>> > >>> Not super qualified to comment on zram changes otherwise - just a > >>> casual observer to see if we can adopt this for zswap. zswap has the > >>> added complexity of not supporting THP zswap in (until Usama's patch > >>> series lands), and the presence of mixed backing states (due to zswap > >>> writeback), increasing the likelihood of fallback :) > >> > >> Correct. As I mentioned to Usama[1], this could be a problem, and we are > >> collecting data. The simplest approach to work around the issue is to fall > >> back to four small folios instead of just one, which would prevent the need > >> for three extra decompressions. > >> > >> [1] https://lore.kernel.org/linux-mm/CAGsJ_4yuZLOE0_yMOZj=KkRTyTotHw4g5g-t91W=MvS5zA4rYw@mail.gmail.com/ > >> > > > > Hi Nhat, Usama, Ying, > > > > I committed to providing data for cases where large folio allocation fails and > > swap-in falls back to swapping in small folios. 
Here is the data that Tangquan > > helped collect: > > > > * zstd, 100MB typical anon memory swapout+swapin 100times > > > > 1. 16kb mTHP swapout + 16kb mTHP swapin + w/o zsmalloc large block > > (de)compression > > swap-out(ms) 63151 > > swap-in(ms) 31551 > > 2. 16kb mTHP swapout + 16kb mTHP swapin + w/ zsmalloc large block > > (de)compression > > swap-out(ms) 43925 > > swap-in(ms) 21763 > > 3. 16kb mTHP swapout + 100% fallback to small folios swap-in + w/ > > zsmalloc large block (de)compression > > swap-out(ms) 43423 > > swap-in(ms) 68660 > > > > Hi Barry, > > Thanks for the numbers! > > In what condition was it falling back to small folios. Did you just added a hack > in alloc_swap_folio to just jump to fallback? or was it due to cgroup limited memory > pressure? In real scenarios, even without memcg, fallbacks mainly occur due to memory fragmentation, which prevents the allocation of mTHP (contiguous pages) from the buddy system. While cgroup memory pressure isn't the primary issue here, it can also contribute to fallbacks. Note that this fallback occurs universally for both do_anonymous_page() and filesystem mTHP. > > Would it be good to test with something like kernel build test (or something else that > causes swap thrashing) to see if the regression worsens with large granularity decompression? > i.e. would be good to have numbers for real world applications. I’m confident that the data will be reliable as long as memory isn’t fragmented, but fragmentation depends on when the case is run. For example, on a fresh system, memory is not fragmented at all, but after running various workloads for a few hours, serious fragmentation may occur. I recall reporting that a phone using 64KB mTHP had a high mTHP allocation success rate in the first hour, but this dropped to less than 10% after a few hours of use. In my understanding, the performance of mTHP can vary significantly depending on the system's fragmentation state. This is why efforts like Yu Zhao's TAO are being developed to address the mTHP allocation success rate issue. > > > Thus, "swap-in(ms) 68660," where mTHP allocation always fails, is significantly > > slower than "swap-in(ms) 21763," where mTHP allocation succeeds. > > > > If there are no objections, I could send a v3 patch to fall back to 4 > > small folios > > instead of one. However, this would significantly increase the complexity of > > do_swap_page(). My gut feeling is that the added complexity might not be > > well-received :-) > > > > If there is space for 4 small folios, then maybe it might be worth passing > __GFP_DIRECT_RECLAIM? as that can trigger compaction and give a large folio. > Small folios are always much *easier* to obtain from the system. Triggering compaction won't necessarily yield a large folio if unmovable small folios are scattered. For small folios, reclamation is already the case for memcg. as a small folio is charged by GFP_KERNEL as it was before. static struct folio *__alloc_swap_folio(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct folio *folio; swp_entry_t entry; folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address); if (!folio) return NULL; entry = pte_to_swp_entry(vmf->orig_pte); if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)) { folio_put(folio); return NULL; } return folio; } Thanks Barry
On Tue, Nov 19, 2024 at 9:51 AM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Nov 19, 2024 at 9:29 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > > > > > On 18/11/2024 02:27, Barry Song wrote: > > > On Tue, Nov 12, 2024 at 10:37 AM Barry Song <21cnbao@gmail.com> wrote: > > >> > > >> On Tue, Nov 12, 2024 at 8:30 AM Nhat Pham <nphamcs@gmail.com> wrote: > > >>> > > >>> On Thu, Nov 7, 2024 at 2:10 AM Barry Song <21cnbao@gmail.com> wrote: > > >>>> > > >>>> From: Barry Song <v-songbaohua@oppo.com> > > >>>> > > >>>> When large folios are compressed at a larger granularity, we observe > > >>>> a notable reduction in CPU usage and a significant improvement in > > >>>> compression ratios. > > >>>> > > >>>> mTHP's ability to be swapped out without splitting and swapped back in > > >>>> as a whole allows compression and decompression at larger granularities. > > >>>> > > >>>> This patchset enhances zsmalloc and zram by adding support for dividing > > >>>> large folios into multi-page blocks, typically configured with a > > >>>> 2-order granularity. Without this patchset, a large folio is always > > >>>> divided into `nr_pages` 4KiB blocks. > > >>>> > > >>>> The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER` > > >>>> setting, where the default of 2 allows all anonymous THP to benefit. > > >>>> > > >>>> Examples include: > > >>>> * A 16KiB large folio will be compressed and stored as a single 16KiB > > >>>> block. > > >>>> * A 64KiB large folio will be compressed and stored as four 16KiB > > >>>> blocks. > > >>>> > > >>>> For example, swapping out and swapping in 100MiB of typical anonymous > > >>>> data 100 times (with 16KB mTHP enabled) using zstd yields the following > > >>>> results: > > >>>> > > >>>> w/o patches w/ patches > > >>>> swap-out time(ms) 68711 49908 > > >>>> swap-in time(ms) 30687 20685 > > >>>> compression ratio 20.49% 16.9% > > >>> > > >>> The data looks very promising :) My understanding is it also results > > >>> in memory saving as well right? Since zstd operates better on bigger > > >>> inputs. > > >>> > > >>> Is there any end-to-end benchmarking? My intuition is that this patch > > >>> series overall will improve the situations, assuming we don't fallback > > >>> to individual zero order page swapin too often, but it'd be nice if > > >>> there is some data backing this intuition (especially with the > > >>> upstream setup, i.e without any private patches). If the fallback > > >>> scenario happens frequently, the patch series can make a page fault > > >>> more expensive (since we have to decompress the entire chunk, and > > >>> discard everything but the single page being loaded in), so it might > > >>> make a difference. > > >>> > > >>> Not super qualified to comment on zram changes otherwise - just a > > >>> casual observer to see if we can adopt this for zswap. zswap has the > > >>> added complexity of not supporting THP zswap in (until Usama's patch > > >>> series lands), and the presence of mixed backing states (due to zswap > > >>> writeback), increasing the likelihood of fallback :) > > >> > > >> Correct. As I mentioned to Usama[1], this could be a problem, and we are > > >> collecting data. The simplest approach to work around the issue is to fall > > >> back to four small folios instead of just one, which would prevent the need > > >> for three extra decompressions. 
> > >> > > >> [1] https://lore.kernel.org/linux-mm/CAGsJ_4yuZLOE0_yMOZj=KkRTyTotHw4g5g-t91W=MvS5zA4rYw@mail.gmail.com/ > > >> > > > > > > Hi Nhat, Usama, Ying, > > > > > > I committed to providing data for cases where large folio allocation fails and > > > swap-in falls back to swapping in small folios. Here is the data that Tangquan > > > helped collect: > > > > > > * zstd, 100MB typical anon memory swapout+swapin 100times > > > > > > 1. 16kb mTHP swapout + 16kb mTHP swapin + w/o zsmalloc large block > > > (de)compression > > > swap-out(ms) 63151 > > > swap-in(ms) 31551 > > > 2. 16kb mTHP swapout + 16kb mTHP swapin + w/ zsmalloc large block > > > (de)compression > > > swap-out(ms) 43925 > > > swap-in(ms) 21763 > > > 3. 16kb mTHP swapout + 100% fallback to small folios swap-in + w/ > > > zsmalloc large block (de)compression > > > swap-out(ms) 43423 > > > swap-in(ms) 68660 > > > > > > > > Hi Barry, > > > > Thanks for the numbers! > > > > In what condition was it falling back to small folios. Did you just added a hack > > in alloc_swap_folio to just jump to fallback? or was it due to cgroup limited memory > > pressure? Usama, I realize you might have been asking how test 3 was done. Yes, it was a simple hack that forces a 100% fallback to small folios. > > In real scenarios, even without memcg, fallbacks mainly occur due to memory > fragmentation, which prevents the allocation of mTHP (contiguous pages) from > the buddy system. While cgroup memory pressure isn't the primary issue here, > it can also contribute to fallbacks. > > Note that this fallback occurs universally for both do_anonymous_page() and > filesystem mTHP. > > > > > > Would it be good to test with something like kernel build test (or something else that > > causes swap thrashing) to see if the regression worsens with large granularity decompression? > > i.e. would be good to have numbers for real world applications. > > I'm confident that the data will be reliable as long as memory isn't fragmented, > but fragmentation depends on when the test is run. For example, on a fresh > system, memory is not fragmented at all, but after running various workloads > for a few hours, serious fragmentation may occur. > > I recall reporting that a phone using 64KB mTHP had a high mTHP allocation > success rate in the first hour, but this dropped to less than 10% after a few > hours of use. > > In my understanding, the performance of mTHP can vary significantly depending > on the system's fragmentation state. This is why efforts like Yu Zhao's TAO are > being developed to address the mTHP allocation success rate issue. > > > > > > Thus, "swap-in(ms) 68660," where mTHP allocation always fails, is significantly > > > slower than "swap-in(ms) 21763," where mTHP allocation succeeds. > > > > > > If there are no objections, I could send a v3 patch to fall back to 4 > > > small folios > > > instead of one. However, this would significantly increase the complexity of > > > do_swap_page(). My gut feeling is that the added complexity might not be > > > well-received :-) > > > > > > > If there is space for 4 small folios, then maybe it might be worth passing > > __GFP_DIRECT_RECLAIM? as that can trigger compaction and give a large folio. > > > > Small folios are always much *easier* to obtain from the system. > Triggering compaction > won't necessarily yield a large folio if unmovable small folios are scattered. > > For small folios, memcg reclaim already applies, since a small folio > is charged with GFP_KERNEL just as it was before.
> > static struct folio *__alloc_swap_folio(struct vm_fault *vmf) > { > struct vm_area_struct *vma = vmf->vma; > struct folio *folio; > swp_entry_t entry; > > folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address); > if (!folio) > return NULL; > > entry = pte_to_swp_entry(vmf->orig_pte); > if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > GFP_KERNEL, entry)) { > folio_put(folio); > return NULL; > } > > return folio; > } > > Thanks > Barry
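A minimal sketch of the fallback being discussed, reusing only the calls visible in the quoted __alloc_swap_folio() snippet; the explicit order parameter, the function name, and the comment about mapping nr_pages small folios together are assumptions for illustration, not code from the posted patches:

/*
 * Sketch only: try the multi-page order first (e.g. order == 2 for 16KiB),
 * then fall back to a single order-0 folio when contiguous pages cannot be
 * allocated. A real implementation would likely derive gfp flags from the
 * THP helpers rather than using GFP_HIGHUSER_MOVABLE for the large order.
 */
static struct folio *alloc_swap_folio_sketch(struct vm_fault *vmf, int order)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;
	swp_entry_t entry;

	/* Try a large folio so the whole multi-page block is decompressed once. */
	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma, vmf->address);
	if (!folio) {
		/*
		 * Fragmented memory: fall back to one order-0 folio. The
		 * alternative raised in this thread - allocating nr_pages
		 * order-0 folios and mapping them together so the large
		 * block is decompressed only once - would need additional
		 * do_swap_page() changes and is not shown here.
		 */
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address);
		if (!folio)
			return NULL;
	}

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)) {
		folio_put(folio);
		return NULL;
	}

	return folio;
}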
On (24/11/19 09:27), Barry Song wrote: > On Mon, Nov 18, 2024 at 10:56 PM Sergey Senozhatsky > <senozhatsky@chromium.org> wrote: > > > > On (24/11/12 09:31), Barry Song wrote: > > [..] > Yes, some filesystems also support mTHP. A simple grep > command can list them all: > > fs % git grep mapping_set_large_folios > afs/inode.c: mapping_set_large_folios(inode->i_mapping); > afs/inode.c: mapping_set_large_folios(inode->i_mapping); > bcachefs/fs.c: mapping_set_large_folios(inode->v.i_mapping); > erofs/inode.c: mapping_set_large_folios(inode->i_mapping); > nfs/inode.c: mapping_set_large_folios(inode->i_mapping); > smb/client/inode.c: mapping_set_large_folios(inode->i_mapping); > zonefs/super.c: mapping_set_large_folios(inode->i_mapping); Yeah, those are mostly not on-disk file systems, or not filesystems that people use en masse for r/w I/O workloads (e.g. vfat, ext4, etc.)
On Tue, Nov 19, 2024 at 3:45 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote: > > On (24/11/19 09:27), Barry Song wrote: > > On Mon, Nov 18, 2024 at 10:56 PM Sergey Senozhatsky > > <senozhatsky@chromium.org> wrote: > > > > > > On (24/11/12 09:31), Barry Song wrote: > > > [..] > > Yes, some filesystems also support mTHP. A simple grep > > command can list them all: > > > > fs % git grep mapping_set_large_folios > > afs/inode.c: mapping_set_large_folios(inode->i_mapping); > > afs/inode.c: mapping_set_large_folios(inode->i_mapping); > > bcachefs/fs.c: mapping_set_large_folios(inode->v.i_mapping); > > erofs/inode.c: mapping_set_large_folios(inode->i_mapping); > > nfs/inode.c: mapping_set_large_folios(inode->i_mapping); > > smb/client/inode.c: mapping_set_large_folios(inode->i_mapping); > > zonefs/super.c: mapping_set_large_folios(inode->i_mapping); > > Yeah, those are mostly not on-disk file systems, or not filesystems > that people use en masse for r/w I/O workloads (e.g. vfat, ext4, etc.) There is work to bring up ext4 large folios though :-) https://lore.kernel.org/linux-fsdevel/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
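For context on the grep output above, those call sites all do the same thing: a filesystem opts its page cache into large folios when it sets up an inode's mapping. The function below is a placeholder name for illustration; only the mapping_set_large_folios() call itself comes from the listed filesystems:

/*
 * Sketch: how a filesystem enables large folios for an inode's page cache,
 * mirroring the call sites in the grep output above. example_fs_setup_inode()
 * is a hypothetical helper, not a real filesystem function.
 */
static void example_fs_setup_inode(struct inode *inode)
{
	/* Allow folios larger than order-0 in this inode's page cache. */
	mapping_set_large_folios(inode->i_mapping);
}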
From: Barry Song <v-songbaohua@oppo.com>

When large folios are compressed at a larger granularity, we observe
a notable reduction in CPU usage and a significant improvement in
compression ratios.

mTHP's ability to be swapped out without splitting and swapped back in
as a whole allows compression and decompression at larger granularities.

This patchset enhances zsmalloc and zram by adding support for dividing
large folios into multi-page blocks, typically configured with a
2-order granularity. Without this patchset, a large folio is always
divided into `nr_pages` 4KiB blocks.

The granularity can be set using the `ZSMALLOC_MULTI_PAGES_ORDER`
setting, where the default of 2 allows all anonymous THP to benefit.

Examples include:
* A 16KiB large folio will be compressed and stored as a single 16KiB
  block.
* A 64KiB large folio will be compressed and stored as four 16KiB
  blocks.

For example, swapping out and swapping in 100MiB of typical anonymous
data 100 times (with 16KiB mTHP enabled) using zstd yields the following
results:

                      w/o patches    w/ patches
 swap-out time(ms)    68711          49908
 swap-in time(ms)     30687          20685
 compression ratio    20.49%         16.9%

-v2:
While it is not mature yet, I know some people are waiting for
an update :-)
* Fixed some stability issues.
* Rebased against the latest mm-unstable.
* Set the default order to 2, which benefits all anon mTHP.
* Multi-page ZsPageMovable is not supported yet.

Tangquan Zheng (2):
  mm: zsmalloc: support objects compressed based on multiple pages
  zram: support compression at the granularity of multi-pages

 drivers/block/zram/Kconfig    |   9 +
 drivers/block/zram/zcomp.c    |  17 +-
 drivers/block/zram/zcomp.h    |  12 +-
 drivers/block/zram/zram_drv.c | 450 +++++++++++++++++++++++++++++++---
 drivers/block/zram/zram_drv.h |  45 ++++
 include/linux/zsmalloc.h      |  10 +-
 mm/Kconfig                    |  18 ++
 mm/zsmalloc.c                 | 232 +++++++++++++-----
 8 files changed, 699 insertions(+), 94 deletions(-)
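To make the block-split rule above concrete, here is a small illustrative helper; MULTI_PAGES_ORDER, MULTI_PAGES_SIZE, and nr_compress_blocks() are names made up for this sketch, standing in for whatever the patchset derives from the ZSMALLOC_MULTI_PAGES_ORDER setting:

/*
 * Illustration only: with the default order of 2 and 4KiB pages, one
 * multi-page block is 16KiB, so a 16KiB folio becomes one block and a
 * 64KiB folio becomes four blocks. These names are not the patchset's
 * actual symbols.
 */
#define MULTI_PAGES_ORDER	2
#define MULTI_PAGES_SIZE	(PAGE_SIZE << MULTI_PAGES_ORDER)	/* 16KiB */

static inline unsigned int nr_compress_blocks(struct folio *folio)
{
	/* Folios smaller than one block keep the existing 4KiB behaviour. */
	if (folio_size(folio) < MULTI_PAGES_SIZE)
		return folio_nr_pages(folio);

	/* 16KiB folio -> one block, 64KiB folio -> four blocks. */
	return folio_size(folio) / MULTI_PAGES_SIZE;
}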