[0/2] Improve Zram by separating compression context from kswapd

Message ID: 20250307120141.1566673-1-qun-wei.lin@mediatek.com
From: Qun-Wei Lin <qun-wei.lin@mediatek.com>
To: Jens Axboe <axboe@kernel.dk>, Minchan Kim <minchan@kernel.org>,
 Sergey Senozhatsky <senozhatsky@chromium.org>,
 Vishal Verma <vishal.l.verma@intel.com>,
 Dan Williams <dan.j.williams@intel.com>, Dave Jiang <dave.jiang@intel.com>,
 Ira Weiny <ira.weiny@intel.com>, Andrew Morton <akpm@linux-foundation.org>,
 Matthias Brugger <matthias.bgg@gmail.com>,
 AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>,
 Chris Li <chrisl@kernel.org>, Ryan Roberts <ryan.roberts@arm.com>,
 "Huang, Ying" <ying.huang@intel.com>, Kairui Song <kasong@tencent.com>,
 Dan Schatzberg <schatzberg.dan@gmail.com>, Barry Song <baohua@kernel.org>,
 Al Viro <viro@zeniv.linux.org.uk>
CC: <linux-kernel@vger.kernel.org>, <linux-block@vger.kernel.org>,
 <nvdimm@lists.linux.dev>, <linux-mm@kvack.org>,
 <linux-arm-kernel@lists.infradead.org>,
 <linux-mediatek@lists.infradead.org>, Casper Li <casper.li@mediatek.com>,
 Chinwen Chang <chinwen.chang@mediatek.com>,
 Andrew Yang <andrew.yang@mediatek.com>, James Hsu <james.hsu@mediatek.com>,
 Qun-Wei Lin <qun-wei.lin@mediatek.com>
Subject: [PATCH 0/2] Improve Zram by separating compression context from kswapd
Date: Fri, 7 Mar 2025 20:01:02 +0800
Series: Improve Zram by separating compression context from kswapd
  [0/2] Improve Zram by separating compression context from kswapd
  [1/2] mm: Split BLK_FEAT_SYNCHRONOUS and SWP_SYNCHRONOUS_IO into separate read and write flags
  [2/2] kcompressd: Add Kcompressd for accelerated zram compression

Qun-wei Lin (林群崴) March 7, 2025, 12:01 p.m. UTC

This patch series introduces a new mechanism called kcompressd to
improve the efficiency of memory reclaiming in the operating system. The
main goal is to separate the tasks of page scanning and page compression
into distinct processes or threads, thereby reducing the load on the
kswapd thread and enhancing overall system performance under high memory
pressure conditions.

Problem:
 In the current system, the kswapd thread is responsible for both
 scanning the LRU pages and compressing pages into the ZRAM. This
 combined responsibility can lead to significant performance bottlenecks,
 especially under high memory pressure. The kswapd thread becomes a
 single point of contention, causing delays in memory reclaiming and
 overall system performance degradation.

Target:
 The target of this invention is to improve the efficiency of memory
 reclaiming. By separating the tasks of page scanning and page
 compression into distinct processes or threads, the system can handle
 memory pressure more effectively.

Patch 1:
- Introduces 2 new feature flags, BLK_FEAT_READ_SYNCHRONOUS and
  SWP_READ_SYNCHRONOUS_IO.

Patch 2:
- Implemented the core functionality of Kcompressd and made necessary
  modifications to the zram driver to support it.

In our handheld devices, we found that applying this mechanism under high
memory pressure scenarios can increase the rate of pgsteal_anon per second
by over 260% compared to the situation with only kswapd.

Qun-Wei Lin (2):
  mm: Split BLK_FEAT_SYNCHRONOUS and SWP_SYNCHRONOUS_IO into separate
    read and write flags
  kcompressd: Add Kcompressd for accelerated zram compression

 drivers/block/brd.c             |   3 +-
 drivers/block/zram/Kconfig      |  11 ++
 drivers/block/zram/Makefile     |   3 +-
 drivers/block/zram/kcompressd.c | 340 ++++++++++++++++++++++++++++++++
 drivers/block/zram/kcompressd.h |  25 +++
 drivers/block/zram/zram_drv.c   |  21 +-
 drivers/nvdimm/btt.c            |   3 +-
 drivers/nvdimm/pmem.c           |   5 +-
 include/linux/blkdev.h          |  24 ++-
 include/linux/swap.h            |  31 +--
 mm/memory.c                     |   4 +-
 mm/page_io.c                    |   6 +-
 mm/swapfile.c                   |   7 +-
 13 files changed, 446 insertions(+), 37 deletions(-)
 create mode 100644 drivers/block/zram/kcompressd.c
 create mode 100644 drivers/block/zram/kcompressd.h

Barry Song March 7, 2025, 7:34 p.m. UTC | #1

On Sat, Mar 8, 2025 at 1:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote:
>
> This patch series introduces a new mechanism called kcompressd to
> improve the efficiency of memory reclaiming in the operating system. The
> main goal is to separate the tasks of page scanning and page compression
> into distinct processes or threads, thereby reducing the load on the
> kswapd thread and enhancing overall system performance under high memory
> pressure conditions.
>
> Problem:
>  In the current system, the kswapd thread is responsible for both
>  scanning the LRU pages and compressing pages into the ZRAM. This
>  combined responsibility can lead to significant performance bottlenecks,
>  especially under high memory pressure. The kswapd thread becomes a
>  single point of contention, causing delays in memory reclaiming and
>  overall system performance degradation.
>
> Target:
>  The target of this invention is to improve the efficiency of memory
>  reclaiming. By separating the tasks of page scanning and page
>  compression into distinct processes or threads, the system can handle
>  memory pressure more effectively.

Sounds great. However, we also have a time window where folios under
writeback are kept, whereas previously, writeback was done synchronously
without your patch. This may temporarily increase memory usage until the
kept folios are re-scanned.

So, you’ve observed that folio_rotate_reclaimable() runs shortly while the
async thread completes compression? Then the kept folios are shortly
re-scanned?

>
> Patch 1:
> - Introduces 2 new feature flags, BLK_FEAT_READ_SYNCHRONOUS and
>   SWP_READ_SYNCHRONOUS_IO.
>
> Patch 2:
> - Implemented the core functionality of Kcompressd and made necessary
>   modifications to the zram driver to support it.
>
> In our handheld devices, we found that applying this mechanism under high
> memory pressure scenarios can increase the rate of pgsteal_anon per second
> by over 260% compared to the situation with only kswapd.

Sounds really great.

What compression algorithm is being used? I assume that after switching to a
different compression algorithm, the benefits will change significantly. For
example, Zstd might not show as much improvement.
How was the CPU usage ratio between page scan/unmap and compression
observed before applying this patch?


Thanks
Barry

Nhat Pham March 7, 2025, 11:03 p.m. UTC | #2

On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote:
>
> This patch series introduces a new mechanism called kcompressd to
> improve the efficiency of memory reclaiming in the operating system. The
> main goal is to separate the tasks of page scanning and page compression
> into distinct processes or threads, thereby reducing the load on the
> kswapd thread and enhancing overall system performance under high memory
> pressure conditions.

Please excuse my ignorance, but from your cover letter I still don't
quite get what the problem is here. And how would decoupling compression
and scanning help?

>
> Problem:
>  In the current system, the kswapd thread is responsible for both
>  scanning the LRU pages and compressing pages into the ZRAM. This
>  combined responsibility can lead to significant performance bottlenecks,

What bottleneck are we talking about? Is one stage slower than the other?

>  especially under high memory pressure. The kswapd thread becomes a
>  single point of contention, causing delays in memory reclaiming and
>  overall system performance degradation.
>
> Target:
>  The target of this invention is to improve the efficiency of memory
>  reclaiming. By separating the tasks of page scanning and page
>  compression into distinct processes or threads, the system can handle
>  memory pressure more effectively.

I'm not a zram maintainer, so I'm definitely not trying to stop this
patch. But whatever problem zram is facing will likely occur with
zswap too, so I'd like to learn more :)

Barry Song March 8, 2025, 5:41 a.m. UTC | #3

On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote:
> >
> > This patch series introduces a new mechanism called kcompressd to
> > improve the efficiency of memory reclaiming in the operating system. The
> > main goal is to separate the tasks of page scanning and page compression
> > into distinct processes or threads, thereby reducing the load on the
> > kswapd thread and enhancing overall system performance under high memory
> > pressure conditions.
>
> Please excuse my ignorance, but from your cover letter I still don't
> quite get what the problem is here. And how would decoupling compression
> and scanning help?

My understanding is as follows:

When kswapd attempts to reclaim M anonymous folios and N file folios,
the process involves the following steps:

* t1: Time to scan and unmap anonymous folios
* t2: Time to compress anonymous folios
* t3: Time to reclaim file folios

Currently, these steps are executed sequentially, meaning the total time
required to reclaim M + N folios is t1 + t2 + t3.

However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel,
reducing the total time to max(t1 + t3, t2). This likely improves the
reclamation speed, potentially reducing allocation stalls.
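
To make the model above concrete, here is a minimal sketch in C. The
function names are hypothetical and the sample values are purely
illustrative, not measurements. It also assumes the two stages overlap
perfectly, with no re-scan or scheduling overhead.

```c
#include <assert.h>
#include <stdio.h>

/* t1: scan + unmap anon folios, t2: compress anon folios,
 * t3: reclaim file folios (same units for all three). */

/* Today: kswapd does everything back to back. */
static double reclaim_time_sequential(double t1, double t2, double t3)
{
    assert(t1 >= 0 && t2 >= 0 && t3 >= 0);
    return t1 + t2 + t3;
}

/* Idealized model with compression offloaded: the scan/file-reclaim
 * work (t1 + t3) overlaps with compression (t2), so the longer of
 * the two determines the total. */
static double reclaim_time_parallel(double t1, double t2, double t3)
{
    double scan = t1 + t3;

    assert(t1 >= 0 && t2 >= 0 && t3 >= 0);
    return scan > t2 ? scan : t2;
}
```

For example, with t1 = 3, t2 = 7 and t3 = 2 (arbitrary units), the
sequential model gives 12 and the parallel model gives 7.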

I don’t have concrete data on this. Does Qun-Wei have detailed
performance data?

>
> >
> > Problem:
> >  In the current system, the kswapd thread is responsible for both
> >  scanning the LRU pages and compressing pages into the ZRAM. This
> >  combined responsibility can lead to significant performance bottlenecks,
>
> What bottleneck are we talking about? Is one stage slower than the other?
>
> >  especially under high memory pressure. The kswapd thread becomes a
> >  single point of contention, causing delays in memory reclaiming and
> >  overall system performance degradation.
> >
> > Target:
> >  The target of this invention is to improve the efficiency of memory
> >  reclaiming. By separating the tasks of page scanning and page
> >  compression into distinct processes or threads, the system can handle
> >  memory pressure more effectively.
>
> I'm not a zram maintainer, so I'm definitely not trying to stop this
> patch. But whatever problem zram is facing will likely occur with
> zswap too, so I'd like to learn more :)

Right, this is likely something that could be addressed more generally
for zswap and zram.

Thanks
Barry

Qun-wei Lin (林群崴) March 10, 2025, 1:21 p.m. UTC | #4

On Sat, 2025-03-08 at 08:34 +1300, Barry Song wrote:
> On Sat, Mar 8, 2025 at 1:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com>
> wrote:
> > 
> > This patch series introduces a new mechanism called kcompressd to
> > improve the efficiency of memory reclaiming in the operating system. The
> > main goal is to separate the tasks of page scanning and page compression
> > into distinct processes or threads, thereby reducing the load on the
> > kswapd thread and enhancing overall system performance under high memory
> > pressure conditions.
> >
> > Problem:
> >  In the current system, the kswapd thread is responsible for both
> >  scanning the LRU pages and compressing pages into the ZRAM. This
> >  combined responsibility can lead to significant performance bottlenecks,
> >  especially under high memory pressure. The kswapd thread becomes a
> >  single point of contention, causing delays in memory reclaiming and
> >  overall system performance degradation.
> >
> > Target:
> >  The target of this invention is to improve the efficiency of memory
> >  reclaiming. By separating the tasks of page scanning and page
> >  compression into distinct processes or threads, the system can handle
> >  memory pressure more effectively.
> 
> Sounds great. However, we also have a time window where folios under
> writeback are kept, whereas previously, writeback was done synchronously
> without your patch. This may temporarily increase memory usage until the
> kept folios are re-scanned.
>
> So, you’ve observed that folio_rotate_reclaimable() runs shortly while the
> async thread completes compression? Then the kept folios are shortly
> re-scanned?
> 

Yes, these folios may need to be re-scanned, so
folio_rotate_reclaimable() will be run. This can be observed from the
increase in pgrotated in /proc/vmstat.

> > 
> > Patch 1:
> > - Introduces 2 new feature flags, BLK_FEAT_READ_SYNCHRONOUS and
> >   SWP_READ_SYNCHRONOUS_IO.
> > 
> > Patch 2:
> > - Implemented the core functionality of Kcompressd and made necessary
> >   modifications to the zram driver to support it.
> >
> > In our handheld devices, we found that applying this mechanism under high
> > memory pressure scenarios can increase the rate of pgsteal_anon per second
> > by over 260% compared to the situation with only kswapd.
>
> Sounds really great.
>
> What compression algorithm is being used? I assume that after switching to a
> different compression algorithm, the benefits will change significantly. For
> example, Zstd might not show as much improvement.
> How was the CPU usage ratio between page scan/unmap and compression
> observed before applying this patch?
> 

The original tests were based on LZ4.
We have observed that the ratio of CPU time spent scanning the LRU to
CPU time spent compressing folios is approximately 3:7.

We also tried ZSTD as the zram backend, but the number of anonymous
folios reclaimed per second did not differ significantly from LZ4 (the
benefits were far less compared to what could be achieved with parallel
processing). Even with ZSTD, we were still able to reach around 800,000
pgsteal_anon per second using kcompressd.



Best Regards,
Qun-wei

Qun-wei Lin (林群崴) March 10, 2025, 1:22 p.m. UTC | #5

On Sat, 2025-03-08 at 18:41 +1300, Barry Song wrote:
> On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > 
> > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin
> > <qun-wei.lin@mediatek.com> wrote:
> > > 
> > > This patch series introduces a new mechanism called kcompressd to
> > > improve the efficiency of memory reclaiming in the operating
> > > system. The
> > > main goal is to separate the tasks of page scanning and page
> > > compression
> > > into distinct processes or threads, thereby reducing the load on
> > > the
> > > kswapd thread and enhancing overall system performance under high
> > > memory
> > > pressure conditions.
> > 
> > > Please excuse my ignorance, but from your cover letter I still don't
> > > quite get what the problem is here. And how would decoupling compression
> > > and scanning help?
> 
> My understanding is as follows:
> 
> When kswapd attempts to reclaim M anonymous folios and N file folios,
> the process involves the following steps:
> 
> * t1: Time to scan and unmap anonymous folios
> * t2: Time to compress anonymous folios
> * t3: Time to reclaim file folios
> 
> Currently, these steps are executed sequentially, meaning the total
> time
> required to reclaim M + N folios is t1 + t2 + t3.
> 
> However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel,
> reducing the total time to max(t1 + t3, t2). This likely improves the
> reclamation speed, potentially reducing allocation stalls.
> 
> I don’t have concrete data on this. Does Qun-Wei have detailed
> performance data?
> 

Thank you for your explanation. Compared to the original single kswapd,
we expect t1 to have a slight increase in re-scan time. However, since
our kcompressd can focus on compression tasks and we can have multiple
kcompressd instances (kcompressd0, kcompressd1, ...) running in
parallel, we anticipate that a folio will not need to be re-scanned
too many times.

In our experiments, we fixed the CPU and DRAM at a certain frequency.
We created a high memory pressure environment using a memory eater and
recorded the increase in pgsteal_anon per second, which was around
300,000. Then we applied our patch and measured again: pgsteal_anon/s
increased to over 800,000.
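
For anyone wanting to reproduce this kind of measurement, the rate can
be derived from two samples of /proc/vmstat. Below is a minimal sketch
in C. The helper names vmstat_lookup() and vmstat_rate() are made up
for illustration and are not part of the patch series; the code only
assumes the usual "name value" per-line layout of /proc/vmstat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in a buffer holding /proc/vmstat-style text
 * ("name value\n" per line). Returns 0 and fills *val on success,
 * -1 if the counter is not present.
 */
static int vmstat_lookup(const char *buf, const char *name,
                         unsigned long long *val)
{
    size_t len = strlen(name);
    const char *p = buf;

    while (p && *p) {
        /* Require a space after the name so that "pgsteal" does not
         * match "pgsteal_anon". */
        if (strncmp(p, name, len) == 0 && p[len] == ' ')
            return sscanf(p + len, "%llu", val) == 1 ? 0 : -1;
        p = strchr(p, '\n');
        if (p)
            p++;    /* move to the start of the next line */
    }
    return -1;
}

/* Events per second between two samples taken interval_sec apart. */
static double vmstat_rate(unsigned long long before,
                          unsigned long long after, double interval_sec)
{
    assert(interval_sec > 0);
    return (double)(after - before) / interval_sec;
}
```

In practice one would read /proc/vmstat into a buffer, sleep for a
known interval, read it again, and pass the two pgsteal_anon values
(or pgrotated, for the re-scan side) to vmstat_rate().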

> > 
> > > 
> > > Problem:
> > >  In the current system, the kswapd thread is responsible for both
> > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > >  combined responsibility can lead to significant performance
> > > bottlenecks,
> > 
> > What bottleneck are we talking about? Is one stage slower than the
> > other?
> > 
> > >  especially under high memory pressure. The kswapd thread becomes
> > > a
> > >  single point of contention, causing delays in memory reclaiming
> > > and
> > >  overall system performance degradation.
> > > 
> > > Target:
> > >  The target of this invention is to improve the efficiency of
> > > memory
> > >  reclaiming. By separating the tasks of page scanning and page
> > >  compression into distinct processes or threads, the system can
> > > handle
> > >  memory pressure more effectively.
> > 
> > I'm not a zram maintainer, so I'm definitely not trying to stop
> > this
> > patch. But whatever problem zram is facing will likely occur with
> > zswap too, so I'd like to learn more :)
> 
> Right, this is likely something that could be addressed more
> generally
> for zswap and zram.
> 

Yes, we also hope to extend this to other swap devices, but currently,
we have only modified zram. We are not very familiar with zswap and
would like to ask if anyone has any suggestions for modifications?

> Thanks
> Barry

Best Regards,
Qun-wei

Nhat Pham March 10, 2025, 4:58 p.m. UTC | #6

On Mon, Mar 10, 2025 at 6:22 AM Qun-wei Lin (林群崴)
<Qun-wei.Lin@mediatek.com> wrote:
>
>
> Thank you for your explanation. Compared to the original single kswapd,
> we expect t1 to have a slight increase in re-scan time. However, since
> our kcompressd can focus on compression tasks and we can have multiple
> kcompressd instances (kcompressd0, kcompressd1, ...) running in
> parallel, we anticipate that a folio will not need to be re-scanned
> too many times.
>
> In our experiments, we fixed the CPU and DRAM at a certain frequency.
> We created a high memory pressure environment using a memory eater and
> recorded the increase in pgsteal_anon per second, which was around
> 300,000. Then we applied our patch and measured again: pgsteal_anon/s
> increased to over 800,000.
>
> > >
> > > >
> > > > Problem:
> > > >  In the current system, the kswapd thread is responsible for both
> > > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > > >  combined responsibility can lead to significant performance
> > > > bottlenecks,
> > >
> > > What bottleneck are we talking about? Is one stage slower than the
> > > other?
> > >
> > > >  especially under high memory pressure. The kswapd thread becomes
> > > > a
> > > >  single point of contention, causing delays in memory reclaiming
> > > > and
> > > >  overall system performance degradation.
> > > >
> > > > Target:
> > > >  The target of this invention is to improve the efficiency of
> > > > memory
> > > >  reclaiming. By separating the tasks of page scanning and page
> > > >  compression into distinct processes or threads, the system can
> > > > handle
> > > >  memory pressure more effectively.
> > >
> > > I'm not a zram maintainer, so I'm definitely not trying to stop
> > > this
> > > patch. But whatever problem zram is facing will likely occur with
> > > zswap too, so I'd like to learn more :)
> >
> > Right, this is likely something that could be addressed more
> > generally
> > for zswap and zram.
> >
>
> Yes, we also hope to extend this to other swap devices, but currently,
> we have only modified zram. We are not very familiar with zswap and
> would like to ask if anyone has any suggestions for modifications?
>

My understanding is that right now schedule_bio_write is the work
submission API, right? We can make it generic, having it accept a
callback and a generic untyped pointer which can be cast into a
backend-specific context struct. For zram it would contain struct zram
and the bio. For zswap, it depends on the point at which you want to
begin offloading the work: it could simply be the folio itself
if we offload early, or a more complicated scheme.
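
The generic interface described above could look roughly like the
userspace sketch below. All names here (kcompressd_req,
kcompressd_submit, kcompressd_drain, zram_write_ctx) are hypothetical
and do not come from the patch. The fixed-size ring, the "return -1
when full" behaviour and the integer stand-ins for struct zram and the
bio are assumptions made only to keep the example small; locking and
the worker thread itself are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Backend-supplied callback; ctx is a backend-specific context. */
typedef void (*kcompressd_fn_t)(void *ctx);

struct kcompressd_req {
    kcompressd_fn_t fn;
    void *ctx;
};

#define KCOMPRESSD_QUEUE_LEN 8

struct kcompressd_queue {
    struct kcompressd_req reqs[KCOMPRESSD_QUEUE_LEN];
    size_t head, tail, count;
};

/* Queue one request. Returns 0 on success, or -1 if the queue is
 * full, in which case the caller would do the work synchronously. */
static int kcompressd_submit(struct kcompressd_queue *q,
                             kcompressd_fn_t fn, void *ctx)
{
    assert(q != NULL && fn != NULL);
    if (q->count == KCOMPRESSD_QUEUE_LEN)
        return -1;
    q->reqs[q->tail] = (struct kcompressd_req){ .fn = fn, .ctx = ctx };
    q->tail = (q->tail + 1) % KCOMPRESSD_QUEUE_LEN;
    q->count++;
    return 0;
}

/* Worker side: run every queued callback, return how many ran. */
static size_t kcompressd_drain(struct kcompressd_queue *q)
{
    size_t done = 0;

    while (q->count) {
        struct kcompressd_req r = q->reqs[q->head];

        q->head = (q->head + 1) % KCOMPRESSD_QUEUE_LEN;
        q->count--;
        r.fn(r.ctx);
        done++;
    }
    return done;
}

/* Example zram-style context: stand-ins for struct zram and the bio. */
struct zram_write_ctx {
    int zram_id;
    int bio_id;
    int done;
};

static void zram_write_cb(void *ctx)
{
    struct zram_write_ctx *w = ctx;

    w->done = 1;    /* the real callback would compress and store here */
}
```

A zswap backend would pass a different callback and context (for
example one wrapping just the folio) through the same two functions.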



> > Thanks
> > Barry
>
> Best Regards,
> Qun-wei
>
>

Nhat Pham March 10, 2025, 5:30 p.m. UTC | #7

On Mon, Mar 10, 2025 at 9:58 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Mar 10, 2025 at 6:22 AM Qun-wei Lin (林群崴)
> <Qun-wei.Lin@mediatek.com> wrote:
> >
> >
> > Thank you for your explanation. Compared to the original single kswapd,
> > we expect t1 to have a slight increase in re-scan time. However, since
> > our kcompressd can focus on compression tasks and we can have multiple
> > kcompressd instances (kcompressd0, kcompressd1, ...) running in
> > parallel, we anticipate that a folio will not need to be re-scanned
> > too many times.
> >
> > In our experiments, we fixed the CPU and DRAM at a certain frequency.
> > We created a high memory pressure environment using a memory eater and
> > recorded the increase in pgsteal_anon per second, which was around
> > 300,000. Then we applied our patch and measured again: pgsteal_anon/s
> > increased to over 800,000.
> >
> > > >
> > > > >
> > > > > Problem:
> > > > >  In the current system, the kswapd thread is responsible for both
> > > > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > > > >  combined responsibility can lead to significant performance
> > > > > bottlenecks,
> > > >
> > > > What bottleneck are we talking about? Is one stage slower than the
> > > > other?
> > > >
> > > > >  especially under high memory pressure. The kswapd thread becomes
> > > > > a
> > > > >  single point of contention, causing delays in memory reclaiming
> > > > > and
> > > > >  overall system performance degradation.
> > > > >
> > > > > Target:
> > > > >  The target of this invention is to improve the efficiency of
> > > > > memory
> > > > >  reclaiming. By separating the tasks of page scanning and page
> > > > >  compression into distinct processes or threads, the system can
> > > > > handle
> > > > >  memory pressure more effectively.
> > > >
> > > > I'm not a zram maintainer, so I'm definitely not trying to stop
> > > > this
> > > > patch. But whatever problem zram is facing will likely occur with
> > > > zswap too, so I'd like to learn more :)
> > >
> > > Right, this is likely something that could be addressed more
> > > generally
> > > for zswap and zram.
> > >
> >
> > Yes, we also hope to extend this to other swap devices, but currently,
> > we have only modified zram. We are not very familiar with zswap and
> > would like to ask if anyone has any suggestions for modifications?
> >
>
> My understanding is that right now schedule_bio_write is the work
> submission API, right? We can make it generic, having it accept a
> callback and a generic untyped pointer which can be cast into a
> backend-specific context struct. For zram it would contain struct zram
> and the bio. For zswap, it depends on the point at which you want to
> begin offloading the work: it could simply be the folio itself
> if we offload early, or a more complicated scheme.

To expand a bit - zswap_store() is where all the logic lives. It's
fairly straightforward: check zswap cgroup limits, acquire the
zswap pool (a combination of compression algorithm and backend memory
allocator, which is just zsmalloc now), perform compression, then ask
for a slot from zsmalloc and store the compressed data there.

You can probably just offload the whole thing here, or perform some
steps of the sequence before offloading the rest :) One slight
complication: don't forget to fall back to disk swapping. Unlike
zram, zswap was originally designed as a "cache" for underlying swap
files on disk, which we can fall back to if the compression attempt
fails. Everything should be fairly straightforward though :)

>
>
>
> > > Thanks
> > > Barry
> >
> > Best Regards,
> > Qun-wei
> >
> >

Sergey Senozhatsky March 11, 2025, 4:58 a.m. UTC | #8

On (25/03/08 18:41), Barry Song wrote:
> On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote:
> > >
> > > This patch series introduces a new mechanism called kcompressd to
> > > improve the efficiency of memory reclaiming in the operating system. The
> > > main goal is to separate the tasks of page scanning and page compression
> > > into distinct processes or threads, thereby reducing the load on the
> > > kswapd thread and enhancing overall system performance under high memory
> > > pressure conditions.
> >
> > Please excuse my ignorance, but from your cover letter I still don't
> > quite get what the problem is here. And how would decoupling compression
> > and scanning help?
> 
> My understanding is as follows:
> 
> When kswapd attempts to reclaim M anonymous folios and N file folios,
> the process involves the following steps:
> 
> * t1: Time to scan and unmap anonymous folios
> * t2: Time to compress anonymous folios
> * t3: Time to reclaim file folios
> 
> Currently, these steps are executed sequentially, meaning the total time
> required to reclaim M + N folios is t1 + t2 + t3.
> 
> However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel,
> reducing the total time to max(t1 + t3, t2). This likely improves the
> reclamation speed, potentially reducing allocation stalls.

If compression kthread-s can run (have CPUs to be scheduled on).
This looks a bit like a bottleneck.  Is there anything that
guarantees forward progress?  Also, if compression kthreads
constantly preempt kswapd, then it might not be worth it to
have compression kthreads, I assume?

If we have a page fault and need to map a page that is still in
the compression queue (not compressed and stored in zram yet, e.g.
due to scheduling latency + slow compression algorithm) then what
happens?

Barry Song March 11, 2025, 9:33 a.m. UTC | #9

On Tue, Mar 11, 2025 at 5:58 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (25/03/08 18:41), Barry Song wrote:
> > On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin <qun-wei.lin@mediatek.com> wrote:
> > > >
> > > > This patch series introduces a new mechanism called kcompressd to
> > > > improve the efficiency of memory reclaiming in the operating system. The
> > > > main goal is to separate the tasks of page scanning and page compression
> > > > into distinct processes or threads, thereby reducing the load on the
> > > > kswapd thread and enhancing overall system performance under high memory
> > > > pressure conditions.
> > >
> > > Please excuse my ignorance, but from your cover letter I still don't
> > > quite get what the problem is here. And how would decoupling compression
> > > and scanning help?
> >
> > My understanding is as follows:
> >
> > When kswapd attempts to reclaim M anonymous folios and N file folios,
> > the process involves the following steps:
> >
> > * t1: Time to scan and unmap anonymous folios
> > * t2: Time to compress anonymous folios
> > * t3: Time to reclaim file folios
> >
> > Currently, these steps are executed sequentially, meaning the total time
> > required to reclaim M + N folios is t1 + t2 + t3.
> >
> > However, Qun-Wei's patch enables t1 + t3 and t2 to run in parallel,
> > reducing the total time to max(t1 + t3, t2). This likely improves the
> > reclamation speed, potentially reducing allocation stalls.
>
> If compression kthread-s can run (have CPUs to be scheduled on).
> This looks a bit like a bottleneck.  Is there anything that
> guarantees forward progress?  Also, if compression kthreads
> constantly preempt kswapd, then it might not be worth it to
> have compression kthreads, I assume?

Thanks for your critical insights, all of which are valuable.

Qun-Wei is likely working on an Android case where the CPU is
relatively idle in many scenarios (though there are certainly cases
where all CPUs are busy), but free memory is quite limited.
We may soon see benefits for these types of use cases. I expect
Android might have the opportunity to adopt it before it's fully
ready upstream.

If the workload keeps all CPUs busy, I suppose this async thread
won’t help, but at least we might find a way to mitigate regression.

We likely need to collect more data on various scenarios—when
CPUs are relatively idle and when all CPUs are busy—and
determine the proper approach based on the data, which we
currently lack :-)

>
> If we have a page fault and need to map a page that is still in
> the compression queue (not compressed and stored in zram yet, e.g.
> due to scheduling latency + slow compression algorithm) then what
> happens?

This is already happening even without the patch, right? Right now we
have 4 steps:
1. add_to_swap: The folio is added to the swapcache.
2. try_to_unmap: PTEs are converted to swap entries.
3. pageout: The folio is written back.
4. Swapcache is cleared.

If a swap-in occurs between 2 and 4, doesn't that mean
we've already encountered the case where we hit
the swapcache for a folio undergoing compression?

It seems we might have an opportunity to terminate
compression if the request is still in the queue and
compression hasn’t started for a folio yet? seems
quite difficult to do?
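
The four steps above can be sketched as a small state model. This is a simplified user-space illustration with made-up enum and function names; it ignores folio locking and any waiting on writeback, and only shows where a swap-in would find the data at each stage:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stages, mirroring the four steps listed above. */
enum swapout_stage {
	STAGE_MAPPED,       /* before step 1: PTEs still map the folio */
	STAGE_IN_SWAPCACHE, /* after step 1 (add_to_swap) */
	STAGE_UNMAPPED,     /* after step 2 (try_to_unmap): PTEs are swap entries */
	STAGE_WRITEBACK,    /* during step 3 (pageout): queued or being compressed */
	STAGE_SWAPPED_OUT,  /* after step 4: swapcache cleared */
};

enum fault_source {
	FAULT_NONE,           /* PTE still present, no swap fault occurs */
	FAULT_FROM_SWAPCACHE, /* folio is found in the swapcache */
	FAULT_FROM_BACKEND,   /* data must be read back from zram/zswap */
};

bool folio_in_swapcache(enum swapout_stage s)
{
	return s >= STAGE_IN_SWAPCACHE && s <= STAGE_WRITEBACK;
}

/* Where would a fault on this folio get its data at stage s? */
enum fault_source swapin_source(enum swapout_stage s)
{
	assert(s <= STAGE_SWAPPED_OUT);

	if (s < STAGE_UNMAPPED)
		return FAULT_NONE;
	return folio_in_swapcache(s) ? FAULT_FROM_SWAPCACHE : FAULT_FROM_BACKEND;
}
```

In this model a folio sitting in a compression queue is simply in STAGE_WRITEBACK for longer, so the swapcache hit between steps 2 and 4 is the same case as today, only with a wider window.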

Thanks
Barry

Qun-wei Lin (林群崴) March 11, 2025, 2:12 p.m. UTC | #10

On Tue, 2025-03-11 at 22:33 +1300, Barry Song wrote:
> 
> On Tue, Mar 11, 2025 at 5:58 PM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> > 
> > On (25/03/08 18:41), Barry Song wrote:
> > > On Sat, Mar 8, 2025 at 12:03 PM Nhat Pham <nphamcs@gmail.com>
> > > wrote:
> > > > 
> > > > On Fri, Mar 7, 2025 at 4:02 AM Qun-Wei Lin
> > > > <qun-wei.lin@mediatek.com> wrote:
> > > > > 
> > > > > This patch series introduces a new mechanism called
> > > > > kcompressd to
> > > > > improve the efficiency of memory reclaiming in the operating
> > > > > system. The
> > > > > main goal is to separate the tasks of page scanning and page
> > > > > compression
> > > > > into distinct processes or threads, thereby reducing the load
> > > > > on the
> > > > > kswapd thread and enhancing overall system performance under
> > > > > high memory
> > > > > pressure conditions.
> > > > 
> > > > Please excuse my ignorance, but from your cover letter I still
> > > > don't
> > > > quite get what is the problem here? And how would decouple
> > > > compression
> > > > and scanning help?
> > > 
> > > My understanding is as follows:
> > > 
> > > When kswapd attempts to reclaim M anonymous folios and N file
> > > folios,
> > > the process involves the following steps:
> > > 
> > > * t1: Time to scan and unmap anonymous folios
> > > * t2: Time to compress anonymous folios
> > > * t3: Time to reclaim file folios
> > > 
> > > Currently, these steps are executed sequentially, meaning the
> > > total time
> > > required to reclaim M + N folios is t1 + t2 + t3.
> > > 
> > > However, Qun-Wei's patch enables t1 + t3 and t2 to run in
> > > parallel,
> > > reducing the total time to max(t1 + t3, t2). This likely improves
> > > the
> > > reclamation speed, potentially reducing allocation stalls.
> > 
> > If compression kthread-s can run (have CPUs to be scheduled on).
> > This looks a bit like a bottleneck.  Is there anything that
> > guarantees forward progress?  Also, if compression kthreads
> > constantly preempt kswapd, then it might not be worth it to
> > have compression kthreads, I assume?
> 
> Thanks for your critical insights, all of which are valuable.
> 
> Qun-Wei is likely working on an Android case where the CPU is
> relatively idle in many scenarios (though there are certainly cases
> where all CPUs are busy), but free memory is quite limited.
> We may soon see benefits for these types of use cases. I expect
> Android might have the opportunity to adopt it before it's fully
> ready upstream.
> 
> If the workload keeps all CPUs busy, I suppose this async thread
> won’t help, but at least we might find a way to mitigate regression.
> 
> We likely need to collect more data on various scenarios—when
> CPUs are relatively idle and when all CPUs are busy—and
> determine the proper approach based on the data, which we
> currently lack :-)
> 

Thanks for the explanation!

> > 
> > If we have a pagefault and need to map a page that is still in
> > the compression queue (not compressed and stored in zram yet, e.g.
> > due to scheduling latency + slow compression algorithm) then what
> > happens?
> 
> This is happening now even without the patch?  Right now we are
> having 4 steps:
> 1. add_to_swap: The folio is added to the swapcache.
> 2. try_to_unmap: PTEs are converted to swap entries.
> 3. pageout: The folio is written back.
> 4. Swapcache is cleared.
> 
> If a swap-in occurs between 2 and 4, doesn't that mean
> we've already encountered the case where we hit
> the swapcache for a folio undergoing compression?
> 
> It seems we might have an opportunity to terminate
> compression if the request is still in the queue and
> compression hasn’t started for a folio yet? seems
> quite difficult to do?

As Barry explained, these folios that are being compressed are in the
swapcache. If a refault occurs during the compression process, its
correctness is already guaranteed by the swap subsystem (similar to 
other asynchronous swap devices).

Indeed, terminating a folio that is already in the queue waiting for
compression is a challenging task. Will this require some modifications
to the current architecture of the swap subsystem?

> 
> Thanks
> Barry

Best Regards,
Qun-wei

Sergey Senozhatsky March 12, 2025, 5:19 a.m. UTC | #11

On (25/03/11 14:12), Qun-wei Lin (林群崴) wrote:
> > > If compression kthread-s can run (have CPUs to be scheduled on).
> > > This looks a bit like a bottleneck.  Is there anything that
> > > guarantees forward progress?  Also, if compression kthreads
> > > constantly preempt kswapd, then it might not be worth it to
> > > have compression kthreads, I assume?
> >
> > Thanks for your critical insights, all of which are valuable.
> >
> > Qun-Wei is likely working on an Android case where the CPU is
> > relatively idle in many scenarios (though there are certainly cases
> > where all CPUs are busy), but free memory is quite limited.
> > We may soon see benefits for these types of use cases. I expect
> > Android might have the opportunity to adopt it before it's fully
> > ready upstream.
> >
> > If the workload keeps all CPUs busy, I suppose this async thread
> > won’t help, but at least we might find a way to mitigate regression.
> >
> > We likely need to collect more data on various scenarios—when
> > CPUs are relatively idle and when all CPUs are busy—and
> > determine the proper approach based on the data, which we
> > currently lack :-)

Right.  The scan/unmap can move very fast (a rabbit) while the
compressor can move rather slow (a tortoise.)  There is some
benefit in the fact that kswapd does compression directly, I'd
presume.

Another thing to consider, perhaps, is that not every page is
necessarily required to go through the compressor queue and stay
there until the woken-up compressor finally picks it up just to
realize that the page is filled with 0xff (or any other pattern).
At least on the zram side such pages are not compressed and stored
as an 8-byte pattern in the zram meta table (w/o using any zsmalloc
memory.)
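
A user-space sketch of the kind of check described above, for illustration only. The function name, SKETCH_PAGE_SIZE and the out-parameter are assumptions made for this example and are not zram's actual code; the buffer is assumed to be word-aligned:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size for this sketch */

/*
 * Return true if every word in the page equals the first word and
 * report that word through *pattern. A page like this can be recorded
 * as a single word instead of being queued for the compressor.
 */
bool page_is_same_filled(const void *page, unsigned long *pattern)
{
	const unsigned long *words = page;
	size_t nwords = SKETCH_PAGE_SIZE / sizeof(*words);
	size_t i;

	assert(page && pattern);

	for (i = 1; i < nwords; i++)
		if (words[i] != words[0])
			return false;

	*pattern = words[0];
	return true;
}
```

Running a cheap test like this before enqueueing would let such pages skip the queue and the wake-up entirely; whether that is worth doing above zram is a separate question.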

> > > If we have a pagefault and need to map a page that is still in
> > > the compression queue (not compressed and stored in zram yet, e.g.
> > > due to scheduling latency + slow compression algorithm) then what
> > > happens?
> >
> > This is happening now even without the patch?  Right now we are
> > having 4 steps:
> > 1. add_to_swap: The folio is added to the swapcache.
> > 2. try_to_unmap: PTEs are converted to swap entries.
> > 3. pageout: The folio is written back.
> > 4. Swapcache is cleared.
> >
> > If a swap-in occurs between 2 and 4, doesn't that mean
> > we've already encountered the case where we hit
> > the swapcache for a folio undergoing compression?
> >
> > It seems we might have an opportunity to terminate
> > compression if the request is still in the queue and
> > compression hasn’t started for a folio yet? seems
> > quite difficult to do?
> 
> As Barry explained, these folios that are being compressed are in the
> swapcache. If a refault occurs during the compression process, its
> correctness is already guaranteed by the swap subsystem (similar to
> other asynchronous swap devices).

Right.  I just was thinking that now there is a wake_up between
scan/unmap and compress.  Not sure how much trouble this can make.

> Indeed, terminating a folio that is already in the queue waiting for
> compression is a challenging task. Will this require some modifications
> to the current architecture of the swap subsystem?

Yeah, I'll leave it to the mm folks to decide :)

Minchan Kim March 12, 2025, 6:11 p.m. UTC | #12

Hi Qun-Wei

On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> This patch series introduces a new mechanism called kcompressd to
> improve the efficiency of memory reclaiming in the operating system. The
> main goal is to separate the tasks of page scanning and page compression
> into distinct processes or threads, thereby reducing the load on the
> kswapd thread and enhancing overall system performance under high memory
> pressure conditions.
> 
> Problem:
>  In the current system, the kswapd thread is responsible for both
>  scanning the LRU pages and compressing pages into the ZRAM. This
>  combined responsibility can lead to significant performance bottlenecks,
>  especially under high memory pressure. The kswapd thread becomes a
>  single point of contention, causing delays in memory reclaiming and
>  overall system performance degradation.

Isn't this a general problem if the swap backend is slow (but synchronous)?
I think zram needs to support asynchronous I/O (it could introduce multiple
threads to compress batched pages) and not declare itself a
synchronous device in that case.

Sergey Senozhatsky March 13, 2025, 3:09 a.m. UTC | #13

On (25/03/12 11:11), Minchan Kim wrote:
> On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> > This patch series introduces a new mechanism called kcompressd to
> > improve the efficiency of memory reclaiming in the operating system. The
> > main goal is to separate the tasks of page scanning and page compression
> > into distinct processes or threads, thereby reducing the load on the
> > kswapd thread and enhancing overall system performance under high memory
> > pressure conditions.
> > 
> > Problem:
> >  In the current system, the kswapd thread is responsible for both
> >  scanning the LRU pages and compressing pages into the ZRAM. This
> >  combined responsibility can lead to significant performance bottlenecks,
> >  especially under high memory pressure. The kswapd thread becomes a
> >  single point of contention, causing delays in memory reclaiming and
> >  overall system performance degradation.
> 
> Isn't this a general problem if the swap backend is slow (but synchronous)?
> I think zram needs to support asynchronous I/O (it could introduce multiple
> threads to compress batched pages) and not declare itself a
> synchronous device in that case.

The current conclusion is that kcompressd will sit above zram,
because zram is not the only compressing swap backend we have.

Barry Song March 13, 2025, 3:45 a.m. UTC | #14

On Thu, Mar 13, 2025 at 4:09 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (25/03/12 11:11), Minchan Kim wrote:
> > On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> > > This patch series introduces a new mechanism called kcompressd to
> > > improve the efficiency of memory reclaiming in the operating system. The
> > > main goal is to separate the tasks of page scanning and page compression
> > > into distinct processes or threads, thereby reducing the load on the
> > > kswapd thread and enhancing overall system performance under high memory
> > > pressure conditions.
> > >
> > > Problem:
> > >  In the current system, the kswapd thread is responsible for both
> > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > >  combined responsibility can lead to significant performance bottlenecks,
> > >  especially under high memory pressure. The kswapd thread becomes a
> > >  single point of contention, causing delays in memory reclaiming and
> > >  overall system performance degradation.
> >
> > Isn't this a general problem if the swap backend is slow (but synchronous)?
> > I think zram needs to support asynchronous I/O (it could introduce multiple
> > threads to compress batched pages) and not declare itself a
> > synchronous device in that case.
>
> The current conclusion is that kcompressd will sit above zram,
> because zram is not the only compressing swap backend we have.

Also, it is not good to hack zram to be aware of whether it is being
used by kswapd, direct reclaim, proactive reclaim, or as a block device
with a mounted filesystem.

So I am thinking of something like the below:

page_io.c

if (sync_device or zswap_enabled())
   schedule swap_writepage to a separate per-node thread

BTW, I ran the current patchset with one thread (not the default 4)
on phones and saw a 50%+ reduction in allocstall, so the idea
looks like a good direction to go.

Thanks
Barry

Barry Song March 13, 2025, 3:52 a.m. UTC | #15

On Thu, Mar 13, 2025 at 4:09 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (25/03/12 11:11), Minchan Kim wrote:
> > On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> > > This patch series introduces a new mechanism called kcompressd to
> > > improve the efficiency of memory reclaiming in the operating system. The
> > > main goal is to separate the tasks of page scanning and page compression
> > > into distinct processes or threads, thereby reducing the load on the
> > > kswapd thread and enhancing overall system performance under high memory
> > > pressure conditions.
> > >
> > > Problem:
> > >  In the current system, the kswapd thread is responsible for both
> > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > >  combined responsibility can lead to significant performance bottlenecks,
> > >  especially under high memory pressure. The kswapd thread becomes a
> > >  single point of contention, causing delays in memory reclaiming and
> > >  overall system performance degradation.
> >
> > Isn't this a general problem if the swap backend is slow (but synchronous)?
> > I think zram needs to support asynchronous I/O (it could introduce multiple
> > threads to compress batched pages) and not declare itself a
> > synchronous device in that case.
>
> The current conclusion is that kcompressd will sit above zram,
> because zram is not the only compressing swap backend we have.

Also, it is not good to hack zram to be aware of whether it is being
used by kswapd, direct reclaim, proactive reclaim, or as a block device
with a mounted filesystem.

So I am thinking of something like the below:

page_io.c

if (sync_device or zswap_enabled())
   schedule swap_writepage to a separate per-node thread

BTW, I ran the current patchset with one thread (not the default 4)
on phones and saw a 50%+ reduction in allocstall, so the idea
looks like a good direction to go.

Thanks
Barry

Barry Song March 13, 2025, 9:30 a.m. UTC | #16

On Thu, Mar 13, 2025 at 4:52 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Mar 13, 2025 at 4:09 PM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> >
> > On (25/03/12 11:11), Minchan Kim wrote:
> > > On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> > > > This patch series introduces a new mechanism called kcompressd to
> > > > improve the efficiency of memory reclaiming in the operating system. The
> > > > main goal is to separate the tasks of page scanning and page compression
> > > > into distinct processes or threads, thereby reducing the load on the
> > > > kswapd thread and enhancing overall system performance under high memory
> > > > pressure conditions.
> > > >
> > > > Problem:
> > > >  In the current system, the kswapd thread is responsible for both
> > > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > > >  combined responsibility can lead to significant performance bottlenecks,
> > > >  especially under high memory pressure. The kswapd thread becomes a
> > > >  single point of contention, causing delays in memory reclaiming and
> > > >  overall system performance degradation.
> > >
> > > Isn't this a general problem if the swap backend is slow (but synchronous)?
> > > I think zram needs to support asynchronous I/O (it could introduce multiple
> > > threads to compress batched pages) and not declare itself a
> > > synchronous device in that case.
> >
> > The current conclusion is that kcompressd will sit above zram,
> > because zram is not the only compressing swap backend we have.
>
> Also, it is not good to hack zram to be aware of whether it is being
> used by kswapd, direct reclaim, proactive reclaim, or as a block device
> with a mounted filesystem.
>
> So I am thinking of something like the below:
>
> page_io.c
>
> if (sync_device or zswap_enabled())
>    schedule swap_writepage to a separate per-node thread
>

Hi Qun-wei, Nhat, Sergey and Minchan,

I managed to find some time to prototype a kcompressd that supports
both zswap and zram, though it has only been build-tested.

Hi Qun-wei,

Apologies, but I’m quite busy with other tasks and don’t have time to
debug or test it. Please feel free to test it. When you submit v2, you’re
welcome to keep yourself as the author of the patch, as in v1.

If you’re okay with it, you can also add me as a co-developer in the
changelog. For the prototype below, I'd rather start with a per-node thread
approach. While this might not provide the greatest benefit, it carries
the least risk and helps avoid complex questions, such as how to
determine the number of threads. We have also observed a
significant reduction in allocstall by using a single thread to
asynchronously handle kswapd's compression, as I reported.
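
As a rough illustration of the hand-off such a per-node thread relies on, here is a minimal user-space sketch of a bounded, byte-oriented FIFO of pointers. It is not the kernel kfifo API; the type and function names and the tiny slot count are assumptions made for this example. The put side copies the pointer value from the address of the pointer variable, and a full queue returns 0 so the caller can fall back to doing the work inline:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_FIFO_SLOTS 4 /* tiny on purpose, to make "full" easy to hit */

struct ptr_fifo {
	unsigned char buf[SKETCH_FIFO_SLOTS * sizeof(void *)];
	size_t in;  /* total bytes ever written */
	size_t out; /* total bytes ever read */
};

/* Queue one pointer; returns bytes stored, or 0 if the FIFO is full. */
size_t ptr_fifo_put(struct ptr_fifo *f, void *p)
{
	size_t off;

	assert(f);
	if (f->in - f->out == sizeof(f->buf))
		return 0;

	off = f->in % sizeof(f->buf);
	/* Copy the pointer value itself, not the object it points to. */
	memcpy(f->buf + off, &p, sizeof(p));
	f->in += sizeof(p);
	return sizeof(p);
}

/* Dequeue one pointer; returns bytes read, or 0 if the FIFO is empty. */
size_t ptr_fifo_get(struct ptr_fifo *f, void **p)
{
	size_t off;

	assert(f && p);
	if (f->in == f->out)
		return 0;

	off = f->out % sizeof(f->buf);
	memcpy(p, f->buf + off, sizeof(*p));
	f->out += sizeof(*p);
	return sizeof(*p);
}
```

The sketch is single-threaded. In a threaded version the producer would also need to wake the consumer after a successful put, and the two sides would need the usual ordering guarantees; both are left out here.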

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dbb0ad69e17f..4f9ee2fb338d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 #include <linux/local_lock.h>
 #include <linux/zswap.h>
+#include <linux/kfifo.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -1389,6 +1390,11 @@ typedef struct pglist_data {
 
 	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
 
+#define KCOMPRESS_FIFO_SIZE 256
+	wait_queue_head_t kcompressd_wait;
+	struct task_struct *kcompressd;
+	struct kfifo kcompress_fifo;
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_highest_zoneidx;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 281802a7a10d..8cd143f59e76 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1410,6 +1410,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	pgdat_init_kcompactd(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kcompressd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
diff --git a/mm/page_io.c b/mm/page_io.c
index 4bce19df557b..7bbd14991ffb 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -233,6 +233,33 @@ static void swap_zeromap_folio_clear(struct folio *folio)
 	}
 }
 
+static bool swap_sched_async_compress(struct folio *folio)
+{
+	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	int nid = numa_node_id();
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	if (unlikely(!pgdat->kcompressd))
+		return false;
+
+	if (!current_is_kswapd())
+		return false;
+
+	if (!folio_test_anon(folio))
+		return false;
+	/*
+	 * This case needs to synchronously return AOP_WRITEPAGE_ACTIVATE
+	 */
+	if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio)))
+		return false;
+
+	sis = swp_swap_info(folio->swap);
+	if (zswap_is_enabled() || data_race(sis->flags & SWP_SYNCHRONOUS_IO))
+		return kfifo_in(&pgdat->kcompress_fifo, &folio, sizeof(folio));
+
+	return false;
+}
+
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.
@@ -275,6 +302,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		 */
 		swap_zeromap_folio_clear(folio);
 	}
+
+	/*
+	 * Compression within zswap and zram might block rmap, unmap
+	 * of both file and anon pages, try to do compression async
+	 * if possible
+	 */
+	if (swap_sched_async_compress(folio))
+		return 0;
+
 	if (zswap_store(folio)) {
 		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		folio_unlock(folio);
@@ -289,6 +325,38 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	return 0;
 }
 
+int kcompressd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t *)p;
+	struct folio *folio;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.for_reclaim = 1,
+	};
+
+	while (!kthread_should_stop()) {
+		wait_event_interruptible(pgdat->kcompressd_wait,
+				!kfifo_is_empty(&pgdat->kcompress_fifo));
+
+		if (kthread_should_stop())
+			break;
+		while (!kfifo_is_empty(&pgdat->kcompress_fifo)) {
+			if (kfifo_out(&pgdat->kcompress_fifo, &folio, sizeof(folio))) {
+				if (zswap_store(folio)) {
+					count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
+					folio_unlock(folio);
+					continue;
+				}
+				__swap_writepage(folio, &wbc);
+			}
+		}
+	}
+	return 0;
+}
+
 static inline void count_swpout_vm_event(struct folio *folio)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/swap.h b/mm/swap.h
index 0abb68091b4f..38d61c6a06f1 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -21,6 +21,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
 void swap_write_unplug(struct swap_iocb *sio);
 int swap_writepage(struct page *page, struct writeback_control *wbc);
 void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
+int kcompressd(void *p);
 
 /* linux/mm/swap_state.c */
 /* One swap address space for each 64M swap space */
@@ -198,6 +199,11 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
 	return 0;
 }
 
+static inline int kcompressd(void *p)
+{
+	return 0;
+}
+
 #endif /* CONFIG_SWAP */
 
 #endif /* _MM_SWAP_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bc740637a6c..ba0245b74e45 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7370,6 +7370,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 void __meminit kswapd_run(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
+	int ret;
 
 	pgdat_kswapd_lock(pgdat);
 	if (!pgdat->kswapd) {
@@ -7383,7 +7384,23 @@ void __meminit kswapd_run(int nid)
 		} else {
 			wake_up_process(pgdat->kswapd);
 		}
+		ret = kfifo_alloc(&pgdat->kcompress_fifo,
+				KCOMPRESS_FIFO_SIZE * sizeof(struct folio *),
+				GFP_KERNEL);
+		if (ret)
+			goto out;
+		pgdat->kcompressd = kthread_create_on_node(kcompressd, pgdat, nid,
+				"kcompressd%d", nid);
+		if (IS_ERR(pgdat->kcompressd)) {
+			pr_err("Failed to start kcompressd on node %d, ret=%ld\n",
+					nid, PTR_ERR(pgdat->kcompressd));
+			pgdat->kcompressd = NULL;
+			kfifo_free(&pgdat->kcompress_fifo);
+		} else {
+			wake_up_process(pgdat->kcompressd);
+		}
 	}
+out:
 	pgdat_kswapd_unlock(pgdat);
 }
 
@@ -7402,6 +7419,11 @@ void __meminit kswapd_stop(int nid)
 		kthread_stop(kswapd);
 		pgdat->kswapd = NULL;
 	}
+	if (pgdat->kcompressd) {
+		kthread_stop(pgdat->kcompressd);
+		pgdat->kcompressd = NULL;
+		kfifo_free(&pgdat->kcompress_fifo);
+	}
 	pgdat_kswapd_unlock(pgdat);
 }
 

> BTW, I ran the current patchset with one thread (not the default 4)
> on phones and saw a 50%+ reduction in allocstall, so the idea
> looks like a good direction to go.
>

Thanks
Barry

Minchan Kim March 13, 2025, 4:07 p.m. UTC | #17

On Thu, Mar 13, 2025 at 04:45:54PM +1300, Barry Song wrote:
> On Thu, Mar 13, 2025 at 4:09 PM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> >
> > On (25/03/12 11:11), Minchan Kim wrote:
> > > On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> > > > This patch series introduces a new mechanism called kcompressd to
> > > > improve the efficiency of memory reclaiming in the operating system. The
> > > > main goal is to separate the tasks of page scanning and page compression
> > > > into distinct processes or threads, thereby reducing the load on the
> > > > kswapd thread and enhancing overall system performance under high memory
> > > > pressure conditions.
> > > >
> > > > Problem:
> > > >  In the current system, the kswapd thread is responsible for both
> > > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > > >  combined responsibility can lead to significant performance bottlenecks,
> > > >  especially under high memory pressure. The kswapd thread becomes a
> > > >  single point of contention, causing delays in memory reclaiming and
> > > >  overall system performance degradation.
> > >
> > > Isn't this a general problem if the swap backend is slow (but synchronous)?
> > > I think zram needs to support asynchronous I/O (it could introduce multiple
> > > threads to compress batched pages) and not declare itself a
> > > synchronous device in that case.
> >
> > The current conclusion is that kcompressd will sit above zram,
> > because zram is not the only compressing swap backend we have.

Then how does that handle the file I/O case?

> 
> Also, it is not good to hack zram to be aware of whether it is being
> used by kswapd, direct reclaim, proactive reclaim, or as a block device
> with a mounted filesystem.

Why shouldn't zram be aware of that instead of just introducing
queues in the zram with multiple compression threads?

> 
> So I am thinking of something like the below:
> 
> page_io.c
> 
> if (sync_device or zswap_enabled())
>    schedule swap_writepage to a separate per-node thread

I am not sure it's a good idea to mix a feature across different
layers. This isn't only a swap problem. Such parallelism under the
device is a common technique these days, and it would help file I/O cases.

Furthermore, it would open the chance for zram to try to compress
multiple pages at once.

Barry Song March 13, 2025, 4:58 p.m. UTC | #18

On Fri, Mar 14, 2025 at 5:07 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, Mar 13, 2025 at 04:45:54PM +1300, Barry Song wrote:
> > On Thu, Mar 13, 2025 at 4:09 PM Sergey Senozhatsky
> > <senozhatsky@chromium.org> wrote:
> > >
> > > On (25/03/12 11:11), Minchan Kim wrote:
> > > > On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> > > > > This patch series introduces a new mechanism called kcompressd to
> > > > > improve the efficiency of memory reclaiming in the operating system. The
> > > > > main goal is to separate the tasks of page scanning and page compression
> > > > > into distinct processes or threads, thereby reducing the load on the
> > > > > kswapd thread and enhancing overall system performance under high memory
> > > > > pressure conditions.
> > > > >
> > > > > Problem:
> > > > >  In the current system, the kswapd thread is responsible for both
> > > > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > > > >  combined responsibility can lead to significant performance bottlenecks,
> > > > >  especially under high memory pressure. The kswapd thread becomes a
> > > > >  single point of contention, causing delays in memory reclaiming and
> > > > >  overall system performance degradation.
> > > >
> > > > Isn't this a general problem if the swap backend is slow (but synchronous)?
> > > > I think zram needs to support asynchronous I/O (it could introduce multiple
> > > > threads to compress batched pages) and not declare itself a
> > > > synchronous device in that case.
> > >
> > > The current conclusion is that kcompressd will sit above zram,
> > > because zram is not the only compressing swap backend we have.
>
> Then how does that handle the file I/O case?

I didn't quite catch your question :-)

>
> >
> > Also, it is not good to hack zram to be aware of whether it is being
> > used by kswapd, direct reclaim, proactive reclaim, or as a block device
> > with a mounted filesystem.
>
> Why shouldn't zram be aware of that instead of just introducing
> queues in the zram with multiple compression threads?
>

My view is the opposite of yours :-)

Integrating kswapd, direct reclaim, etc., into the zram driver
would violate layering principles. zram is purely a block device
driver, and how it is used should be handled separately. Callers have
greater flexibility to determine its usage, similar to how different
I/O models exist in user space.

Currently, Qun-Wei's patch checks whether the current thread is kswapd.
If it is, compression is performed asynchronously by threads;
otherwise, it is done in the current thread. In the future, we may
have additional reclaim threads, such as for damon or
madv_pageout, etc.
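
The caller-side decision described above, combined with the page_io.c idea quoted in this thread, could be sketched roughly as follows. This is a user-space illustration only: the struct, its fields and the function name are hypothetical, and the real conditions in the series may differ:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one swap-out request; not a kernel type. */
struct swapout_req {
	bool from_kswapd;    /* caller is kswapd, not direct reclaim etc. */
	bool sync_backend;   /* backend completes writes synchronously (zram-like) */
	bool zswap_enabled;  /* zswap sits in front of the swap device */
	bool worker_running; /* the per-node compression thread exists */
};

enum swapout_mode {
	SWAPOUT_INLINE,   /* compress in the calling thread */
	SWAPOUT_DEFERRED, /* hand off to the per-node thread */
};

enum swapout_mode choose_swapout_mode(const struct swapout_req *req)
{
	assert(req);

	/* Only kswapd offloads; every other caller keeps today's behaviour. */
	if (!req->from_kswapd || !req->worker_running)
		return SWAPOUT_INLINE;

	/* Offload only when the write would otherwise be CPU-bound compression. */
	if (req->sync_backend || req->zswap_enabled)
		return SWAPOUT_DEFERRED;

	return SWAPOUT_INLINE;
}
```

Keeping this policy in the caller means a future reclaim thread only needs one more field or condition here, and the block driver itself stays unaware of who is writing.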

> >
> > So I am thinking of something like the below:
> >
> > page_io.c
> >
> > if (sync_device or zswap_enabled())
> >    schedule swap_writepage to a separate per-node thread
>
> I am not sure it's a good idea to mix a feature across different
> layers. This isn't only a swap problem. Such parallelism under the
> device is a common technique these days, and it would help file I/O cases.
>

zswap and zram share the same needs, and handling this in page_io
can benefit both through common code. It is up to the callers to decide
the I/O model.

I agree that "parallelism under the device" is a common technique,
but our case is different—the device achieves parallelism with
offload hardware, whereas we rely on CPUs, which can be scarce.
These threads may also preempt CPUs that are critically needed
by other non-compression tasks, and burst power consumption
can sometimes be difficult to control.

> Furthermore, it would open the chance for zram to try to compress
> multiple pages at once.

We are already in this situation when multiple callers use zram simultaneously,
such as during direct reclaim or with a mounted filesystem.

Of course, this allows multiple pages to be compressed simultaneously,
even if the user is single-threaded. However, determining when to enable
these threads and whether they will be effective is challenging, as it
depends on system load. For example, Qun-Wei's patch chose not to use
threads for direct reclaim as, I guess,  it might be harmful.

Thanks
Barry

Minchan Kim March 13, 2025, 5:33 p.m. UTC | #19

On Fri, Mar 14, 2025 at 05:58:00AM +1300, Barry Song wrote:
> On Fri, Mar 14, 2025 at 5:07 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Thu, Mar 13, 2025 at 04:45:54PM +1300, Barry Song wrote:
> > > On Thu, Mar 13, 2025 at 4:09 PM Sergey Senozhatsky
> > > <senozhatsky@chromium.org> wrote:
> > > >
> > > > On (25/03/12 11:11), Minchan Kim wrote:
> > > > > On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> > > > > > This patch series introduces a new mechanism called kcompressd to
> > > > > > improve the efficiency of memory reclaiming in the operating system. The
> > > > > > main goal is to separate the tasks of page scanning and page compression
> > > > > > into distinct processes or threads, thereby reducing the load on the
> > > > > > kswapd thread and enhancing overall system performance under high memory
> > > > > > pressure conditions.
> > > > > >
> > > > > > Problem:
> > > > > >  In the current system, the kswapd thread is responsible for both
> > > > > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > > > > >  combined responsibility can lead to significant performance bottlenecks,
> > > > > >  especially under high memory pressure. The kswapd thread becomes a
> > > > > >  single point of contention, causing delays in memory reclaiming and
> > > > > >  overall system performance degradation.
> > > > >
> > > > > Isn't this a general problem if the swap backend is slow (but synchronous)?
> > > > > I think zram needs to support asynchronous I/O (it could introduce multiple
> > > > > threads to compress batched pages) and not declare itself a
> > > > > synchronous device in that case.
> > > >
> > > > The current conclusion is that kcompressd will sit above zram,
> > > > because zram is not the only compressing swap backend we have.
> >
> > Then how does that handle the file I/O case?
> 
> I didn't quite catch your question :-)

Sorry for not being clear.

What I meant was that zram is also used as backing storage for
filesystems, not only as a swap backend. Multiple simultaneous
compressions can help that case, too.

> 
> >
> > >
> > > Also, it is not good to hack zram to be aware of whether it is being
> > > used by kswapd, direct reclaim, proactive reclaim, or as a block device
> > > with a mounted filesystem.
> >
> > Why shouldn't zram be aware of that instead of just introducing
> > queues in the zram with multiple compression threads?
> >
> 
> My view is the opposite of yours :-)
> 
> Integrating kswapd, direct reclaim, etc., into the zram driver
> would violate layering principles. zram is purely a block device

That's my question. Why would zram need to know about
kswapd, direct reclaim and so on? I didn't understand your point.

> driver, and how it is used should be handled separately. Callers have
> greater flexibility to determine its usage, similar to how different
> I/O models exist in user space.
> 
> Currently, Qun-Wei's patch checks whether the current thread is kswapd.
> If it is, compression is performed asynchronously by threads;
> otherwise, it is done in the current thread. In the future, we may

Okay, then why should we do that instead of following the normal asynchronous
disk storage model? The VM just submits the IO request and sometimes applies
congestion control. Why is other logic needed for this case?

> have additional reclaim threads, such as for damon or
> madv_pageout, etc.
> 
> > >
> > > so i am thinking sth as below
> > >
> > > page_io.c
> > >
> > > if (sync_device or zswap_enabled())
> > >    schedule swap_writepage to a separate per-node thread
> >
> > I am not sure it's a good idea to mix in a feature that solves problems in
> > different layers. This isn't only a swap problem. Such parallelism under the
> > device is a common technique these days, and it would help file IO cases.
> >
> 
> zswap and zram share the same needs, and handling this in page_io
> can benefit both through common code. It is up to the callers to decide
> the I/O model.
> 
> I agree that "parallelism under the device" is a common technique,
> but our case is different—the device achieves parallelism with
> offload hardware, whereas we rely on CPUs, which can be scarce.
> These threads may also preempt CPUs that are critically needed
> by other non-compression tasks, and burst power consumption
> can sometimes be difficult to control.

That's a general problem for shared resources in the system and always a
trade-off that depends on the workload. Engineers have tuned them
statically or dynamically, depending on system behavior and on what they
prioritize.

> 
> > Furthermore, it would open the chance for zram to try compress
> > multiple pages at once.
> 
> We are already in this situation when multiple callers use zram simultaneously,
> such as during direct reclaim or with a mounted filesystem.
> 
> Of course, this allows multiple pages to be compressed simultaneously,
> even if the user is single-threaded. However, determining when to enable
> these threads and whether they will be effective is challenging, as it
> depends on system load. For example, Qun-Wei's patch chose not to use
> threads for direct reclaim as, I guess,  it might be harmful.

Direct reclaim is already harmful, and that's why the VM has logic
to throttle writeback, plus other special handling of IO on the kswapd and
direct reclaim paths, which could be applied to zram, too.

Barry Song March 13, 2025, 8:37 p.m. UTC | #20

On Fri, Mar 14, 2025 at 6:33 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Fri, Mar 14, 2025 at 05:58:00AM +1300, Barry Song wrote:
> > On Fri, Mar 14, 2025 at 5:07 AM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Thu, Mar 13, 2025 at 04:45:54PM +1300, Barry Song wrote:
> > > > On Thu, Mar 13, 2025 at 4:09 PM Sergey Senozhatsky
> > > > <senozhatsky@chromium.org> wrote:
> > > > >
> > > > > On (25/03/12 11:11), Minchan Kim wrote:
> > > > > > On Fri, Mar 07, 2025 at 08:01:02PM +0800, Qun-Wei Lin wrote:
> > > > > > > This patch series introduces a new mechanism called kcompressd to
> > > > > > > improve the efficiency of memory reclaiming in the operating system. The
> > > > > > > main goal is to separate the tasks of page scanning and page compression
> > > > > > > into distinct processes or threads, thereby reducing the load on the
> > > > > > > kswapd thread and enhancing overall system performance under high memory
> > > > > > > pressure conditions.
> > > > > > >
> > > > > > > Problem:
> > > > > > >  In the current system, the kswapd thread is responsible for both
> > > > > > >  scanning the LRU pages and compressing pages into the ZRAM. This
> > > > > > >  combined responsibility can lead to significant performance bottlenecks,
> > > > > > >  especially under high memory pressure. The kswapd thread becomes a
> > > > > > >  single point of contention, causing delays in memory reclaiming and
> > > > > > >  overall system performance degradation.
> > > > > >
> > > > > > > Isn't it a general problem if the backend for swap is slow (but synchronous)?
> > > > > > > I think zram needs to support asynchronous IO (it could introduce multiple
> > > > > > > threads to compress batched pages) and not declare itself a
> > > > > > > synchronous device in that case.
> > > > >
> > > > > The current conclusion is that kcompressd will sit above zram,
> > > > > because zram is not the only compressing swap backend we have.
> > >
> > > Then how does it handle the file IO case?
> >
> > I didn't quite catch your question :-)
>
> Sorry for not being clear.
>
> What I meant was that zram is also used as backing storage for filesystems,
> not only as a swap backend. Multiple simultaneous compressions can help that
> case, too.

I agree that multiple asynchronous threads might transparently improve
userspace read/write performance with just one thread or a very few threads.
However, it's unclear how genuine the requirement is. On the other hand,
in such cases, userspace can always optimize read/write bandwidth, for
example, by using aio_write() or similar methods if they do care about
the read/write bandwidth.
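
To make the aio_write() suggestion concrete, here is a minimal userspace
sketch of a single thread queuing a write with POSIX AIO and collecting the
result later. The helper names write_async() and aio_roundtrip_demo() and the
temporary file path are hypothetical, chosen only for illustration; nothing
here is specific to zram. On older glibc versions the program may need to be
linked with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Queue one write with POSIX AIO and wait for it to complete.
 * Returns the number of bytes written, or -1 with errno set.
 */
ssize_t write_async(int fd, const void *buf, size_t len, off_t offset)
{
	struct aiocb cb;
	const struct aiocb *const list[1] = { &cb };
	ssize_t ret;
	int err;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = (void *)buf;
	cb.aio_nbytes = len;
	cb.aio_offset = offset;
	cb.aio_sigevent.sigev_notify = SIGEV_NONE; /* no completion signal */

	if (aio_write(&cb) != 0)
		return -1;

	/* A real caller would do other work here instead of waiting at once. */
	while ((err = aio_error(&cb)) == EINPROGRESS)
		aio_suspend(list, 1, NULL);

	ret = aio_return(&cb); /* must be called exactly once per request */
	if (err != 0) {
		errno = err;
		return -1;
	}
	return ret;
}

/* Write a small buffer to a temporary file and read it back; 0 on success. */
int aio_roundtrip_demo(void)
{
	char path[] = "/tmp/aio_demo_XXXXXX";
	const char msg[] = "page data";
	char back[sizeof(msg)];
	int fd = mkstemp(path);
	int ok;

	if (fd < 0)
		return -1;
	ok = write_async(fd, msg, sizeof(msg), 0) == (ssize_t)sizeof(msg) &&
	     pread(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
	     memcmp(msg, back, sizeof(msg)) == 0;
	close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```

With several such requests in flight, one application thread can keep the
device busy without the driver having to add threads of its own.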

Once the user has multiple threads (close to the number of CPU cores),
asynchronous multi-threading won't offer any benefit and will only result
in increased context switching. I guess that is caused by the fundamental
difference between zram and real devices with hardware offload:
zram always relies on the CPU and operates synchronously (no
offload, no interrupt from HW to signal that compression has completed).

>
> >
> > >
> > > >
> > > > Also, it is not good to hack zram to be aware of whether the caller is
> > > > kswapd, direct reclaim, proactive reclaim, or a block device with a
> > > > mounted filesystem.
> > >
> > > Why shouldn't zram be aware of that instead of just introducing
> > > queues in the zram with multiple compression threads?
> > >
> >
> > My view is the opposite of yours :-)
> >
> > Integrating kswapd, direct reclaim, etc., into the zram driver
> > would violate layering principles. zram is purely a block device
>
> That's my question. Why would zram need to know about
> kswapd, direct reclaim and so on? I didn't understand your input.

Qun-Wei's patch 2/2, which modifies the zram driver, contains the following
code:

+int schedule_bio_write(void *mem, struct bio *bio, compress_callback cb)
+{
+ ...
+
+        if (!nr_kcompressd || !current_is_kswapd())
+                 return -EBUSY;
+
+}

It's clear that Qun-Wei decided to disable asynchronous threading unless
the user is kswapd. Qun-Wei might be able to provide more insight on this
decision.
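
As a rough illustration of what that check implies for callers, here is a
userspace model of the decision. It is only a sketch: struct write_ctx,
model_schedule_bio_write() and model_write_page() are hypothetical names, the
is_kswapd field stands in for current_is_kswapd(), and the synchronous
fallback on a nonzero return is my reading of the patch, not code taken from
it.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not part of the patch. */
struct write_ctx {
	int nr_kcompressd; /* number of compression threads configured */
	bool is_kswapd;    /* models current_is_kswapd() */
};

enum write_path { WRITE_ASYNC, WRITE_SYNC };

/*
 * Models the quoted check: only kswapd may hand a write to the helper
 * threads, and only if at least one helper thread exists.
 * Returns 0 if the write would be queued, -EBUSY otherwise.
 */
int model_schedule_bio_write(const struct write_ctx *ctx)
{
	if (!ctx->nr_kcompressd || !ctx->is_kswapd)
		return -EBUSY;

	return 0; /* the elided part of the patch would queue the bio here */
}

/* Assumed caller behaviour: on -EBUSY, compress in the current thread. */
enum write_path model_write_page(const struct write_ctx *ctx)
{
	return model_schedule_bio_write(ctx) == 0 ? WRITE_ASYNC : WRITE_SYNC;
}
```

In this model a direct reclaimer, or any caller when nr_kcompressd is 0,
always ends up on the synchronous path; only kswapd gets the asynchronous one.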

My guess is:

1. Determining the optimal number of threads is challenging due to varying
CPU topologies and software workloads. For example, if there are 8 threads
writing to zram, the default 4 threads might be slower than using all 8 threads
synchronously. For servers, we could have hundreds of CPUs.
On the other hand, if there is only one thread writing to zram, using 4 threads
might interfere with other workloads too much and cause the phone to heat up
quickly.

2. kswapd is the user that truly benefits from asynchronous threads. Since
it handles asynchronous memory reclamation, speeding up its process
reduces the likelihood of entering slowpath / direct reclamation. This is
where it has the greatest potential to make a positive impact.

>
> > driver, and how it is used should be handled separately. Callers have
> > greater flexibility to determine its usage, similar to how different
> > I/O models exist in user space.
> >
> > Currently, Qun-Wei's patch checks whether the current thread is kswapd.
> > If it is, compression is performed asynchronously by threads;
> > otherwise, it is done in the current thread. In the future, we may
>
> Okay, then why should we do that instead of following the normal asynchronous
> disk storage model? The VM just submits the IO request and sometimes applies
> congestion control. Why is other logic needed for this case?

It seems there is still some uncertainty about why current_is_kswapd()
is necessary, so let's get input from Qun-Wei as well.

Despite all the discussions, one important point remains: zswap might
also need this asynchronous thread. For months, Yosry and Nhat have
been urging the zram and zswap teams to collaborate on those shared
requirements. Having one per-node thread for each kswapd could be the
low-hanging fruit for both zswap and zram.
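
To show the shape of that idea, here is a userspace sketch of one worker
thread per node that takes pages off a queue while the reclaim side keeps
scanning. It uses pthreads rather than kernel threads, and every name in it
(struct node_queue, node_queue_push(), node_worker(), node_queue_demo()) is
hypothetical. It is not the prototype from [1]; the "compression" step is
reduced to a counter. On older glibc versions, build with -pthread.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 64

/* Hypothetical per-node offload queue. */
struct node_queue {
	pthread_mutex_t lock;
	pthread_cond_t more;
	int pages[QUEUE_DEPTH]; /* stand-ins for pages awaiting writeout */
	size_t head, count;
	bool stop;
	size_t compressed;      /* pages handled by the worker */
};

/* Reclaim side: queue a page; false means the caller must write it itself. */
bool node_queue_push(struct node_queue *q, int page)
{
	bool queued = false;

	pthread_mutex_lock(&q->lock);
	if (!q->stop && q->count < QUEUE_DEPTH) {
		q->pages[(q->head + q->count) % QUEUE_DEPTH] = page;
		q->count++;
		queued = true;
		pthread_cond_signal(&q->more);
	}
	pthread_mutex_unlock(&q->lock);
	return queued;
}

/* Worker side: drain the queue until asked to stop and nothing is left. */
void *node_worker(void *arg)
{
	struct node_queue *q = arg;

	pthread_mutex_lock(&q->lock);
	for (;;) {
		while (q->count == 0 && !q->stop)
			pthread_cond_wait(&q->more, &q->lock);
		if (q->count == 0)
			break; /* stop requested and queue drained */
		q->head = (q->head + 1) % QUEUE_DEPTH;
		q->count--;
		/* Real code would drop the lock and compress the page here. */
		q->compressed++;
	}
	pthread_mutex_unlock(&q->lock);
	return NULL;
}

/* Push n pages; return how many were handled (offloaded plus fallback). */
size_t node_queue_demo(size_t n)
{
	struct node_queue q = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.more = PTHREAD_COND_INITIALIZER,
	};
	size_t sync_fallback = 0;
	pthread_t tid;

	if (pthread_create(&tid, NULL, node_worker, &q) != 0)
		return 0;
	for (size_t i = 0; i < n; i++)
		if (!node_queue_push(&q, (int)i))
			sync_fallback++; /* queue full: caller handles the page */

	pthread_mutex_lock(&q.lock);
	q.stop = true;
	pthread_cond_signal(&q.more);
	pthread_mutex_unlock(&q.lock);
	pthread_join(tid, NULL);
	return q.compressed + sync_fallback;
}
```

The bounded queue is the design point worth noting: when the worker falls
behind, the reclaim side goes back to doing the write itself instead of
building up an unbounded backlog.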

Additionally, I don't see how the prototype I proposed here [1] would conflict
with potential future optimizations in zram, particularly those aimed at
improving filesystem read/write performance through multiple asynchronous
threads, if that is indeed a valid requirement.

[1] https://lore.kernel.org/lkml/20250313093005.13998-1-21cnbao@gmail.com/

>
> > have additional reclaim threads, such as for damon or
> > madv_pageout, etc.
> >
> > > >
> > > > so i am thinking sth as below
> > > >
> > > > page_io.c
> > > >
> > > > if (sync_device or zswap_enabled())
> > > >    schedule swap_writepage to a separate per-node thread
> > >
> > > I am not sure it's a good idea to mix in a feature that solves problems in
> > > different layers. This isn't only a swap problem. Such parallelism under the
> > > device is a common technique these days, and it would help file IO cases.
> > >
> >
> > zswap and zram share the same needs, and handling this in page_io
> > can benefit both through common code. It is up to the callers to decide
> > the I/O model.
> >
> > I agree that "parallelism under the device" is a common technique,
> > but our case is different—the device achieves parallelism with
> > offload hardware, whereas we rely on CPUs, which can be scarce.
> > These threads may also preempt CPUs that are critically needed
> > by other non-compression tasks, and burst power consumption
> > can sometimes be difficult to control.
>
> That's a general problem for shared resources in the system and always a
> trade-off that depends on the workload. Engineers have tuned them
> statically or dynamically, depending on system behavior and on what they
> prioritize.

Right, but haven't we yet taken on the task of tuning multi-threaded zram?

>
> >
> > > Furthermore, it would open the chance for zram to try compress
> > > multiple pages at once.
> >
> > We are already in this situation when multiple callers use zram simultaneously,
> > such as during direct reclaim or with a mounted filesystem.
> >
> > Of course, this allows multiple pages to be compressed simultaneously,
> > even if the user is single-threaded. However, determining when to enable
> > these threads and whether they will be effective is challenging, as it
> > depends on system load. For example, Qun-Wei's patch chose not to use
> > threads for direct reclaim as, I guess,  it might be harmful.
>
> Direct reclaim is already harmful, and that's why the VM has logic
> to throttle writeback, plus other special handling of IO on the kswapd and
> direct reclaim paths, which could be applied to zram, too.

I'm not entirely sure that the existing congestion control or writeback
throttling can tune itself effectively for non-offload resources. For
offload resources, the number of queues and the bandwidth remain stable,
but CPU capacity fluctuates as system workloads change.

Thanks
Barry
