mm: zswap: fix the lack of page lru flag in zswap_writeback_entry

Message ID 20231024142706.195517-1-hezhongkun.hzk@bytedance.com (mailing list archive)
State New
Series mm: zswap: fix the lack of page lru flag in zswap_writeback_entry

Commit Message

Zhongkun He Oct. 24, 2023, 2:27 p.m. UTC
zswap_writeback_entry() adds a page to the swap cache, decompresses the
entry data into the page, and issues a bio write to write the page back
to the swap device. It then tries to move the page to the tail of the
inactive LRU list via SetPageReclaim(page) and folio_rotate_reclaimable().

Currently, about half of these pages fail to be moved to the tail of the
LRU list, because the page has no LRU flag set: it is not on the LRU list
yet but still sits in the per-CPU cpu_fbatches. Fix this by draining the
per-CPU LRU batches before starting writeback.

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 mm/zswap.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Andrew Morton Dec. 29, 2023, 7:44 p.m. UTC | #1
On Tue, 24 Oct 2023 22:27:06 +0800 Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:

> The zswap_writeback_entry() will add a page to the swap cache, decompress
> the entry data into the page, and issue a bio write to write the page back
> to the swap device. Move the page to the tail of lru list  through
> SetPageReclaim(page) and folio_rotate_reclaimable().
> 
> Currently, about half of the pages will fail to move to the tail of lru
> list because there is no LRU flag in page which is not in the LRU list but
> the cpu_fbatches. So fix it.

Some reviewer input on this change would be helpful, please.

Also, is there any measurable change in overall performance?

> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1139,6 +1139,11 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	/* move it to the tail of the inactive list after end_writeback */
>  	SetPageReclaim(page);
>  
> +	if (!PageLRU(page)) {
> +		/* drain lru cache to help folio_rotate_reclaimable() */
> +		lru_add_drain();
> +	}
> +
>  	/* start writeback */
>  	__swap_writepage(page, &wbc);
>  	put_page(page);
> -- 
> 2.25.1
>
Nhat Pham Dec. 30, 2023, 2:09 a.m. UTC | #2
On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>

My apologies for the delayed response. I have a couple of questions.

> The zswap_writeback_entry() will add a page to the swap cache, decompress
> the entry data into the page, and issue a bio write to write the page back
> to the swap device. Move the page to the tail of lru list  through
> SetPageReclaim(page) and folio_rotate_reclaimable().
>
> Currently, about half of the pages will fail to move to the tail of lru

May I ask what's the downstream effect of this? i.e so what if it
fails? And yes, as Andrew pointed out, it'd be nice if the patch
changelog spells out any observable or measurable change from the
user's POV.

> list because there is no LRU flag in page which is not in the LRU list but
> the cpu_fbatches. So fix it.

This sentence is a bit confusing to me. Does this mean the page
currently being processed for writeback is not in the LRU list
(!PageLRU(page)), but IN one of the cpu folio batches? Which makes
folio_rotate_reclaimable() fails on this page later on in the
_swap_writepage() path? (hence the necessity of lru_add_drain()?)

Let me know if I'm misunderstanding the intention of this patch. I
know it's a bit pedantic, but spelling things out (ideally in the
changelog itself) will help the reviewers, as well as future
contributors who want to study the codebase and make changes to it.

>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>

Thanks and look forward to your response,
Nhat

P/S: Have a nice holiday season and happy new year!
Zhongkun He Jan. 2, 2024, 11:39 a.m. UTC | #3
On Sat, Dec 30, 2023 at 10:09 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
> >
>
> My apologies for the delayed response. I have a couple of questions.
>
> > The zswap_writeback_entry() will add a page to the swap cache, decompress
> > the entry data into the page, and issue a bio write to write the page back
> > to the swap device. Move the page to the tail of lru list  through
> > SetPageReclaim(page) and folio_rotate_reclaimable().
> >
> > Currently, about half of the pages will fail to move to the tail of lru
>
> May I ask what's the downstream effect of this? i.e so what if it
> fails? And yes, as Andrew pointed out, it'd be nice if the patch
> changelog spells out any observable or measurable change from the
> user's POV.
>

The swap cache page used to decompress the zswap_entry should be moved
to the tail of the inactive list after writeback ends, so that it can be
released promptly. That is the intent of the following code in
zswap_writeback_entry():

   /* move it to the tail of the inactive list after end_writeback */
   SetPageReclaim(page);

However, after writeback finishes, folio_rotate_reclaimable() fails,
because the page is not on the LRU list but still in one of the per-CPU
folio batches. So setting PG_reclaim does not achieve its goal, and the
pages cannot be freed in time.
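
For reference, a simplified sketch of folio_rotate_reclaimable() (paraphrased
from mm/swap.c of roughly this kernel version, so details may differ) shows
why the rotation is silently skipped when the folio has no LRU flag:

void folio_rotate_reclaimable(struct folio *folio)
{
        /* Nothing happens unless the folio is already on the LRU. */
        if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
            !folio_test_unevictable(folio) && folio_test_lru(folio)) {
                struct folio_batch *fbatch;
                unsigned long flags;

                folio_get(folio);
                local_lock_irqsave(&lru_rotate.lock, flags);
                fbatch = this_cpu_ptr(&lru_rotate.fbatch);
                folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
                local_unlock_irqrestore(&lru_rotate.lock, flags);
        }
}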

> > list because there is no LRU flag in page which is not in the LRU list but
> > the cpu_fbatches. So fix it.
>
> This sentence is a bit confusing to me. Does this mean the page
> currently being processed for writeback is not in the LRU list
> (!PageLRU(page)), but IN one of the cpu folio batches? Which makes
> folio_rotate_reclaimable() fails on this page later on in the
> _swap_writepage() path? (hence the necessity of lru_add_drain()?)
>

Yes, exactly.

> Let me know if I'm misunderstanding the intention of this patch. I
> know it's a bit pedantic, but spelling things out (ideally in the
> changelog itself) will help the reviewers, as well as future
> contributors who want to study the codebase and make changes to it.
>

Sorry, my bad.

> >
> > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
>
> Thanks and look forward to your response,
> Nhat
>
> P/S: Have a nice holiday season and happy new year!

Here are the steps and results of the performance test:
1: zswap + zram (simplified model with no IO)
2: disable zswap/parameters/same_filled_pages_enabled (stress produces identical pages)
3: stress --vm 1 --vm-bytes 2g --vm-hang 0   (2Gi of anon pages)
4: In order to quickly release zswap_entries, I used the previous
   patch (I will send it again later).
https://lore.kernel.org/all/20231025095248.458789-1-hezhongkun.hzk@bytedance.com/

Performance result:
   reclaim 1Gi of zswap_entries

time echo 1 > writeback_time_threshold
(releases zswap_entries that have not been accessed for more than 1 second)

Base                           With this patch
real    0m1.015s               real    0m1.043s
user    0m0.000s               user    0m0.001s
sys     0m1.013s               sys     0m1.040s

So no obvious performance regression was found.

After writeback, we perform the following step to release the memory again:
echo 1g > memory.reclaim

Base:
                total        used      reclaim      total        used
Mem:             38Gi       2.5Gi      ---->         38Gi       1.5Gi
Swap:           5.0Gi       1.0Gi      ---->          5Gi       1.5Gi

used memory -1Gi, swap +0.5Gi
This means that half of the pages failed to move to the tail of the LRU list,
so we need to push an additional 0.5Gi of anon pages out to swap space.

With this patch:
                total        used      reclaim      total        used
Mem:             38Gi       2.6Gi      ---->         38Gi       1.6Gi
Swap:           5.0Gi       1.0Gi      ---->          5Gi         1Gi

used memory -1Gi, swap +0Gi
This means that we released all of the pages that were added to the tail of
the LRU list by zswap_writeback_entry() and folio_rotate_reclaimable().


Thanks for your time Nhat and Andrew. Happy New Year!
Zhongkun He Jan. 2, 2024, 2:09 p.m. UTC | #4
>
> Base:
>                      total        used        recalim        total        used
> Mem:           38Gi       2.5Gi       ---->             38Gi       1.5Gi
> Swap:         5.0Gi       1.0Gi       ---->              5Gi        1.5Gi
> used memory  -1G   swap +0.5g
> It means that  half of the pages are failed to move to the tail of lru list,
> So we need to release an additional 0.5Gi anon pages to swap space.

Based on the results of multiple tests, here is updated data:

Base:
                total        used      reclaim      total        used
Mem:             38Gi       2.6Gi      ---->         38Gi       1.6Gi
Swap:           5.0Gi       1.0Gi      ---->          5Gi       1.9Gi

used memory -1Gi, swap +0.9Gi
This means that most of the pages failed to move to the tail of the LRU list,
so we need to push an additional 0.9Gi of anon pages out to swap space.

Thanks again.
Nhat Pham Jan. 2, 2024, 11:27 p.m. UTC | #5
On Tue, Jan 2, 2024 at 3:39 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
>
> On Sat, Dec 30, 2023 at 10:09 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Oct 24, 2023 at 7:27 AM Zhongkun He
> > <hezhongkun.hzk@bytedance.com> wrote:
> > >
> >
> > My apologies for the delayed response. I have a couple of questions.
> >
> > > The zswap_writeback_entry() will add a page to the swap cache, decompress
> > > the entry data into the page, and issue a bio write to write the page back
> > > to the swap device. Move the page to the tail of lru list  through
> > > SetPageReclaim(page) and folio_rotate_reclaimable().
> > >
> > > Currently, about half of the pages will fail to move to the tail of lru
> >
> > May I ask what's the downstream effect of this? i.e so what if it
> > fails? And yes, as Andrew pointed out, it'd be nice if the patch
> > changelog spells out any observable or measurable change from the
> > user's POV.
> >
>
> The swap cache page used to decompress zswap_entry should be
> moved  to the tail of the inactive list after end_writeback, We can release
> them in time.Just like the following code in zswap_writeback_entry().
>
>    /* move it to the tail of the inactive list after end_writeback */
>    SetPageReclaim(page);
>
> After the writeback is over, the function of
> folio_rotate_reclaimable() will fail
> because the page is not in the LRU list but in some of the cpu folio batches.
> Therefore we did not achieve the goal of setting SetPageReclaim(page), and
> the pages could not be free in time.
>
> > > list because there is no LRU flag in page which is not in the LRU list but
> > > the cpu_fbatches. So fix it.
> >
> > This sentence is a bit confusing to me. Does this mean the page
> > currently being processed for writeback is not in the LRU list
> > (!PageLRU(page)), but IN one of the cpu folio batches? Which makes
> > folio_rotate_reclaimable() fails on this page later on in the
> > _swap_writepage() path? (hence the necessity of lru_add_drain()?)
> >
>
> Yes, exactly.
>
> > Let me know if I'm misunderstanding the intention of this patch. I
> > know it's a bit pedantic, but spelling things out (ideally in the
> > changelog itself) will help the reviewers, as well as future
> > contributors who want to study the codebase and make changes to it.
> >
>
> Sorry,my bad.
>
> > >
> > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> >
> > Thanks and look forward to your response,
> > Nhat
> >
> > P/S: Have a nice holiday season and happy new year!
>
> Here are the steps and results of the performance test:
> 1:zswap+ zram (simplified model with on IO)
> 2:disabel zswap/parameters/same_filled_pages_enabled (stress have same pages)
> 3:stress --vm 1 --vm-bytes 2g --vm-hang 0   (2Gi anon pages)
> 4: In order to quickly release zswap_entry, I used the previous
>      patch (I will send it again later).
> https://lore.kernel.org/all/20231025095248.458789-1-hezhongkun.hzk@bytedance.com/
>
> Performance result:
>    reclaim 1Gi zswap_entry
>
> time echo 1 > writeback_time_threshold
> (will release the zswap_entry,  not been accessed for more than 1 seconds )
>
> Base                                 With this patch
> real    0m1.015s               real    0m1.043s
> user    0m0.000s              user    0m0.001s
> sys     0m1.013s               sys     0m1.040s
> So no obvious performance regression was found.

That's around 2.7% increase in real time, no? Admittedly, this
micro-benchmark is too small to conclude either way, but the data
doesn't seem to be in your favor.

I'm a bit concerned about the overhead here, given that with this
patch we will drain the per-cpu batch on every written-back entry.
That's quite a high frequency, especially since we're moving towards
more writeback (either with the new zswap shrinker, or your time
threshold-based writeback mechanism). For instance, there seems to be
some (local/per-cpu) locking involved, no? Could there be some form of
lock contentions there (especially since with the new shrinker, you
can have a lot of concurrent writeback contexts?)

Furthermore, note that a writeback from zswap to swap saves less
memory than a writeback from memory to swap, so the effect of the
extra overhead will be even more pronounced. That is, the amount of
extra work done (from this change) to save one unit of memory would be
even larger than if we call lru_add_drain() every time we swap out a
page (from memory -> swap). And I'm pretty sure we don't call
lru_add_drain() every time we swap out a page - I believe we call
lru_add_drain() every time we perform a shrink action. For e.g, in
shrink_inactive_list(). That's much coarser in granularity than here.
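
(For reference, lru_add_drain() looks roughly like this in mm/swap.c -
paraphrased from around this kernel version, so details may differ - it takes
the per-CPU fbatches lock and flushes every LRU batch on that CPU:

void lru_add_drain(void)
{
        local_lock(&cpu_fbatches.lock);
        /*
         * lru_add_drain_cpu() flushes the lru_add, lru_rotate,
         * lru_deactivate*, lru_lazyfree and activate batches for this CPU.
         */
        lru_add_drain_cpu(smp_processor_id());
        local_unlock(&cpu_fbatches.lock);
        mlock_drain_local();
}

Doing all of that once per written-back entry is what worries me.)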

Also, IIUC, the more often we perform lru_add_drain(), the less
batching effect we will obtain. IOW, the overhead of maintaining the
batch will become higher relative to the performance gains from
batching.

Maybe I'm missing something - could you walk me through how
lru_add_drain() is fine here, from this POV? Thanks!

>
> After writeback, we perform the following steps to release the memory again
> echo 1g > memory.reclaim
>
> Base:
>                      total        used        recalim        total        used
> Mem:           38Gi       2.5Gi       ---->             38Gi       1.5Gi
> Swap:         5.0Gi       1.0Gi       ---->              5Gi        1.5Gi
> used memory  -1G   swap +0.5g
> It means that  half of the pages are failed to move to the tail of lru list,
> So we need to release an additional 0.5Gi anon pages to swap space.
>
> With this patch:
>                      total        used        recalim        total        used
> Mem:           38Gi       2.6Gi       ---->             38Gi       1.6Gi
> Swap:         5.0Gi       1.0Gi       ---->              5Gi        1Gi
>
> used memory  -1Gi,  swap +0Gi
> It means that we release all the pages which have been add to the tail of
> lru list in zswap_writeback_entry() and folio_rotate_reclaimable().
>

OTOH, this suggests that we're onto something. Swap usage seems to
decrease quite a bit. Sounds like a real problem that this patch is
tackling.
(Please add this benchmark result to future changelog. It'll help
demonstrate the problem).

I'm inclined to ack this patch, but it'd be nice if you can assuage my
concerns above (with some justification and/or larger benchmark).

(Or perhaps, we have to drain, but less frequently/higher up the stack?)

Thanks,
Nhat

>
> Thanks for your time Nhat and Andrew. Happy New Year!
Zhongkun He Jan. 3, 2024, 2:12 p.m. UTC | #6
> That's around 2.7% increase in real time, no? Admittedly, this
> micro-benchmark is too small to conclude either way, but the data
> doesn't seem to be in your favor.
>
> I'm a bit concerned about the overhead here, given that with this
> patch we will drain the per-cpu batch on every written-back entry.
> That's quite a high frequency, especially since we're moving towards
> more writeback (either with the new zswap shrinker, or your time
> threshold-based writeback mechanism). For instance, there seems to be
> some (local/per-cpu) locking involved, no? Could there be some form of
> lock contentions there (especially since with the new shrinker, you
> can have a lot of concurrent writeback contexts?)
>
> Furthermore, note that a writeback from zswap to swap saves less
> memory than a writeback from memory to swap, so the effect of the
> extra overhead will be even more pronounced. That is, the amount of
> extra work done (from this change) to save one unit of memory would be
> even larger than if we call lru_add_drain() every time we swap out a
> page (from memory -> swap). And I'm pretty sure we don't call
> lru_add_drain() every time we swap out a page - I believe we call
> lru_add_drain() every time we perform a shrink action. For e.g, in
> shrink_inactive_list(). That's much coarser in granularity than here.
>
> Also, IIUC, the more often we perform lru_add_drain(), the less
> batching effect we will obtain. IOW, the overhead of maintaining the
> batch will become higher relative to the performance gains from
> batching.
>
> Maybe I'm missing something - could you walk me through how
> lru_add_drain() is fine here, from this POV? Thanks!
>
> >
> > After writeback, we perform the following steps to release the memory again
> > echo 1g > memory.reclaim
> >
> > Base:
> >                      total        used        recalim        total        used
> > Mem:           38Gi       2.5Gi       ---->             38Gi       1.5Gi
> > Swap:         5.0Gi       1.0Gi       ---->              5Gi        1.5Gi
> > used memory  -1G   swap +0.5g
> > It means that  half of the pages are failed to move to the tail of lru list,
> > So we need to release an additional 0.5Gi anon pages to swap space.
> >
> > With this patch:
> >                      total        used        recalim        total        used
> > Mem:           38Gi       2.6Gi       ---->             38Gi       1.6Gi
> > Swap:         5.0Gi       1.0Gi       ---->              5Gi        1Gi
> >
> > used memory  -1Gi,  swap +0Gi
> > It means that we release all the pages which have been add to the tail of
> > lru list in zswap_writeback_entry() and folio_rotate_reclaimable().
> >
>
> OTOH, this suggests that we're onto something. Swap usage seems to
> decrease quite a bit. Sounds like a real problem that this patch is
> tackling.
> (Please add this benchmark result to future changelog. It'll help
> demonstrate the problem).

Yes

>
> I'm inclined to ack this patch, but it'd be nice if you can assuage my
> concerns above (with some justification and/or larger benchmark).
>

OK,thanks.

> (Or perhaps, we have to drain, but less frequently/higher up the stack?)
>

I've reviewed the code again and have no better idea. It would be great if
you have any suggestions.

New test:
This patch adds the execution of folio_rotate_reclaimable() (not executed
without this patch) and lru_add_drain(), including per-CPU lock competition.
I bind a new task that allocates memory, so that it uses the same batch lock
and competes with the target process on the same CPU.
Context:
1: stress --vm 1 --vm-bytes 1g (bound to cpu0)
2: stress --vm 1 --vm-bytes 5g --vm-hang 0 (bound to cpu0)
3: reclaim pages, and write back 5G of zswap_entries on cpu0 and node 0.

Average time of five tests:

Base       patch       patch + compete
4.947      5.0676      5.1336
           +2.4%       +3.7%

"compete" means a new stress run on cpu0 that competes with the writeback
process:

 PID   USER   %CPU   %MEM    TIME+    COMMAND                    P
 1367  root   49.5    0.0    1:09.17  bash   (writeback)         0
 1737  root   49.5    2.2    0:27.46  stress (uses percpu lock)  0

Around a 2.4% increase in real time, which includes the execution of
folio_rotate_reclaimable() (not executed without this patch) and
lru_add_drain(), but with no lock contention.

Around a 1.3% additional increase in real time with lock contention on the
same CPU.

There is another option here: do not try to move the page to the tail of the
inactive list after end_writeback at all, and delete the following code in
zswap_writeback_entry(), which does not work properly anyway. But then the
pages will not be released first.

/* move it to the tail of the inactive list after end_writeback */
SetPageReclaim(page);

Thanks,
Zhongkun

> Thanks,
> Nhat
>
> >
> > Thanks for your time Nhat and Andrew. Happy New Year!
Nhat Pham Jan. 4, 2024, 7:42 p.m. UTC | #7
On Wed, Jan 3, 2024 at 6:12 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
>
> > That's around 2.7% increase in real time, no? Admittedly, this
> > micro-benchmark is too small to conclude either way, but the data
> > doesn't seem to be in your favor.
> >
> > I'm a bit concerned about the overhead here, given that with this
> > patch we will drain the per-cpu batch on every written-back entry.
> > That's quite a high frequency, especially since we're moving towards
> > more writeback (either with the new zswap shrinker, or your time
> > threshold-based writeback mechanism). For instance, there seems to be
> > some (local/per-cpu) locking involved, no? Could there be some form of
> > lock contentions there (especially since with the new shrinker, you
> > can have a lot of concurrent writeback contexts?)
> >
> > Furthermore, note that a writeback from zswap to swap saves less
> > memory than a writeback from memory to swap, so the effect of the
> > extra overhead will be even more pronounced. That is, the amount of
> > extra work done (from this change) to save one unit of memory would be
> > even larger than if we call lru_add_drain() every time we swap out a
> > page (from memory -> swap). And I'm pretty sure we don't call
> > lru_add_drain() every time we swap out a page - I believe we call
> > lru_add_drain() every time we perform a shrink action. For e.g, in
> > shrink_inactive_list(). That's much coarser in granularity than here.
> >
> > Also, IIUC, the more often we perform lru_add_drain(), the less
> > batching effect we will obtain. IOW, the overhead of maintaining the
> > batch will become higher relative to the performance gains from
> > batching.
> >
> > Maybe I'm missing something - could you walk me through how
> > lru_add_drain() is fine here, from this POV? Thanks!
> >
> > >
> > > After writeback, we perform the following steps to release the memory again
> > > echo 1g > memory.reclaim
> > >
> > > Base:
> > >                      total        used        recalim        total        used
> > > Mem:           38Gi       2.5Gi       ---->             38Gi       1.5Gi
> > > Swap:         5.0Gi       1.0Gi       ---->              5Gi        1.5Gi
> > > used memory  -1G   swap +0.5g
> > > It means that  half of the pages are failed to move to the tail of lru list,
> > > So we need to release an additional 0.5Gi anon pages to swap space.
> > >
> > > With this patch:
> > >                      total        used        recalim        total        used
> > > Mem:           38Gi       2.6Gi       ---->             38Gi       1.6Gi
> > > Swap:         5.0Gi       1.0Gi       ---->              5Gi        1Gi
> > >
> > > used memory  -1Gi,  swap +0Gi
> > > It means that we release all the pages which have been add to the tail of
> > > lru list in zswap_writeback_entry() and folio_rotate_reclaimable().
> > >
> >
> > OTOH, this suggests that we're onto something. Swap usage seems to
> > decrease quite a bit. Sounds like a real problem that this patch is
> > tackling.
> > (Please add this benchmark result to future changelog. It'll help
> > demonstrate the problem).
>
> Yes
>
> >
> > I'm inclined to ack this patch, but it'd be nice if you can assuage my
> > concerns above (with some justification and/or larger benchmark).
> >
>
> OK,thanks.
>
> > (Or perhaps, we have to drain, but less frequently/higher up the stack?)
> >
>
> I've reviewed the code again and have no idea. It would be better if
> you have any suggestions.

Hmm originally I was thinking of doing an (unconditional)
lru_add_drain() outside of zswap_writeback_entry() - once in
shrink_worker() and/or zswap_shrinker_scan(), before we write back any
of the entries. Not sure if it would work/help here tho - haven't
tested that idea yet.

>
> New test:
> This patch will add the execution of folio_rotate_reclaimable(not executed
> without this patch) and lru_add_drain,including percpu lock competition.
> I bind a new task to allocate memory and use the same batch lock to compete
> with the target process, on the same CPU.
> context:
> 1:stress --vm 1 --vm-bytes 1g    (bind to cpu0)
> 2:stress --vm 1 --vm-bytes 5g --vm-hang 0(bind to cpu0)
> 3:reclaim pages, and writeback 5G zswap_entry in cpu0 and node 0.
>
> Average time of five tests
>
> Base      patch            patch + compete
> 4.947      5.0676          5.1336
>                 +2.4%          +3.7%
> compete means: a new stress run in cpu0 to compete with the writeback process.
> PID USER        %CPU  %MEM     TIME+ COMMAND                         P
>  1367 root         49.5      0.0       1:09.17 bash     (writeback)            0
>  1737 root         49.5      2.2       0:27.46 stress      (use percpu
> lock)    0
>
> around 2.4% increase in real time,including the execution of
> folio_rotate_reclaimable(not executed without this patch) and lru_add_drain,but
> no lock contentions.

Hmm looks like the regression is still there, no?

>
> around 1.3% additional  increase in real time with lock contentions on the same
> cpu.
>
> There is another option here, which is not to move the page to the
> tail of the inactive
> list after end_writeback and delete the following code in
> zswap_writeback_entry(),
> which did not work properly. But the pages will not be released first.
>
> /* move it to the tail of the inactive list after end_writeback */
> SetPageReclaim(page);

Or only SetPageReclaim on pages on LRU?

>
> Thanks,
> Zhongkun
>
> > Thanks,
> > Nhat
> >
> > >
> > > Thanks for your time Nhat and Andrew. Happy New Year!
Zhongkun He Jan. 5, 2024, 2:10 p.m. UTC | #8
>
> Hmm originally I was thinking of doing an (unconditional)
> lru_add_drain() outside of zswap_writeback_entry() - once in
> shrink_worker() and/or zswap_shrinker_scan(), before we write back any
> of the entries. Not sure if it would work/help here tho - haven't
> tested that idea yet.
>

The page is allocated by __read_swap_cache_async() in
zswap_writeback_entry(), and it must be newly allocated, not already cached
in the swap cache. Please see the code below in zswap_writeback_entry():

        page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
                        NO_INTERLEAVE_INDEX, &page_was_allocated);
        if (!page)
                goto fail;

        /* Found an existing page, we raced with load/swapin */
        if (!page_was_allocated) {
                put_page(page);
                ret = -EEXIST;
                goto fail;
        }

So by the time we get to SetPageReclaim(page), the page has just been
allocated and is still in the per-CPU batch; it has not been added to the
LRU yet.

Therefore, lru_add_drain() would not help outside of
zswap_writeback_entry().
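
(For reference, folio_add_lru() - which __read_swap_cache_async() uses to put
the newly allocated folio on the LRU - is roughly the following in mm/swap.c;
this is a paraphrased sketch with the asserts and MGLRU handling omitted, so
details may differ:

void folio_add_lru(struct folio *folio)
{
        struct folio_batch *fbatch;

        /* The folio goes into this CPU's lru_add batch first ... */
        folio_get(folio);
        local_lock(&cpu_fbatches.lock);
        fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
        /*
         * ... and only lands on the LRU (and gets the LRU flag set) once the
         * batch is moved, e.g. when it fills up or is drained.
         */
        folio_batch_add_and_move(fbatch, folio, lru_add_fn);
        local_unlock(&cpu_fbatches.lock);
}

That is why the folio still has no LRU flag when zswap sets PG_reclaim.)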

> >
> > New test:
> > This patch will add the execution of folio_rotate_reclaimable(not executed
> > without this patch) and lru_add_drain,including percpu lock competition.
> > I bind a new task to allocate memory and use the same batch lock to compete
> > with the target process, on the same CPU.
> > context:
> > 1:stress --vm 1 --vm-bytes 1g    (bind to cpu0)
> > 2:stress --vm 1 --vm-bytes 5g --vm-hang 0(bind to cpu0)
> > 3:reclaim pages, and writeback 5G zswap_entry in cpu0 and node 0.
> >
> > Average time of five tests
> >
> > Base      patch            patch + compete
> > 4.947      5.0676          5.1336
> >                 +2.4%          +3.7%
> > compete means: a new stress run in cpu0 to compete with the writeback process.
> > PID USER        %CPU  %MEM     TIME+ COMMAND                         P
> >  1367 root         49.5      0.0       1:09.17 bash     (writeback)            0
> >  1737 root         49.5      2.2       0:27.46 stress      (use percpu
> > lock)    0
> >
> > around 2.4% increase in real time,including the execution of
> > folio_rotate_reclaimable(not executed without this patch) and lru_add_drain,but
> > no lock contentions.
>
> Hmm looks like the regression is still there, no?

Yes, it cannot be eliminated.

>
> >
> > around 1.3% additional  increase in real time with lock contentions on the same
> > cpu.
> >
> > There is another option here, which is not to move the page to the
> > tail of the inactive
> > list after end_writeback and delete the following code in
> > zswap_writeback_entry(),
> > which did not work properly. But the pages will not be released first.
> >
> > /* move it to the tail of the inactive list after end_writeback */
> > SetPageReclaim(page);
>
> Or only SetPageReclaim on pages on LRU?

No, all of the pages are newly allocated and not on the LRU.

This patch should just call lru_add_drain() directly and remove the
if statement.
The purpose of writing the data back is to release the page,
so I think it is necessary to fix this.

Thanks for your time, Nhat.
Nhat Pham Jan. 7, 2024, 6:53 p.m. UTC | #9
On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
>
> >
> > Hmm originally I was thinking of doing an (unconditional)
> > lru_add_drain() outside of zswap_writeback_entry() - once in
> > shrink_worker() and/or zswap_shrinker_scan(), before we write back any
> > of the entries. Not sure if it would work/help here tho - haven't
> > tested that idea yet.
> >
>
> The pages are allocated by __read_swap_cache_async() in
>  zswap_writeback_entry() and it must be newly allocated,
> not cached in swap.
> Please see the code below in zswap_writeback_entry()
>
> page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>                 NO_INTERLEAVE_INDEX, &page_was_allocated);
>     if (!page) {
>         goto fail;}
>     /* Found an existing page, we raced with load/swapin */
>     if (!page_was_allocated) {
>         put_page(page);
>         ret = -EEXIST;
>         goto fail;
>     }
>
> So when it comes to SetPageReclaim(page),
> The page has just been allocated and is still in the percpu batch,
> which has not been added to the LRU.
>
> Therefore,lru_add_drain() did not work outside the
> zswap_writeback_entry()

You're right hmmm.

>
> > >
> > > New test:
> > > This patch will add the execution of folio_rotate_reclaimable(not executed
> > > without this patch) and lru_add_drain,including percpu lock competition.
> > > I bind a new task to allocate memory and use the same batch lock to compete
> > > with the target process, on the same CPU.
> > > context:
> > > 1:stress --vm 1 --vm-bytes 1g    (bind to cpu0)
> > > 2:stress --vm 1 --vm-bytes 5g --vm-hang 0(bind to cpu0)
> > > 3:reclaim pages, and writeback 5G zswap_entry in cpu0 and node 0.
> > >
> > > Average time of five tests
> > >
> > > Base      patch            patch + compete
> > > 4.947      5.0676          5.1336
> > >                 +2.4%          +3.7%
> > > compete means: a new stress run in cpu0 to compete with the writeback process.
> > > PID USER        %CPU  %MEM     TIME+ COMMAND                         P
> > >  1367 root         49.5      0.0       1:09.17 bash     (writeback)            0
> > >  1737 root         49.5      2.2       0:27.46 stress      (use percpu
> > > lock)    0
> > >
> > > around 2.4% increase in real time,including the execution of
> > > folio_rotate_reclaimable(not executed without this patch) and lru_add_drain,but
> > > no lock contentions.
> >
> > Hmm looks like the regression is still there, no?
>
> Yes, it cannot be eliminated.

Yeah this solution is not convincing to me. The overall effect of this
patch is we're trading runtime to save some swap usage. That seems
backward to me? Are there any other observable benefits to this?

Otherwise, unless the benchmark is an adversarial workload that is not
representative of the common case (and you'll need to show me some
alternative benchmark numbers or justifications to convince me here),
IMHO this is a risky change to merge. This is not a feature that is
gated behind a flag that users can safely ignore/disable.

>
> >
> > >
> > > around 1.3% additional  increase in real time with lock contentions on the same
> > > cpu.
> > >
> > > There is another option here, which is not to move the page to the
> > > tail of the inactive
> > > list after end_writeback and delete the following code in
> > > zswap_writeback_entry(),
> > > which did not work properly. But the pages will not be released first.
> > >
> > > /* move it to the tail of the inactive list after end_writeback */
> > > SetPageReclaim(page);
> >
> > Or only SetPageReclaim on pages on LRU?
>
> No, all the pages are newly allocted and not on LRU.
>
> This patch should add lru_add_drain() directly, remove the
> if statement.
> The purpose of writing back data is to release the page,
> so I think it is necessary to fix it.
>
> Thanks for your time, Nhat.

Is there a way to detect these kinds of folios at
folio_rotate_reclaimable() callsite? i.e if this is a zswap-owned page
etc. etc.?
Nhat Pham Jan. 7, 2024, 9:29 p.m. UTC | #10
On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
>
> > > There is another option here, which is not to move the page to the
> > > tail of the inactive
> > > list after end_writeback and delete the following code in
> > > zswap_writeback_entry(),
> > > which did not work properly. But the pages will not be released first.
> > >
> > > /* move it to the tail of the inactive list after end_writeback */
> > > SetPageReclaim(page);


Ok, so I took a look at the patch that originally introduced this
piece of logic:

https://github.com/torvalds/linux/commit/b349acc76b7f65400b85abd09a5379ddd6fa5a97

Looks like it's not for the sake of correctness, but only as a
best-effort optimization (reducing page scanning). If it doesn't bring
any benefit (i.e due to the newly allocated page still on the cpu
batch), then we can consider removing it. After all, if you're right
and it's not really doing anything here - why bother. Perhaps we can
replace this with some other mechanism to avoid it being scanned for
reclaim.

I would cc Weijie as well, as he is the original author of this.
Nhat Pham Jan. 7, 2024, 9:59 p.m. UTC | #11
On Sun, Jan 7, 2024 at 1:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> >
> > > > There is another option here, which is not to move the page to the
> > > > tail of the inactive
> > > > list after end_writeback and delete the following code in
> > > > zswap_writeback_entry(),
> > > > which did not work properly. But the pages will not be released first.
> > > >
> > > > /* move it to the tail of the inactive list after end_writeback */
> > > > SetPageReclaim(page);
>
>
> Ok, so I took a look at the patch that originally introduced this
> piece of logic:
>
> https://github.com/torvalds/linux/commit/b349acc76b7f65400b85abd09a5379ddd6fa5a97
>
> Looks like it's not for the sake of correctness, but only as a
> best-effort optimization (reducing page scanning). If it doesn't bring
> any benefit (i.e due to the newly allocated page still on the cpu
> batch), then we can consider removing it. After all, if you're right
> and it's not really doing anything here - why bother. Perhaps we can
> replace this with some other mechanism to avoid it being scanned for
> reclaim.

For instance, we can grab the local lock, look for the folio in the
add batch and take the folio off it, then add it to the rotate batch
instead? Not sure if this is doable within folio_rotate_reclaimable(),
or you'll have to manually perform this yourself (and remove the
PG_reclaim flag set here so that folio_end_writeback() doesn't try to
handle it).

There is still some overhead with this, but at least we don't have to
*drain everything* (which looks like what's lru_add_drain() ->
lru_add_drain_cpu() is doing). The latter sounds expensive and
unnecessary, whereas this is just one element addition and one element
removal - and if IIUC the size of the per-cpu add batch is capped at
15, so lookup + removal (if possible) shouldn't be too expensive?

Just throwing ideas out there :)

>
> I would cc Weijie as well, as he is the original author of this.
Yosry Ahmed Jan. 8, 2024, 11:12 p.m. UTC | #12
On Sun, Jan 7, 2024 at 1:59 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sun, Jan 7, 2024 at 1:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> > >
> > > > > There is another option here, which is not to move the page to the
> > > > > tail of the inactive
> > > > > list after end_writeback and delete the following code in
> > > > > zswap_writeback_entry(),
> > > > > which did not work properly. But the pages will not be released first.
> > > > >
> > > > > /* move it to the tail of the inactive list after end_writeback */
> > > > > SetPageReclaim(page);
> >
> >
> > Ok, so I took a look at the patch that originally introduced this
> > piece of logic:
> >
> > https://github.com/torvalds/linux/commit/b349acc76b7f65400b85abd09a5379ddd6fa5a97
> >
> > Looks like it's not for the sake of correctness, but only as a
> > best-effort optimization (reducing page scanning). If it doesn't bring
> > any benefit (i.e due to the newly allocated page still on the cpu
> > batch), then we can consider removing it. After all, if you're right
> > and it's not really doing anything here - why bother. Perhaps we can
> > replace this with some other mechanism to avoid it being scanned for
> > reclaim.
>
> For instance, we can grab the local lock, look for the folio in the
> add batch and take the folio off it, then add it to the rotate batch
> instead? Not sure if this is doable within folio_rotate_reclaimable(),
> or you'll have to manually perform this yourself (and remove the
> PG_reclaim flag set here so that folio_end_writeback() doesn't try to
> handle it).
>
> There is still some overhead with this, but at least we don't have to
> *drain everything* (which looks like what's lru_add_drain() ->
> lru_add_drain_cpu() is doing). The latter sounds expensive and
> unnecessary, whereas this is just one element addition and one element
> removal - and if IIUC the size of the per-cpu add batch is capped at
> 15, so lookup + removal (if possible) shouldn't be too expensive?
>
> Just throwing ideas out there :)

Sorry for being late to the party. It seems to me that all of this
hassle can be avoided if lru_add_fn() did the right thing in this case
and added the folio to the tail of the lru directly. I am no expert in
how the page flags work here, but it seems like we can do something
like this in lru_add_fn():

if (folio_test_reclaim(folio))
    lruvec_add_folio_tail(lruvec, folio);
else
    lruvec_add_folio(lruvec, folio);

I think the main problem with this is that PG_reclaim is an alias to
PG_readahead, so readahead pages will also go to the tail of the lru,
which is probably not good.
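
(For reference, the tail helper already exists; lruvec_add_folio_tail() in
include/linux/mm_inline.h is roughly the following - paraphrased, so details
may differ:

static __always_inline
void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
{
        enum lru_list lru = folio_lru_list(folio);

        update_lru_size(lruvec, lru, folio_zonenum(folio),
                        folio_nr_pages(folio));
        /* Same bookkeeping as lruvec_add_folio(), but adds at the tail. */
        list_add_tail(&folio->lru, &lruvec->lists[lru]);
}

so lru_add_fn() would only need to pick between the two.)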

A more intrusive alternative is to introduce a folio_lru_add_tail()
variant that always adds pages to the tail, and optionally call that
from __read_swap_cache_async() instead of folio_lru_add() based on a
new boolean argument. The zswap code can set that boolean argument
during writeback to make sure newly allocated folios are always added
to the tail of the lru.
Zhongkun He Jan. 9, 2024, 2:43 a.m. UTC | #13
On Mon, Jan 8, 2024 at 6:00 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sun, Jan 7, 2024 at 1:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Fri, Jan 5, 2024 at 6:10 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> > >
> > > > > There is another option here, which is not to move the page to the
> > > > > tail of the inactive
> > > > > list after end_writeback and delete the following code in
> > > > > zswap_writeback_entry(),
> > > > > which did not work properly. But the pages will not be released first.
> > > > >
> > > > > /* move it to the tail of the inactive list after end_writeback */
> > > > > SetPageReclaim(page);
> >
> >
> > Ok, so I took a look at the patch that originally introduced this
> > piece of logic:
> >
> > https://github.com/torvalds/linux/commit/b349acc76b7f65400b85abd09a5379ddd6fa5a97
> >
> > Looks like it's not for the sake of correctness, but only as a
> > best-effort optimization (reducing page scanning). If it doesn't bring
> > any benefit (i.e due to the newly allocated page still on the cpu
> > batch), then we can consider removing it. After all, if you're right
> > and it's not really doing anything here - why bother. Perhaps we can
> > replace this with some other mechanism to avoid it being scanned for
> > reclaim.
>
> For instance, we can grab the local lock, look for the folio in the
> add batch and take the folio off it, then add it to the rotate batch
> instead? Not sure if this is doable within folio_rotate_reclaimable(),
> or you'll have to manually perform this yourself (and remove the
> PG_reclaim flag set here so that folio_end_writeback() doesn't try to
> handle it).
>
> There is still some overhead with this, but at least we don't have to
> *drain everything* (which looks like what's lru_add_drain() ->
> lru_add_drain_cpu() is doing). The latter sounds expensive and
> unnecessary, whereas this is just one element addition and one element
> removal - and if IIUC the size of the per-cpu add batch is capped at
> 15, so lookup + removal (if possible) shouldn't be too expensive?
>
> Just throwing ideas out there :)

Thanks for your time, Nhat.
I will try other ways to solve this problem.

>
> >
> > I would cc Weijie as well, as he is the original author of this.
Zhongkun He Jan. 9, 2024, 3:13 a.m. UTC | #14
Hi Yosry, glad to hear from you and happy new year!

> Sorry for being late to the party. It seems to me that all of this
> hassle can be avoided if lru_add_fn() did the right thing in this case
> and added the folio to the tail of the lru directly. I am no expert in
> how the page flags work here, but it seems like we can do something
> like this in lru_add_fn():
>
> if (folio_test_reclaim(folio))
>     lruvec_add_folio_tail(lruvec, folio);
> else
>     lruvec_add_folio(lruvec, folio);
>
> I think the main problem with this is that PG_reclaim is an alias to
> PG_readahead, so readahead pages will also go to the tail of the lru,
> which is probably not good.

Agree with you, I will try it.

>
> A more intrusive alternative is to introduce a folio_lru_add_tail()
> variant that always adds pages to the tail, and optionally call that
> from __read_swap_cache_async() instead of folio_lru_add() based on a
> new boolean argument. The zswap code can set that boolean argument
> during writeback to make sure newly allocated folios are always added
> to the tail of the lru.

I have the same idea and also find it intrusive. I think the first solution
is very good and I will try it. If it works, I will send the next version.

 Thank you very much.
Yosry Ahmed Jan. 9, 2024, 4:29 p.m. UTC | #15
On Mon, Jan 8, 2024 at 7:13 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
>
> Hi Yosry, glad to hear from you and happy new year!
>
> > Sorry for being late to the party. It seems to me that all of this
> > hassle can be avoided if lru_add_fn() did the right thing in this case
> > and added the folio to the tail of the lru directly. I am no expert in
> > how the page flags work here, but it seems like we can do something
> > like this in lru_add_fn():
> >
> > if (folio_test_reclaim(folio))
> >     lruvec_add_folio_tail(lruvec, folio);
> > else
> >     lruvec_add_folio(lruvec, folio);
> >
> > I think the main problem with this is that PG_reclaim is an alias to
> > PG_readahead, so readahead pages will also go to the tail of the lru,
> > which is probably not good.
>
> Agree with you, I will try it.

+Matthew Wilcox

I think we need to figure out if it's okay to do this first, because
it will affect pages with PG_readahead as well.

>
> >
> > A more intrusive alternative is to introduce a folio_lru_add_tail()
> > variant that always adds pages to the tail, and optionally call that
> > from __read_swap_cache_async() instead of folio_lru_add() based on a
> > new boolean argument. The zswap code can set that boolean argument
> > during writeback to make sure newly allocated folios are always added
> > to the tail of the lru.
>
> I have the same idea and also find it intrusive. I think the first solution
> is very good and I will try it. If it works, I will send the next version.

One way to avoid introducing folio_lru_add_tail() and plumbing a
boolean from zswap is to have a per-task context (similar to
memalloc_nofs_save()) that makes folio_add_lru() automatically add
folios to the tail of the LRU. I am not sure if this is an acceptable
approach though in terms of per-task flags and such.
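
(A minimal sketch of that idea, modeled on memalloc_nofs_save(); the
PF_LRU_ADD_TAIL flag and the helper names are invented here purely for
illustration:

static inline unsigned int lru_add_tail_save(void)
{
        /* Hypothetical per-task flag; it does not exist in the kernel today. */
        unsigned int flags = current->flags & PF_LRU_ADD_TAIL;
        current->flags |= PF_LRU_ADD_TAIL;
        return flags;
}

static inline void lru_add_tail_restore(unsigned int flags)
{
        current->flags = (current->flags & ~PF_LRU_ADD_TAIL) | flags;
}

folio_add_lru() (or lru_add_fn()) would then check
current->flags & PF_LRU_ADD_TAIL and add the folio at the tail, and zswap
writeback would wrap __read_swap_cache_async() in the save/restore pair.)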
Nhat Pham Jan. 10, 2024, 1:32 a.m. UTC | #16
On Tue, Jan 9, 2024 at 8:30 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Jan 8, 2024 at 7:13 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> >
> > Hi Yosry, glad to hear from you and happy new year!
> >
> > > Sorry for being late to the party. It seems to me that all of this
> > > hassle can be avoided if lru_add_fn() did the right thing in this case
> > > and added the folio to the tail of the lru directly. I am no expert in
> > > how the page flags work here, but it seems like we can do something
> > > like this in lru_add_fn():
> > >
> > > if (folio_test_reclaim(folio))
> > >     lruvec_add_folio_tail(lruvec, folio);
> > > else
> > >     lruvec_add_folio(lruvec, folio);
> > >
> > > I think the main problem with this is that PG_reclaim is an alias to
> > > PG_readahead, so readahead pages will also go to the tail of the lru,
> > > which is probably not good.

This sounds dangerous. This is going to introduce a rather large
unexpected side effect - we're changing the readahead behavior in a
seemingly small zswap optimization. In fact, I'd argue that if we do
this, the readahead behavior change will be the "main effect", and the
zswap-side change would be a "happy consequence". We should run a lot
of benchmarking and document the change extensively if we pursue this
route.

> >
> > Agree with you, I will try it.
>
> +Matthew Wilcox
>
> I think we need to figure out if it's okay to do this first, because
> it will affect pages with PG_readahead as well.
>
> >
> > >
> > > A more intrusive alternative is to introduce a folio_lru_add_tail()
> > > variant that always adds pages to the tail, and optionally call that
> > > from __read_swap_cache_async() instead of folio_lru_add() based on a
> > > new boolean argument. The zswap code can set that boolean argument
> > > during writeback to make sure newly allocated folios are always added
> > > to the tail of the lru.

Unless some page flag/readahead expert can confirm that the first
option is safe, my vote is on this option. I mean, it's fairly minimal
codewise, no? Just a bunch of plumbing. We can also keep the other
call sites intact if we just rename the old versions - something along
the line of:

__read_swap_cache_async_head(..., bool add_to_lru_head)
{
        ...
        if (add_to_lru_head)
                folio_add_lru(folio);
        else
                folio_add_lru_tail(folio);
}

__read_swap_cache_async(...)
{
        return __read_swap_cache_async_head(..., true);
}

A bit boilerplate? Sure. But this seems safer, and I doubt it's *that*
much more work.

> >
> > I have the same idea and also find it intrusive. I think the first solution
> > is very good and I will try it. If it works, I will send the next version.
>
> One way to avoid introducing folio_lru_add_tail() and blumping a
> boolean from zswap is to have a per-task context (similar to
> memalloc_nofs_save()), that makes folio_add_lru() automatically add
> folios to the tail of the LRU. I am not sure if this is an acceptable
> approach though in terms of per-task flags and such.

This seems a bit hacky and obscure, but maybe it could work.
Zhongkun He Jan. 11, 2024, 2:57 a.m. UTC | #17
On Wed, Jan 10, 2024 at 12:30 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Jan 8, 2024 at 7:13 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> >
> > Hi Yosry, glad to hear from you and happy new year!
> >
> > > Sorry for being late to the party. It seems to me that all of this
> > > hassle can be avoided if lru_add_fn() did the right thing in this case
> > > and added the folio to the tail of the lru directly. I am no expert in
> > > how the page flags work here, but it seems like we can do something
> > > like this in lru_add_fn():
> > >
> > > if (folio_test_reclaim(folio))
> > >     lruvec_add_folio_tail(lruvec, folio);
> > > else
> > >     lruvec_add_folio(lruvec, folio);
> > >
> > > I think the main problem with this is that PG_reclaim is an alias to
> > > PG_readahead, so readahead pages will also go to the tail of the lru,
> > > which is probably not good.
> >
> > Agree with you, I will try it.
>
> +Matthew Wilcox
>
> I think we need to figure out if it's okay to do this first, because
> it will affect pages with PG_readahead as well.

Yes, I've tested it and there is one more thing
that needs to be modified.

+       if (folio_test_reclaim(folio))
+               lruvec_add_folio_tail(lruvec, folio);
+       else
+               lruvec_add_folio(lruvec, folio);

@@ -1583,10 +1583,8 @@ void folio_end_writeback(struct folio *folio)
         * a gain to justify taking an atomic operation penalty at the
         * end of every folio writeback.
         */
-       if (folio_test_reclaim(folio)) {
+       if (folio_test_reclaim(folio) && folio_rotate_reclaimable(folio))
                folio_clear_reclaim(folio);
-               folio_rotate_reclaimable(folio);
-       }

-void folio_rotate_reclaimable(struct folio *folio)
+bool folio_rotate_reclaimable(struct folio *folio)
 {
        if (success)
+               return true;
        }
+       return false;
 }

>
> >
> > >
> > > A more intrusive alternative is to introduce a folio_lru_add_tail()
> > > variant that always adds pages to the tail, and optionally call that
> > > from __read_swap_cache_async() instead of folio_lru_add() based on a
> > > new boolean argument. The zswap code can set that boolean argument
> > > during writeback to make sure newly allocated folios are always added
> > > to the tail of the lru.
> >
> > I have the same idea and also find it intrusive. I think the first solution
> > is very good and I will try it. If it works, I will send the next version.
>
> One way to avoid introducing folio_lru_add_tail() and blumping a
> boolean from zswap is to have a per-task context (similar to
> memalloc_nofs_save()), that makes folio_add_lru() automatically add
> folios to the tail of the LRU. I am not sure if this is an acceptable
> approach though in terms of per-task flags and such.

I got it.
Zhongkun He Jan. 11, 2024, 3:48 a.m. UTC | #18
>
> This sounds dangerous. This is going to introduce a rather large
> unexpected side effect - we're changing the readahead behavior in a
> seemingly small zswap optimization. In fact, I'd argue that if we do
> this, the readahead behavior change will be the "main effect", and the
> zswap-side change would be a "happy consequence". We should run a lot
> of benchmarking and document the change extensively if we pursue this
> route.
>

I agree about the unexpected side effect, and here I need to clarify the
original intention of this patch. Please see the memory offloading steps
below.

memory      zswap (reclaim)         memory+swap (writeback)
1G          0.5G                    1G (tmp memory) + 1G (swap)

If the decompressed memory cannot be released in time, zswap's writeback
has significant side effects (the pages are mostly cold). On the one hand,
the memory footprint has not been reduced, but has increased (from
0.5G -> 1G). At the same time, the pages are not put at the tail of the
LRU. When memory is tight, other pages will be squeezed out and released
early instead.
With this patch, we can put the tmp pages at the tail and reclaim them in
time when memory is insufficient or actively reclaimed.
So I think this patch makes sense, and I hope the issue can be fixed with a
suitable approach.

>
> Unless some page flag/readahead expert can confirm that the first
> option is safe, my vote is on this option. I mean, it's fairly minimal
> codewise, no? Just a bunch of plumbing. We can also keep the other
> call sites intact if we just rename the old versions - something along
> the line of:
>
> __read_swap_cache_async_head(..., bool add_to_lru_head)
> {
> ...
> if (add_to_lru_head)
>   folio_add_lru(folio)
> else
>   folio_add_lru_tail(folio);
> }
>
> __read_swap_cache_async(...)
> {
>    return __read_swap_cache_async_tail(..., true);
> }
>
> A bit boilerplate? Sure. But this seems safer, and I doubt it's *that*
> much more work.
>

Yes, agree. I will try it again.

> > >
> > > I have the same idea and also find it intrusive. I think the first solution
> > > is very good and I will try it. If it works, I will send the next version.
> >
> > One way to avoid introducing folio_lru_add_tail() and blumping a
> > boolean from zswap is to have a per-task context (similar to
> > memalloc_nofs_save()), that makes folio_add_lru() automatically add
> > folios to the tail of the LRU. I am not sure if this is an acceptable
> > approach though in terms of per-task flags and such.
>
> This seems a bit hacky and obscure, but maybe it could work.
Yosry Ahmed Jan. 11, 2024, 11:27 a.m. UTC | #19
On Wed, Jan 10, 2024 at 7:49 PM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> >
> > This sounds dangerous. This is going to introduce a rather large
> > unexpected side effect - we're changing the readahead behavior in a
> > seemingly small zswap optimization. In fact, I'd argue that if we do
> > this, the readahead behavior change will be the "main effect", and the
> > zswap-side change would be a "happy consequence". We should run a lot
> > of benchmarking and document the change extensively if we pursue this
> > route.
> >
>
> I agree with the unexpected side effect,  and here I need
> to clarify the original intention of this patch.Please see the memory
> offloading steps below.
>
>
> memory      zswap(reclaim)          memory+swap (writeback)
> 1G                 0.5G                        1G(tmp memory) + 1G(swap)
>
> If the decompressed memory cannot be released in time,
> zswap's writeback has great side effects(mostly clod pages).
> On the one hand, the memory space has not been reduced,
> but has increased (from 0.5G->1G).
> At the same time, it is not put the pages to the tail of the lru.
> When the memory is insufficient, other pages will be squeezed out
> and released early.
> With this patch, we can put the tmp pages to the tail and reclaim it
> in time when the memory is insufficient or actively reclaimed.
> So I think this patch makes sense and hope it can be fixed with a
> suitable approaches.
>
> >
> > Unless some page flag/readahead expert can confirm that the first
> > option is safe, my vote is on this option. I mean, it's fairly minimal
> > codewise, no? Just a bunch of plumbing. We can also keep the other
> > call sites intact if we just rename the old versions - something along
> > the line of:
> >
> > __read_swap_cache_async_head(..., bool add_to_lru_head)
> > {
> > ...
> > if (add_to_lru_head)
> >   folio_add_lru(folio)
> > else
> >   folio_add_lru_tail(folio);
> > }
> >
> > __read_swap_cache_async(...)
> > {
> >    return __read_swap_cache_async_tail(..., true);
> > }
> >
> > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that*
> > much more work.
> >
>
> Yes, agree. I will try it again.

I agree with Nhat here. Unless someone with enough readahead/page
flags knowledge says putting PG_readahead pages at the tail of the LRU
is okay (doubtful), I think we should opt for introducing a
folio_add_lru_tail() as I initially suggested.
Nhat Pham Jan. 11, 2024, 7:25 p.m. UTC | #20
On Wed, Jan 10, 2024 at 7:49 PM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> >
> > This sounds dangerous. This is going to introduce a rather large
> > unexpected side effect - we're changing the readahead behavior in a
> > seemingly small zswap optimization. In fact, I'd argue that if we do
> > this, the readahead behavior change will be the "main effect", and the
> > zswap-side change would be a "happy consequence". We should run a lot
> > of benchmarking and document the change extensively if we pursue this
> > route.
> >
>
> I agree with the unexpected side effect,  and here I need
> to clarify the original intention of this patch.Please see the memory
> offloading steps below.
>
>
> memory      zswap(reclaim)          memory+swap (writeback)
> 1G                 0.5G                        1G(tmp memory) + 1G(swap)
>
> If the decompressed memory cannot be released in time,
> zswap's writeback has great side effects(mostly clod pages).
> On the one hand, the memory space has not been reduced,
> but has increased (from 0.5G->1G).
> At the same time, the pages are not moved to the tail of the LRU.
> When the memory is insufficient, other pages will be squeezed out
> and released early.
> With this patch, we can move the temporary pages to the tail and reclaim
> them in time when memory is insufficient or actively reclaimed.
> So I think this patch makes sense, and I hope it can be fixed with a
> suitable approach.

Makes sense to me. IIUC, that's the original intention behind calling
SetPageReclaim() - unfortunately that doesn't work :) And IIRC, your
original attempt shows reduction in swap usage (albeit at the cost of
performance regression), which means we're onto something. I believe
that the folio_add_lru_tail() approach will work :)

Please include a version of the clarification paragraph above in your
later version to explain the goal of the optimization, along with
suitable benchmark numbers to show the effect (such as minimal change
in performance, and reduction in some metrics). Maybe include the link
to the original patch that introduces SetPageReclaim() too, to show
the motivation behind all of this :) It'd be nice to have all the
contexts readily available, in case we need to revisit this in the
future (as was the case with the SetPageReclaim() here).

>
> >
> > Unless some page flag/readahead expert can confirm that the first
> > option is safe, my vote is on this option. I mean, it's fairly minimal
> > codewise, no? Just a bunch of plumbing. We can also keep the other
> > call sites intact if we just rename the old versions - something along
> > the line of:
> >
> > __read_swap_cache_async_head(..., bool add_to_lru_head)
> > {
> > ...
> > if (add_to_lru_head)
> >   folio_add_lru(folio)
> > else
> >   folio_add_lru_tail(folio);
> > }
> >
> > __read_swap_cache_async(...)
> > {
> >    return __read_swap_cache_async_tail(..., true);
> > }
> >
> > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that*
> > much more work.
> >
>
> Yes, agree. I will try it again.

Look forward to seeing it! Thanks for your patience and for working on this.
Zhongkun He Jan. 12, 2024, 7:08 a.m. UTC | #21
On Fri, Jan 12, 2024 at 3:25 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Wed, Jan 10, 2024 at 7:49 PM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
> >
> > >
> > > This sounds dangerous. This is going to introduce a rather large
> > > unexpected side effect - we're changing the readahead behavior in a
> > > seemingly small zswap optimization. In fact, I'd argue that if we do
> > > this, the readahead behavior change will be the "main effect", and the
> > > zswap-side change would be a "happy consequence". We should run a lot
> > > of benchmarking and document the change extensively if we pursue this
> > > route.
> > >
> >
> > I agree about the unexpected side effect, and here I need
> > to clarify the original intention of this patch. Please see the memory
> > offloading steps below.
> >
> >
> > memory      zswap(reclaim)          memory+swap (writeback)
> > 1G                 0.5G                        1G(tmp memory) + 1G(swap)
> >
> > If the decompressed memory cannot be released in time,
> > zswap's writeback has significant side effects (mostly cold pages).
> > On the one hand, the memory space has not been reduced,
> > but has increased (from 0.5G->1G).
> > At the same time, the pages are not moved to the tail of the LRU.
> > When the memory is insufficient, other pages will be squeezed out
> > and released early.
> > With this patch, we can move the temporary pages to the tail and reclaim
> > them in time when memory is insufficient or actively reclaimed.
> > So I think this patch makes sense, and I hope it can be fixed with a
> > suitable approach.
>
> Makes sense to me. IIUC, that's the original intention behind calling
> SetPageReclaim() - unfortunately that doesn't work :) And IIRC, your
> original attempt shows reduction in swap usage (albeit at the cost of
> performance regression), which means we're onto something. I believe
> that the folio_add_lru_tail() approach will work :)
>
> Please include a version of the clarification paragraph above in your
> later version to explain the goal of the optimization, along with
> suitable benchmark numbers to show the effect (such as minimal change
> in performance, and reduction in some metrics). Maybe include the link
> to the original patch that introduces SetPageReclaim() too, to show
> the motivation behind all of this :) It'd be nice to have all the
> contexts readily available, in case we need to revisit this in the
> future (as was the case with the SetPageReclaim() here).
>

OK.

> >
> > >
> > > Unless some page flag/readahead expert can confirm that the first
> > > option is safe, my vote is on this option. I mean, it's fairly minimal
> > > codewise, no? Just a bunch of plumbing. We can also keep the other
> > > call sites intact if we just rename the old versions - something along
> > > the line of:
> > >
> > > __read_swap_cache_async_head(..., bool add_to_lru_head)
> > > {
> > > ...
> > > if (add_to_lru_head)
> > >   folio_add_lru(folio)
> > > else
> > >   folio_add_lru_tail(folio);
> > > }
> > >
> > > __read_swap_cache_async(...)
> > > {
> > >    return __read_swap_cache_async_tail(..., true);
> > > }
> > >
> > > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that*
> > > much more work.
> > >
> >
> > Yes, agree. I will try it again.
>
> Look forward to seeing it! Thanks for your patience and for working on this.

Thanks for your time.
Zhongkun He Jan. 16, 2024, 1:40 p.m. UTC | #22
> > > >
> > > > Unless some page flag/readahead expert can confirm that the first
> > > > option is safe, my vote is on this option. I mean, it's fairly minimal
> > > > codewise, no? Just a bunch of plumbing. We can also keep the other
> > > > call sites intact if we just rename the old versions - something along
> > > > the line of:
> > > >
> > > > __read_swap_cache_async_head(..., bool add_to_lru_head)
> > > > {
> > > > ...
> > > > if (add_to_lru_head)
> > > >   folio_add_lru(folio)
> > > > else
> > > >   folio_add_lru_tail(folio);
> > > > }
> > > >
> > > > __read_swap_cache_async(...)
> > > > {
> > > >    return __read_swap_cache_async_head(..., true);
> > > > }
> > > >
> > > > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that*
> > > > much more work.
> > > >
> > >
> > > Yes, agree. I will try it again.
> >
> > Look forward to seeing it! Thanks for your patience and for working on this.

Please forgive me for adding additional information about this patch.

I have finished the implementation that introduces folio_add_lru_tail(), but
there are several problems:
1) A new page can be moved onto the LRU only by lru_add_fn, so
    folio_add_lru_tail() cannot add pages to the LRU because of the following
    check in folio_batch_move_lru(), which was added by Alex Shi to
    serialize memcg changes in pagevec_lru_move_fn() [1].

/* block memcg migration while the folio moves between lru */
        if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
            continue;
To achieve the goal, we would need a new move_fn like lru_add_fn that handles
folios without the LRU flag for folio_add_lru_tail(), plus a change like:
+               if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new &&
+                       !folio_test_clear_lru(folio))

2) __read_swap_cache_async() already has six parameters, so there is no room
to add a new one, add_to_lru_head.

So it seems a bit hacky just for a special case for the reasons above.
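
For point 1), a minimal sketch of what the new helpers could look like,
modeled on folio_add_lru() in mm/swap.c; lru_move_tail_fn_new and
cpu_fbatches.lru_add_tail are hypothetical names here, and the unevictable
handling done by lru_add_fn() is omitted:

static void lru_move_tail_fn_new(struct lruvec *lruvec, struct folio *folio)
{
	/* Like lru_add_fn(), but queue the folio at the tail of its list. */
	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

	lruvec_add_folio_tail(lruvec, folio);
	folio_set_lru(folio);
}

void folio_add_lru_tail(struct folio *folio)
{
	struct folio_batch *fbatch;

	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

	/* Mirror folio_add_lru(), but use a dedicated tail-insertion batch. */
	folio_get(folio);
	local_lock(&cpu_fbatches.lock);
	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add_tail);
	folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn_new);
	local_unlock(&cpu_fbatches.lock);
}

together with the extra check in folio_batch_move_lru() shown in the diff
above.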

Back to the beginning, lru_add_drain() is the simplest option; calling it
after __read_swap_cache_async() is already common. Please see
swap_cluster_readahead() and swap_vma_readahead(), where of course it is
batched.
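
Roughly, both of those paths finish like this, so the drain is amortized over
the whole readahead batch (quoted approximately from mm/swap_state.c; the
details vary by kernel version):

	blk_finish_plug(&plug);
	swap_read_unplug(splug);
	lru_add_drain();	/* Push any new pages onto the LRU now */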

Or we could leave this problem alone until we can write back zswap
in batches.

Thanks again.
Yosry Ahmed Jan. 16, 2024, 8:28 p.m. UTC | #23
On Tue, Jan 16, 2024 at 5:40 AM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> > > > >
> > > > > Unless some page flag/readahead expert can confirm that the first
> > > > > option is safe, my vote is on this option. I mean, it's fairly minimal
> > > > > codewise, no? Just a bunch of plumbing. We can also keep the other
> > > > > call sites intact if we just rename the old versions - something along
> > > > > the line of:
> > > > >
> > > > > __read_swap_cache_async_head(..., bool add_to_lru_head)
> > > > > {
> > > > > ...
> > > > > if (add_to_lru_head)
> > > > >   folio_add_lru(folio)
> > > > > else
> > > > >   folio_add_lru_tail(folio);
> > > > > }
> > > > >
> > > > > __read_swap_cache_async(...)
> > > > > {
> > > > >    return __read_swap_cache_async_head(..., true);
> > > > > }
> > > > >
> > > > > A bit boilerplate? Sure. But this seems safer, and I doubt it's *that*
> > > > > much more work.
> > > > >
> > > >
> > > > Yes, agree. I will try it again.
> > >
> > > Look forward to seeing it! Thanks for your patience and for working on this.
>
> Please forgive me for adding additional information about this patch.
>
> I have finished the implementation that introduces folio_add_lru_tail(), but
> there are several problems:
> 1) A new page can be moved onto the LRU only by lru_add_fn, so
>     folio_add_lru_tail() cannot add pages to the LRU because of the following
>     check in folio_batch_move_lru(), which was added by Alex Shi to
>     serialize memcg changes in pagevec_lru_move_fn() [1].
>
> /* block memcg migration while the folio moves between lru */
>         if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
>             continue;
> To achieve the goal, we would need a new move_fn like lru_add_fn that handles
> folios without the LRU flag for folio_add_lru_tail(), plus a change like:
> +               if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new &&
> +                       !folio_test_clear_lru(folio))
>
> 2) __read_swap_cache_async() already has six parameters, so there is no room
> to add a new one, add_to_lru_head.
>
> So it seems a bit hacky just for a special case for the reasons above.

It's a lot of plumbing for sure. Adding a flag to current task_struct
is a less-noisy yet-still-hacky solution. I am not saying we should do
it, but it's an option. I am not sure how many task flags we have to
spare.

>
> Back to the beginning, lru_add_drain() is the simplest option; calling it
> after __read_swap_cache_async() is already common. Please see
> swap_cluster_readahead() and swap_vma_readahead(), where of course it is
> batched.
>
> Or we could leave this problem alone until we can write back zswap
> in batches.

Calling lru_add_drain() for every written-back page is overkill
imo. If we have writeback batching at some point, it may make more
sense then.

Adding Michal Hocko, who was recently complaining [1] about lru_add_drain()
being called unnecessarily elsewhere.

[1]https://lore.kernel.org/linux-mm/ZaD9BNtXZfY2UtVI@tiehlicka/
Matthew Wilcox (Oracle) Jan. 16, 2024, 9:03 p.m. UTC | #24
On Tue, Jan 16, 2024 at 09:40:05PM +0800, Zhongkun He wrote:
> 2) __read_swap_cache_async() already has six parameters, so there is no room
> to add a new one, add_to_lru_head.

That's easy enough.  Define a new set of flags, make one of them the
equivalent of skip_if_exists.  Something like:

typedef unsigned int __bitwise read_swap_t;

#define READ_SWAP_SKIP_EXISTING	((__force read_swap_t)0x00000001)
#define READ_SWAP_ADD_TAIL	((__force read_swap_t)0x00000002)

There's only six callers of __read_swap_cache_async() to convert, so not
really a big deal.
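
A rough sketch of that plumbing; the prototype and the zswap call site below
are assumptions based on the six-parameter signature mentioned above, and
folio_add_lru_tail() is the hypothetical helper discussed earlier in the
thread:

typedef unsigned int __bitwise read_swap_t;

#define READ_SWAP_SKIP_EXISTING	((__force read_swap_t)0x00000001)
#define READ_SWAP_ADD_TAIL	((__force read_swap_t)0x00000002)

/* skip_if_exists is replaced by the flags word */
struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
		struct mempolicy *mpol, pgoff_t ilx,
		bool *new_page_allocated, read_swap_t flags);

/* inside __read_swap_cache_async(), where folio_add_lru() is called today */
	if (flags & READ_SWAP_ADD_TAIL)
		folio_add_lru_tail(folio);
	else
		folio_add_lru(folio);

/* zswap writeback would then ask for both behaviours */
	page = __read_swap_cache_async(swpentry, GFP_KERNEL, NULL, 0,
				       &page_was_allocated,
				       READ_SWAP_SKIP_EXISTING |
				       READ_SWAP_ADD_TAIL);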
Zhongkun He Jan. 17, 2024, 9:52 a.m. UTC | #25
> >
> > Please forgive me for adding additional information about this patch.
> >
> > I have finished the implementation that introduces folio_add_lru_tail(), but
> > there are several problems:
> > 1) A new page can be moved onto the LRU only by lru_add_fn, so
> >     folio_add_lru_tail() cannot add pages to the LRU because of the following
> >     check in folio_batch_move_lru(), which was added by Alex Shi to
> >     serialize memcg changes in pagevec_lru_move_fn() [1].
> >
> > /* block memcg migration while the folio moves between lru */
> >         if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
> >             continue;
> > To achieve the goal, we would need a new move_fn like lru_add_fn that handles
> > folios without the LRU flag for folio_add_lru_tail(), plus a change like:
> > +               if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new &&
> > +                       !folio_test_clear_lru(folio))
> >
> > 2) __read_swap_cache_async() already has six parameters, so there is no room
> > to add a new one, add_to_lru_head.
> >
> > So it seems a bit hacky just for a special case for the reasons above.
>
> It's a lot of plumbing for sure. Adding a flag to current task_struct
> is a less-noisy yet-still-hacky solution. I am not saying we should do
> it, but it's an option. I am not sure how many task flags we have to
> spare.

Got it.
>
> >
> > Back to the beginning, lru_add_drain() is the simplest option; calling it
> > after __read_swap_cache_async() is already common. Please see
> > swap_cluster_readahead() and swap_vma_readahead(), where of course it is
> > batched.
> >
> > Or we could leave this problem alone until we can write back zswap
> > in batches.
>
> Calling lru_add_drain() for every written-back page is overkill
> imo. If we have writeback batching at some point, it may make more
> sense then.

Agree.

>
> Adding Michal Hocko, who was recently complaining [1] about lru_add_drain()
> being called unnecessarily elsewhere.

Got it, thanks.
>
> [1]https://lore.kernel.org/linux-mm/ZaD9BNtXZfY2UtVI@tiehlicka/
Zhongkun He Jan. 17, 2024, 10:41 a.m. UTC | #26
On Wed, Jan 17, 2024 at 5:04 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 16, 2024 at 09:40:05PM +0800, Zhongkun He wrote:
> > 2) __read_swap_cache_async() already has six parameters, so there is no room
> > to add a new one, add_to_lru_head.
>
> That's easy enough.  Define a new set of flags, make one of them the
> equivalent of skip_if_exists.  Something like:
>
> typedef unsigned int __bitwise read_swap_t;
>
> #define READ_SWAP_SKIP_EXISTING ((__force read_swap_t)0x00000001)
> #define READ_SWAP_ADD_TAIL      ((__force read_swap_t)0x00000002)
>
> There's only six callers of __read_swap_cache_async() to convert, so not
> really a big deal.
>

Yes, thanks for your suggestion.
The major problem is not the parameters, but the need to add three functions
to deal with a special case. Thanks again.
Yosry Ahmed Jan. 17, 2024, 5:53 p.m. UTC | #27
On Wed, Jan 17, 2024 at 1:53 AM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> > >
> > > Please forgive me for adding additional information about this patch.
> > >
> > > I have finished the implementation that introduces folio_add_lru_tail(), but
> > > there are several problems:
> > > 1) A new page can be moved onto the LRU only by lru_add_fn, so
> > >     folio_add_lru_tail() cannot add pages to the LRU because of the following
> > >     check in folio_batch_move_lru(), which was added by Alex Shi to
> > >     serialize memcg changes in pagevec_lru_move_fn() [1].
> > >
> > > /* block memcg migration while the folio moves between lru */
> > >         if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
> > >             continue;
> > > To achieve the goal, we would need a new move_fn like lru_add_fn that handles
> > > folios without the LRU flag for folio_add_lru_tail(), plus a change like:
> > > +               if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new &&
> > > +                       !folio_test_clear_lru(folio))
> > >
> > > 2) __read_swap_cache_async() already has six parameters, so there is no room
> > > to add a new one, add_to_lru_head.
> > >
> > > So it seems a bit hacky just for a special case for the reasons above.
> >
> > It's a lot of plumbing for sure. Adding a flag to current task_struct
> > is a less-noisy yet-still-hacky solution. I am not saying we should do
> > it, but it's an option. I am not sure how many task flags we have to
> > spare.
>
> Got it.

Actually this won't really work. Writeback can be asynchronous, so
there would be no logical place to unset the flag.
Nhat Pham Jan. 17, 2024, 7:29 p.m. UTC | #28
On Wed, Jan 17, 2024 at 1:52 AM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> > >
> > > Please forgive me for adding additional information about this patch.
> > >
> > > I have finished the implementation that introduces folio_add_lru_tail(), but
> > > there are several problems:
> > > 1) A new page can be moved onto the LRU only by lru_add_fn, so
> > >     folio_add_lru_tail() cannot add pages to the LRU because of the following
> > >     check in folio_batch_move_lru(), which was added by Alex Shi to
> > >     serialize memcg changes in pagevec_lru_move_fn() [1].
> > >
> > > /* block memcg migration while the folio moves between lru */
> > >         if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
> > >             continue;
> > > To achieve the goal, we would need a new move_fn like lru_add_fn that handles
> > > folios without the LRU flag for folio_add_lru_tail(), plus a change like:
> > > +               if (move_fn != lru_add_fn && move_fn != lru_move_tail_fn_new &&
> > > +                       !folio_test_clear_lru(folio))

Hmm yeah, I guess it is a bit more plumbing to do. I prefer this
though - not very fond of hacking current's flag just for a small
optimization :) And I'd argue this is the "right" thing to do -
draining the other LRU operation batches just so that we can
successfully perform an add-to-tail seems hacky and wrong to me.

> > >
> > > 2) __read_swap_cache_async() already has six parameters, so there is no room
> > > to add a new one, add_to_lru_head.

Matthew's solution seems fine to me, no? I.e., using a single flag
parameter to encapsulate all boolean arguments.

> > >
> > > So it seems a bit hacky just for a special case for the reasons above.
> >
> > It's a lot of plumbing for sure. Adding a flag to current task_struct
> > is a less-noisy yet-still-hacky solution. I am not saying we should do
> > it, but it's an option. I am not sure how many task flags we have to
> > spare.
>
> Got it.
> >
> > >
> > > Back to the beginning, lru_add_drain() is the simplest option; calling it
> > > after __read_swap_cache_async() is already common. Please see
> > > swap_cluster_readahead() and swap_vma_readahead(), where of course it is
> > > batched.
> > >
> > > Or we could leave this problem alone until we can write back zswap
> > > in batches.
> >
> > Calling lru_add_drain() for every written back page is an overkill
> > imo. If we have writeback batching at some point, it may make more
> > sense then.
>
> Agree.

Agree. lru_add_drain() does quite a bit, and doing it for every
written page makes batching less effective. And as argued above, I
don't think we should do this.

I'm fine with waiting until writeback batching too :) But that will be a
bigger task.


>
> >
> > Adding Michal Hocko was recently complaining [1] about lru_add_drain()
> > being called unnecessarily elsewhere.
>
> Got it, thanks.
> >
> > [1]https://lore.kernel.org/linux-mm/ZaD9BNtXZfY2UtVI@tiehlicka/
diff mbox series

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 083c693602b8..b9b94cbd403c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1139,6 +1139,11 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 	/* move it to the tail of the inactive list after end_writeback */
 	SetPageReclaim(page);
 
+	if (!PageLRU(page)) {
+		/* drain lru cache to help folio_rotate_reclaimable() */
+		lru_add_drain();
+	}
+
 	/* start writeback */
 	__swap_writepage(page, &wbc);
 	put_page(page);