
[RFC] mm: count zeromap read and set for swapout and swapin

Message ID 20241027011959.9226-1-21cnbao@gmail.com (mailing list archive)
State New
Series [RFC] mm: count zeromap read and set for swapout and swapin

Commit Message

Barry Song Oct. 27, 2024, 1:19 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

When the proportion of folios from the zero map is small, missing their
accounting may not significantly impact profiling. However, it's easy
to construct a scenario where this becomes an issue: for example,
allocating 1 GB of memory, writing zeros to it from userspace, calling
MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
and swap-in counts seem to vanish into a black hole, potentially
causing semantic ambiguity.
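
For reference, a minimal userspace sequence that reproduces the scenario
(illustrative only, not part of the patch; it assumes a swap-backed system
and compares pswpin/pswpout in /proc/vmstat before and after):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

#define SIZE (1UL << 30)	/* 1 GB */

int main(void)
{
	/* Anonymous mapping, filled with zeros from userspace */
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 0, SIZE);

	/* Push the whole range out to swap; the zeromap intercepts it */
	if (madvise(p, SIZE, MADV_PAGEOUT))
		perror("madvise");

	/* Touch every page again to swap the range back in */
	volatile char sum = 0;
	for (unsigned long i = 0; i < SIZE; i += 4096)
		sum += p[i];
	(void)sum;

	/* Without this patch, neither pswpout nor pswpin moves. */
	return 0;
}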

We have two ways to address this:

1. Add a separate counter specifically for the zero map.
2. Continue using the current accounting, treating the zero map like
a normal backend. (This aligns with the current behavior of zRAM
when supporting same-page fills at the device level.)

This patch adopts option 2. I'm curious if others have different
opinions, so I'm marking it as RFC.

Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/page_io.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

Comments

Nhat Pham Oct. 27, 2024, 2:45 a.m. UTC | #1
On Sat, Oct 26, 2024 at 6:20 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> When the proportion of folios from the zero map is small, missing their
> accounting may not significantly impact profiling. However, it's easy
> to construct a scenario where this becomes an issue: for example,
> allocating 1 GB of memory, writing zeros to it from userspace, calling
> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
> and swap-in counts seem to vanish into a black hole, potentially
> causing semantic ambiguity.

I agree. It also makes developing around this area more challenging.
I'm working on the swap abstraction, and sometimes I can't tell if I
screwed up somewhere, or if a proportion of these allocated entries go
towards this optimization...

Thanks for taking a stab at fixing this, Barry!

>
> We have two ways to address this:
>
> 1. Add a separate counter specifically for the zero map.
> 2. Continue using the current accounting, treating the zero map like
> a normal backend. (This aligns with the current behavior of zRAM
> when supporting same-page fills at the device level.)

Hmm, my understanding of the pswpout/pswpin counters is that they only
apply to IO done directly to the backend device, no? That's why we
have a separate set of counters for zswap, and do not count them
towards pswp(in|out).

For users who have swap files on physical disks, the performance
difference between reading directly from the swapfile and going
through these optimizations could be really large. I think it makes
sense to have a separate set of counters for zero-mapped pages
(ideally, both at the host level and at the cgroup level?)
Barry Song Oct. 28, 2024, 2:32 a.m. UTC | #2
On Sun, Oct 27, 2024 at 3:45 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sat, Oct 26, 2024 at 6:20 PM Barry Song <21cnbao@gmail.com> wrote:
> > [...]
>
> I agree. It also makes developing around this area more challenging.
> I'm working on the swap abstraction, and sometimes I can't tell if I
> screwed up somewhere, or if a proportion of these allocated entries go
> towards this optimization...
>
> Thanks for taking a stab at fixing this, Barry!
>
> >
> > We have two ways to address this:
> >
> > 1. Add a separate counter specifically for the zero map.
> > 2. Continue using the current accounting, treating the zero map like
> > a normal backend. (This aligns with the current behavior of zRAM
> > when supporting same-page fills at the device level.)
>
> Hmm, my understanding of the pswpout/pswpin counters is that they only
> apply to IO done directly to the backend device, no? That's why we
> have a separate set of counters for zswap, and do not count them
> towards pswp(in|out).
>
> For users who have swap files on physical disks, the performance
> difference between reading directly from the swapfile and going
> through these optimizations could be really large. I think it makes
> sense to have a separate set of counters for zero-mapped pages
> (ideally, both at the host level and at the cgroup level?)

Agree it is better to have a separate counter for zeromap.
Then it raises a question: what is the proper name for it :-)

zeromap_swpin and zeromap_swpout seem too long, and zswpin
and zswpout are already used by zswap.

Thanks
barry
Usama Arif Oct. 28, 2024, 12:23 p.m. UTC | #3
On 28/10/2024 02:32, Barry Song wrote:
> On Sun, Oct 27, 2024 at 3:45 PM Nhat Pham <nphamcs@gmail.com> wrote:
>> [...]
>> Hmm, my understanding of the pswpout/pswpin counters is that they only
>> apply to IO done directly to the backend device, no? That's why we
>> have a separate set of counters for zswap, and do not count them
>> towards pswp(in|out).
>>
>> For users who have swap files on physical disks, the performance
>> difference between reading directly from the swapfile and going
>> through these optimizations could be really large. I think it makes
>> sense to have a separate set of counters for zero-mapped pages
>> (ideally, both at the host level and at the cgroup level?)
> 
> Agree it is better to have a separate counter for zeromap.
> Then it raises a question: what is the proper name for it :-)
>
> zeromap_swpin and zeromap_swpout seem too long, and zswpin
> and zswpout are already used by zswap.
> 
> Thanks
> barry

I wonder if instead of having counters, it might be better to keep track
of the number of zeropages currently stored in zeromap, similar to how
zswap_same_filled_pages did it. It will be more complicated than this
patch, but it would give more insight into the current state of the system.

Joshua (in CC) was going to have a look at that.

Thanks,
Usama
Nhat Pham Oct. 28, 2024, 4:33 p.m. UTC | #4
On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> I wonder if instead of having counters, it might be better to keep track
> of the number of zeropages currently stored in zeromap, similar to how
> zswap_same_filled_pages did it. It will be more complicated than this
> patch, but it would give more insight into the current state of the system.
>
> Joshua (in CC) was going to have a look at that.

I don't think one can substitute for the other.

The "current zeromapped page" counter gives you a breakdown of where
memory resides, whereas the in/out counters explain past performance
based on events that have happened. That's why you have zswap,
zswapped, zswpout, zswpin.
Nhat Pham Oct. 28, 2024, 4:34 p.m. UTC | #5
On Sun, Oct 27, 2024 at 7:32 PM Barry Song <21cnbao@gmail.com> wrote:
>
>
> Agree it is better to have a separate counter for zeromap.
> Then it raises a question: what is the proper name for it :-)
>
> zeromap_swpin and zeromap_swpout seem too long, and zswpin
> and zswpout are already used by zswap.

Hmmmmm. How about zeroswpin? zeroswpout?

zeromap_swpin and zeromap_swpout are not the end of the world, FWIW :)
Usama Arif Oct. 28, 2024, 5 p.m. UTC | #6
On 28/10/2024 16:33, Nhat Pham wrote:
> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> I wonder if instead of having counters, it might be better to keep track
>> of the number of zeropages currently stored in zeromap, similar to how
>> zswap_same_filled_pages did it. It will be more complicated than this
>> patch, but it would give more insight into the current state of the system.
>>
>> Joshua (in CC) was going to have a look at that.
> 
> I don't think one can substitute for the other.

Yes agreed, they have separate uses and provide different information, but
it may be wasteful to have both types of counters? They are counters, so maybe
they don't consume too many resources, but I think we should still think about
it.

If you think from a production context, I feel like the number of pages
currently in the zeromap could be more useful than just the accumulation of
all zeromap loads/stores, which after days is just going to be a very large
number. You can compare the accumulations at different points in time,
but they don't take into account freeing swap slots and swap_reclaim.

If we are open to having both types then that's OK. But it might be good to
have that discussion here.

> 
> The "current zeromapped page" counter gives you a breakdown of where
> memory resides, whereas the in/out counters explain past performance
> based on events that have happened. That's why you have zswap,
> zswapped, zswpout, zswpin.
Yosry Ahmed Oct. 28, 2024, 5:08 p.m. UTC | #7
On Mon, Oct 28, 2024 at 10:00 AM Usama Arif <usamaarif642@gmail.com> wrote:
> [...]
>
> Yes agreed, they have separate uses and provide different information, but
> it may be wasteful to have both types of counters? They are counters, so maybe
> they don't consume too many resources, but I think we should still think about
> it.

Not for or against here, but I would say that statement is debatable
at best for memcg stats :)

Each new counter consumes 2 longs per-memcg per-CPU (see
memcg_vmstats_percpu), about 16 bytes, which is not a lot but it can
quickly add up with a large number of CPUs/memcgs/stats.
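
As a rough illustration with made-up numbers: at ~16 bytes per stat per CPU,
a machine with 256 CPUs and 1,000 memcgs already pays about 4 MB of per-CPU
counter memory for every additional stat.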

Also, when flushing the stats we iterate all of them to propagate
updates from per-CPU counters. This is already a slowpath so adding
one stat is not a big deal, but again because we iterate all stats on
multiple CPUs (and sometimes on each node as well), the overall flush
latency becomes a concern sometimes.

All of that is not to say we shouldn't add more memcg stats, but we
have to be mindful of the resources.
David Hildenbrand Oct. 28, 2024, 5:17 p.m. UTC | #8
On 28.10.24 17:34, Nhat Pham wrote:
> On Sun, Oct 27, 2024 at 7:32 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>>
>> Agree it is better to have a separate counter for zeromap.
>> Then it raises a question: what is the proper name for it :-)
>>
>> zeromap_swpin and zeromap_swpout seem too long, and zswpin
>> and zswpout are already used by zswap.
> 
> Hmmmmm. How about zeroswpin? zeroswpout?

Is this kind of a "swpin_skip" / "swpout_skip"? Because we optimized it
out?
Usama Arif Oct. 28, 2024, 5:19 p.m. UTC | #9
On 28/10/2024 17:08, Yosry Ahmed wrote:
> [...]
>
> Not for or against here, but I would say that statement is debatable
> at best for memcg stats :)
> 
> Each new counter consumes 2 longs per-memcg per-CPU (see
> memcg_vmstats_percpu), about 16 bytes, which is not a lot but it can
> quickly add up with a large number of CPUs/memcgs/stats.
> 
> Also, when flushing the stats we iterate all of them to propagate
> updates from per-CPU counters. This is already a slowpath so adding
> one stat is not a big deal, but again because we iterate all stats on
> multiple CPUs (and sometimes on each node as well), the overall flush
> latency becomes a concern sometimes.
> 
> All of that is not to say we shouldn't add more memcg stats, but we
> have to be mindful of the resources.

Yes agreed! Plus the cost of incrementing similar counters (which, of course, is
also not much).

Not trying to block this patch in any way. I just think it's a good point
to discuss here whether we are OK with both types of counters, and if it's too
wasteful, which one we should have.
Barry Song Oct. 28, 2024, 7:54 p.m. UTC | #10
On Tue, Oct 29, 2024 at 1:20 AM Usama Arif <usamaarif642@gmail.com> wrote:
> [...]
>
> Yes agreed! Plus the cost of incrementing similar counters (which, of course, is
> also not much).
>
> Not trying to block this patch in any way. I just think it's a good point
> to discuss here whether we are OK with both types of counters, and if it's too
> wasteful, which one we should have.

Hi Usama,
my point is that with all three of the counters below:
1. PSWPIN/PSWPOUT
2. ZSWPIN/ZSWPOUT
3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)

Shouldn't we have been able to determine the portion of zeromap
swap indirectly?
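
(As a sketch of what I mean, assuming such counters existed in /proc/vmstat;
"swpout_zero" below is a hypothetical placeholder for whatever name we pick:)

#include <stdio.h>
#include <string.h>

static unsigned long vmstat(const char *name)
{
	char key[64];
	unsigned long v, val = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %lu", key, &v) == 2) {
		if (!strcmp(key, name)) {
			val = v;
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long pswpout = vmstat("pswpout");	/* device swap-out */
	unsigned long zswpout = vmstat("zswpout");	/* zswap swap-out */
	unsigned long zeroout = vmstat("swpout_zero");	/* hypothetical */
	unsigned long total = pswpout + zswpout + zeroout;

	if (total)
		printf("zeromap share of swapout: %.1f%%\n",
		       100.0 * zeroout / total);
	return 0;
}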

Thanks
Barry
Yosry Ahmed Oct. 28, 2024, 7:58 p.m. UTC | #11
On Mon, Oct 28, 2024 at 12:54 PM Barry Song <21cnbao@gmail.com> wrote:
> [...]
>
> Hi Usama,
> my point is that with all three of the counters below:
> 1. PSWPIN/PSWPOUT
> 2. ZSWPIN/ZSWPOUT
> 3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)
>
> Shouldn't we have been able to determine the portion of zeromap
> swap indirectly?

What about swap entries that get freed without being swapped in (e.g.
swapped out anon memory freed, MADV_FREE, shmem truncate, etc)?

Usama Arif Oct. 28, 2024, 8 p.m. UTC | #12
On 28/10/2024 19:54, Barry Song wrote:
> [...]
>
> Hi Usama,
> my point is that with all three of the counters below:
> 1. PSWPIN/PSWPOUT
> 2. ZSWPIN/ZSWPOUT
> 3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)
>
> Shouldn't we have been able to determine the portion of zeromap
> swap indirectly?
> 

Hmm, I might be wrong, but I would have thought no?

What if you swap out a zero folio, but then discard it?
zeromap_swpout would be incremented, but zeromap_swpin would not.

Barry Song Oct. 28, 2024, 8:42 p.m. UTC | #13
On Tue, Oct 29, 2024 at 4:00 AM Usama Arif <usamaarif642@gmail.com> wrote:
> [...]
>
> Hmm, I might be wrong, but I would have thought no?
>
> What if you swap out a zero folio, but then discard it?
> zeromap_swpout would be incremented, but zeromap_swpin would not.

I understand. It looks like we have two issues to tackle:
1. We shouldn't let zeromap swap in or out anything that vanishes into
a black hole
2. We want to find out how much I/O/memory has been saved due to zeromap so far

From my perspective, issue 1 requires a "fix", while issue 2 is more
of an optimization.

I consider issue 1 to be more critical because, after observing a phone
running for some time, I've been able to roughly estimate the portion
zeromap can help save using only PSWPOUT, ZSWPOUT, and SWAPOUT_SKIP, even
without a SWPIN counter. However, I agree that issue 2 still holds
significant value as a separate patch.

Thanks
Barry
Usama Arif Oct. 28, 2024, 8:51 p.m. UTC | #14
On 28/10/2024 20:42, Barry Song wrote:
> [...]
>
> I understand. It looks like we have two issues to tackle:
> 1. We shouldn't let zeromap swap in or out anything that vanishes into
> a black hole
> 2. We want to find out how much I/O/memory has been saved due to zeromap so far
> 
> From my perspective, issue 1 requires a "fix", while issue 2 is more
> of an optimization.

Hmm, I don't understand why point 1 would be an issue.

If it's discarded, that's fine as far as I can see.

As a reference, memory.stat.zswapped != memory.stat.zswpout - memory.stat.zswpin,
because zswapped takes into account swapped-out anon memory being freed, MADV_FREE,
shmem truncate, etc., as Yosry said about zeromap, but zswpout and zswpin don't.
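
(As an illustrative example with made-up numbers: if 10 pages go out through
zswap, 2 of them are later swapped back in, and 3 more are freed via
MADV_FREE, then zswpout = 10 and zswpin = 2, while zswapped reflects only
the 5 pages still held.)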


Barry Song Oct. 28, 2024, 9:15 p.m. UTC | #15
On Tue, Oct 29, 2024 at 4:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > [...]
> > I understand. It looks like we have two issues to tackle:
> > 1. We shouldn't let zeromap swap in or out anything that vanishes into
> > a black hole
> > 2. We want to find out how much I/O/memory has been saved due to zeromap so far
> >
> > From my perspective, issue 1 requires a "fix", while issue 2 is more
> > of an optimization.
>
> Hmm, I don't understand why point 1 would be an issue.
>
> If it's discarded, that's fine as far as I can see.

it is fine to you, and probably to me, who know zeromap well :-) but any
userspace code like the below might be entirely confused:

p = malloc(1G);
memset(p, 0, 1G);          /* or write only part of p to zero */
madvise(p, 1G, MADV_PAGEOUT);
read p to swap it back in

The entire procedure used to involve 1 GB of swap-out and 1 GB of swap-in by
any means. Now it records zero swap events.

I don't expect userspace to be as smart as you :-)

>
> As a reference, memory.stat.zswapped != memory.stat.zswpout - memory.stat.zswpin,
> because zswapped takes into account swapped-out anon memory being freed, MADV_FREE,
> shmem truncate, etc., as Yosry said about zeromap, but zswpout and zswpin don't.

I understand. However, I believe what we really need to focus on is this:
if we've swapped out, for instance, 100 GB in the past hour, how much of that
100 GB is zero? This information can help us assess the proportion of zero
data in the workload, along with the potential benefits that zeromap can
provide for memory, I/O space, or read/write operations. Additionally, having
the second count can enhance accuracy when considering MADV_DONTNEED,
MADV_FREE, truncate, and so on.


Thanks
 Barry
Usama Arif Oct. 28, 2024, 9:24 p.m. UTC | #16
On 28/10/2024 21:15, Barry Song wrote:
>>> [...]
>>> I understand. It looks like we have two issues to tackle:
>>> 1. We shouldn't let zeromap swap in or out anything that vanishes into
>>> a black hole
>>> 2. We want to find out how much I/O/memory has been saved due to zeromap so far
>>>
>>> From my perspective, issue 1 requires a "fix", while issue 2 is more
>>> of an optimization.
>>
>> Hmm, I don't understand why point 1 would be an issue.
>>
>> If it's discarded, that's fine as far as I can see.
> 
> it is fine to you, and probably to me, who know zeromap well :-) but any
> userspace code like the below might be entirely confused:
>
> p = malloc(1G);
> memset(p, 0, 1G);          /* or write only part of p to zero */
> madvise(p, 1G, MADV_PAGEOUT);
> read p to swap it back in
>
> The entire procedure used to involve 1 GB of swap-out and 1 GB of swap-in by
> any means. Now it records zero swap events.
>
> I don't expect userspace to be as smart as you :-)
> 
Ah, I completely agree, we need to account for it in some metric. I probably
misunderstood when you said "We shouldn't let zeromap swap in or out anything that
vanishes into a black hole" as meaning we should not have the zeromap optimization
for those cases. What I guess you meant is that we need to account for it in some metric.

>>
>> As a reference, memory.stat.zswapped != memory.stat.zswpout - memory.stat.zswpin,
>> because zswapped takes into account swapped-out anon memory being freed, MADV_FREE,
>> shmem truncate, etc., as Yosry said about zeromap, but zswpout and zswpin don't.
> 
> I understand. However, I believe what we really need to focus on is this:
> if we've swapped out, for instance, 100 GB in the past hour, how much of that
> 100 GB is zero? This information can help us assess the proportion of zero
> data in the workload, along with the potential benefits that zeromap can
> provide for memory, I/O space, or read/write operations. Additionally, having
> the second count can enhance accuracy when considering MADV_DONTNEED,
> MADV_FREE, truncate, and so on.
> 
Yes completely agree!

I think we can look into adding all three metrics, zeromap_swapped, zeromap_swpout,
zeromap_swpin (or whatever name works).

Barry Song Oct. 28, 2024, 9:40 p.m. UTC | #17
On Tue, Oct 29, 2024 at 5:24 AM Usama Arif <usamaarif642@gmail.com> wrote:
> [...]
>
> Yes completely agree!
>
> I think we can look into adding all three metrics, zeromap_swapped, zeromap_swpout,
> zeromap_swpin (or whatever name works).

It's great to reach an agreement. Let me work on some patches for it.

By the way, I recently had an idea: if we can conduct the zeromap check
earlier, for example before allocating swap slots and calling pageout(), could
we completely eliminate swap slot occupation and allocation/release
for zeromap data? For example, we could use a special swap
entry value in the PTE to indicate zero content and directly fill it with
zeros when swapping back. We've observed that swap slot allocation and
freeing can consume a lot of CPU and slow down functions like
zap_pte_range and swap-in. If we can entirely skip these steps, it
could improve performance. However, I'm uncertain about the benefits we
would gain if we only have 1-2% zeromap data.
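
(To make the idea concrete, a toy userspace model; purely illustrative, the
marker value and every name below are made up, and a real PTE encoding would
of course look very different:)

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define PTE_ZERO_SWAP 0xdeadUL	/* hypothetical special swap entry */

static uint64_t next_slot = 1;

/* "Swap out" a page: zero-filled pages get the marker, no slot allocated */
static uint64_t swap_out(const char *page, size_t len)
{
	bool zero = true;

	for (size_t i = 0; i < len; i++) {
		if (page[i]) {
			zero = false;
			break;
		}
	}
	return zero ? PTE_ZERO_SWAP : next_slot++;
}

/* "Swap in": the marker means fill with zeros, no I/O and no slot to free */
static void swap_in(uint64_t pte, char *page, size_t len)
{
	if (pte == PTE_ZERO_SWAP)
		memset(page, 0, len);
	/* else: read from the swap device at slot `pte` */
}

int main(void)
{
	char page[4096] = {0}, back[4096] = {1};
	uint64_t pte = swap_out(page, sizeof(page));

	swap_in(pte, back, sizeof(back));
	printf("slot allocated: %s\n", pte == PTE_ZERO_SWAP ? "no" : "yes");
	return 0;
}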

I'm just putting this idea out there to see if you're interested in moving
forward with it. :-)


Thanks
Barry
Usama Arif Oct. 28, 2024, 9:49 p.m. UTC | #18
On 28/10/2024 21:40, Barry Song wrote:
>> [...]
>> Yes completely agree!
>>
>> I think we can look into adding all three metrics, zeromap_swapped, zeromap_swpout,
>> zeromap_swpin (or whatever name works).
> 
> It's great to reach an agreement. Let me work on some patches for it.

Thanks!

> 
> By the way, I recently had an idea: if we can conduct the zeromap check
> earlier, for example before allocating swap slots and calling pageout(), could
> we completely eliminate swap slot occupation and allocation/release
> for zeromap data? For example, we could use a special swap
> entry value in the PTE to indicate zero content and directly fill it with
> zeros when swapping back. We've observed that swap slot allocation and
> freeing can consume a lot of CPU and slow down functions like
> zap_pte_range and swap-in. If we can entirely skip these steps, it
> could improve performance. However, I'm uncertain about the benefits we
> would gain if we only have 1-2% zeromap data.

If I remember correctly, this was one of the ideas floated around in the
initial version of the zeromap series, but it was evaluated as a lot more
complicated to do than what the current zeromap code looks like. But I
think it's definitely worth looking into!
 
Barry Song Oct. 28, 2024, 10:11 p.m. UTC | #19
On Tue, Oct 29, 2024 at 5:49 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 28/10/2024 21:40, Barry Song wrote:
> > On Tue, Oct 29, 2024 at 5:24 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 28/10/2024 21:15, Barry Song wrote:
> >>> On Tue, Oct 29, 2024 at 4:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 28/10/2024 20:42, Barry Song wrote:
> >>>>> On Tue, Oct 29, 2024 at 4:00 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 28/10/2024 19:54, Barry Song wrote:
> >>>>>>> On Tue, Oct 29, 2024 at 1:20 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 28/10/2024 17:08, Yosry Ahmed wrote:
> >>>>>>>>> On Mon, Oct 28, 2024 at 10:00 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 28/10/2024 16:33, Nhat Pham wrote:
> >>>>>>>>>>> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I wonder if instead of having counters, it might be better to keep track
> >>>>>>>>>>>> of the number of zeropages currently stored in zeromap, similar to how
> >>>>>>>>>>>> zswap_same_filled_pages did it. It will be more complicated then this
> >>>>>>>>>>>> patch, but would give more insight of the current state of the system.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Joshua (in CC) was going to have a look at that.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't think one can substitute for the other.
> >>>>>>>>>>
> >>>>>>>>>> Yes agreed, they have separate uses and provide different information, but
> >>>>>>>>>> maybe wasteful to have both types of counters? They are counters so maybe
> >>>>>>>>>> dont consume too much resources but I think we should still think about
> >>>>>>>>>> it..
> >>>>>>>>>
> >>>>>>>>> Not for or against here, but I would say that statement is debatable
> >>>>>>>>> at best for memcg stats :)
> >>>>>>>>>
> >>>>>>>>> Each new counter consumes 2 longs per-memcg per-CPU (see
> >>>>>>>>> memcg_vmstats_percpu), about 16 bytes, which is not a lot but it can
> >>>>>>>>> quickly add up with a large number of CPUs/memcgs/stats.
> >>>>>>>>>
> >>>>>>>>> Also, when flushing the stats we iterate all of them to propagate
> >>>>>>>>> updates from per-CPU counters. This is already a slowpath so adding
> >>>>>>>>> one stat is not a big deal, but again because we iterate all stats on
> >>>>>>>>> multiple CPUs (and sometimes on each node as well), the overall flush
> >>>>>>>>> latency becomes a concern sometimes.
> >>>>>>>>>
> >>>>>>>>> All of that is not to say we shouldn't add more memcg stats, but we
> >>>>>>>>> have to be mindful of the resources.
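
(For scale, with purely illustrative numbers: 16 bytes x 256 CPUs x 1,000
memcgs is already ~4 MB of per-CPU counter memory for a single new stat, and
every one of those per-CPU slots is visited on each flush.)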
> >>>>>>>>
> >>>>>>>> Yes agreed! Plus the cost of incrementing similar counters (which of course is
> >>>>>>>> also not much).
> >>>>>>>>
> >>>>>>>> Not trying to block this patch in any way. I just think it's a good point
> >>>>>>>> to discuss here whether we are OK with both types of counters, and if it's
> >>>>>>>> too wasteful, which one we should have.
> >>>>>>>
> >>>>>>> Hi Usama,
> >>>>>>> my point is that, with all three of the counters below:
> >>>>>>> 1. PSWPIN/PSWPOUT
> >>>>>>> 2. ZSWPIN/ZSWPOUT
> >>>>>>> 3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever the name)
> >>>>>>>
> >>>>>>> Shouldn't we have been able to determine the portion of zeromap
> >>>>>>> swap indirectly?
> >>>>>>>
> >>>>>>
> >>>>>> Hmm, I might be wrong, but I would have thought no?
> >>>>>>
> >>>>>> What if you swap out a zero folio, but then discard it?
> >>>>>> zeromap_swpout would be incremented, but zeromap_swpin would not.
> >>>>>
> >>>>> I understand. It looks like we have two issues to tackle:
> >>>>> 1. We shouldn't let zeromap swap in or out anything that vanishes into
> >>>>> a black hole
> >>>>> 2. We want to find out how much I/O/memory has been saved due to zeromap so far
> >>>>>
> >>>>> From my perspective, issue 1 requires a "fix", while issue 2 is more
> >>>>> of an optimization.
> >>>>
> >>>> Hmm, I don't understand why point 1 would be an issue.
> >>>>
> >>>> If it's discarded, that's fine as far as I can see.
> >>>
> >>> It is fine to you, and probably to me, since we both know zeromap :-) but
> >>> any userspace code like the one below might be entirely confused:
> >>>
> >>> p = malloc(1G);
> >>> memset(p, 0, 1G);   /* or zero only part of p */
> >>> madv_pageout(p, 1G);
> >>> read p to swap it back in.
> >>>
> >>> The entire procedure used to show up as 1GB of swap-out and 1GB of swap-in,
> >>> whatever the backend. Now, zero swaps are counted.
> >>>
> >>> I don't expect userspace to be as smart as you :-)
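
Spelled out as a runnable sketch (assuming Linux 5.4+ and a libc that defines
MADV_PAGEOUT, with swap enabled; compare pswpin/pswpout in /proc/vmstat
before and after):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ (1UL << 30)                          /* 1 GB */

int main(void)
{
        unsigned char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long i, sum = 0;

        if (p == MAP_FAILED)
                return 1;
        memset(p, 0, SZ);                       /* dirty every page with zeros */
        if (madvise(p, SZ, MADV_PAGEOUT))       /* push the whole range to swap */
                perror("madvise");
        for (i = 0; i < SZ; i += 4096)
                sum += p[i];                    /* touch every page: swap back in */
        printf("%lu\n", sum);                   /* always 0 */
        return 0;
}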
> >>>
> >> Ah, I completely agree, we need to account for it in some metric. I probably
> >> misunderstood "We shouldn't let zeromap swap in or out anything that vanishes
> >> into a black hole" as meaning we should not apply the zeromap optimization in
> >> those cases. What you meant, I guess, is that we need to account for it in
> >> some metric.
> >>
> >>>>
> >>>> As a reference, memory.stat.zswapped != memory.stat.zswpout - memory.stat.zswpin.
> >>>> Because zswapped would take into account swapped-out anon memory that is later
> >>>> freed, MADV_FREE, shmem truncation, etc. (as Yosry said about zeromap), but
> >>>> zswpout and zswpin don't.
> >>>
> >>> I understand. However, I believe what we really need to focus on is
> >>> this: if we’ve
> >>> swapped out, for instance, 100GB in the past hour, how much of that 100GB is
> >>> zero? This information can help us assess the proportion of zero data in the
> >>> workload, along with the potential benefits that zeromap can provide for memory,
> >>> I/O space, or read/write operations. Additionally, having the second count
> >>> can enhance accuracy when considering MADV_DONTNEED, FREE, TRUNCATE,
> >>> and so on.
> >>>
> >> Yes completely agree!
> >>
> >> I think we can look into adding all three metrics, zeromap_swapped, zeromap_swpout,
> >> zeromap_swpin (or whatever name works).
> >
> > It's great to reach an agreement. Let me work on some patches for it.
>
> Thanks!
>
> >
> > By the way, I recently had an idea: if we can conduct the zeromap check
> > earlier - for example - before allocating swap slots and pageout(), could
> > we completely eliminate swap slot occupation and allocation/release
> > for zeromap data? For example, we could use a special swap
> > entry value in the PTE to indicate zero content and directly fill it with
> > zeros when swapping back. We've observed that swap slot allocation and
> > freeing can consume a lot of CPU and slow down functions like
> > zap_pte_range and swap-in. If we can entirely skip these steps, it
> > could improve performance. However, I'm uncertain about the benefits we
> > would gain if we only have 1-2% zeromap data.
>
> If I remember correctly, this was one of the ideas floated around in the
> initial version of the zeromap series, but it was evaluated as a lot more
> complicated to do than what the current zeromap code looks like. But I
> think it's definitely worth looking into!

Sorry for the noise. I didn't review the initial discussion. But my feeling
is that it might be valuable considering the report from Zhiguo:

https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/

In fact, our recent benchmark also indicates that swap freeing can account
for a significant portion of the time spent in do_swap_page().

>
> >
> > I'm just putting this idea out there to see if you're interested in moving
> > forward with it. :-)
> >
> >>
> >>>>
> >>>>
> >>>>>
> >>>>> I consider issue 1 to be more critical because, after observing a phone
> >>>>> running for some time, I've been able to roughly estimate the portion
> >>>>> zeromap can
> >>>>> help save using only PSWPOUT, ZSWPOUT, and SWAPOUT_SKIP, even without a
> >>>>> SWPIN counter. However, I agree that issue 2 still holds significant value
> >>>>> as a separate patch.
> >>>>>
> >>>
> >

Thanks
Barry
Yosry Ahmed Oct. 28, 2024, 10:32 p.m. UTC | #20
[..]
> > > By the way, I recently had an idea: if we can conduct the zeromap check
> > > earlier - for example - before allocating swap slots and pageout(), could
> > > we completely eliminate swap slot occupation and allocation/release
> > > for zeromap data? For example, we could use a special swap
> > > entry value in the PTE to indicate zero content and directly fill it with
> > > zeros when swapping back. We've observed that swap slot allocation and
> > > freeing can consume a lot of CPU and slow down functions like
> > > zap_pte_range and swap-in. If we can entirely skip these steps, it
> > > could improve performance. However, I'm uncertain about the benefits we
> > > would gain if we only have 1-2% zeromap data.
> >
> > If I remember correctly, this was one of the ideas floated around in the
> > initial version of the zeromap series, but it was evaluated as a lot more
> > complicated to do than what the current zeromap code looks like. But I
> > think it's definitely worth looking into!

Yup, I did suggest this on the first version:
https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/

, and Usama took a stab at implementing it in the second version:
https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/

David and Shakeel pointed out a few problems. I think they are
fixable, but the complexity/benefit tradeoff was getting unclear at
that point.

If we can make it work without too much complexity, that would be
great of course.

>
> Sorry for the noise. I didn't review the initial discussion. But my feeling
> is that it might be valuable considering the report from Zhiguo:
>
> https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/
>
> In fact, our recent benchmark also indicates that swap freeing can account
> for a significant portion of the time spent in do_swap_page().

As Shakeel mentioned in a reply to Usama's patch mentioned above, we
would need to check the contents of the page after it's unmapped. So
likely we need to allocate a swap slot, walk the rmap and unmap, check
contents, walk the rmap again and update the PTEs, free the swap slot.

So the swap free will be essentially moved from the fault path to the
reclaim path, not eliminated. It may still be worth it, not sure. We
also need to make sure we keep the rmap intact after the first walk
and unmap in case we need to go back and update the PTEs again.

Overall, I think the complexity is unlikely to be low.
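
In other words, the ordering would be roughly (a sketch only; these helper
names are illustrative stand-ins, not the real rmap/swap APIs):

	entry = swap_slot_alloc(folio);           /* 1. slot first */
	rmap_walk_unmap(folio, entry);            /* 2. PTEs -> swap entry */
	if (folio_is_all_zeroes(folio)) {         /* 3. content is now stable */
		rmap_walk_set_zero_marker(folio); /* 4. second walk, rewrite PTEs */
		swap_slot_free(entry);            /* 5. free moves to reclaim path */
	}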
Barry Song Oct. 28, 2024, 10:51 p.m. UTC | #21
On Tue, Oct 29, 2024 at 6:33 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > > By the way, I recently had an idea: if we can conduct the zeromap check
> > > > earlier - for example - before allocating swap slots and pageout(), could
> > > > we completely eliminate swap slot occupation and allocation/release
> > > > for zeromap data? For example, we could use a special swap
> > > > entry value in the PTE to indicate zero content and directly fill it with
> > > > zeros when swapping back. We've observed that swap slot allocation and
> > > > freeing can consume a lot of CPU and slow down functions like
> > > > zap_pte_range and swap-in. If we can entirely skip these steps, it
> > > > could improve performance. However, I'm uncertain about the benefits we
> > > > would gain if we only have 1-2% zeromap data.
> > >
> > > If I remember correctly, this was one of the ideas floated around in the
> > > initial version of the zeromap series, but it was evaluated as a lot more
> > > complicated to do than what the current zeromap code looks like. But I
> > > think it's definitely worth looking into!
>
> Yup, I did suggest this on the first version:
> https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/
>
> , and Usama took a stab at implementing it in the second version:
> https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/
>
> David and Shakeel pointed out a few problems. I think they are
> fixable, but the complexity/benefit tradeoff was getting unclear at
> that point.
>
> If we can make it work without too much complexity, that would be
> great of course.
>
> >
> > Sorry for the noise. I didn't review the initial discussion. But my feeling
> > is that it might be valuable considering the report from Zhiguo:
> >
> > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/
> >
> > In fact, our recent benchmark also indicates that swap freeing can account
> > for a significant portion of the time spent in do_swap_page().
>
> As Shakeel mentioned in a reply to Usama's patch mentioned above, we
> would need to check the contents of the page after it's unmapped. So
> likely we need to allocate a swap slot, walk the rmap and unmap, check
> contents, walk the rmap again and update the PTEs, free the swap slot.
>

So the issue is that we can't check the content before allocating slots and
unmapping during reclamation? If we find the content is zero, can we skip
all slot operations and go directly to rmap/unmap by using a special PTE?

> So the swap free will be essentially moved from the fault path to the
> reclaim path, not eliminated. It may still be worth it, not sure. We
> also need to make sure we keep the rmap intact after the first walk
> and unmap in case we need to go back and update the PTEs again.
>
> Overall, I think the complexity is unlikely to be low.
Yosry Ahmed Oct. 28, 2024, 10:54 p.m. UTC | #22
On Mon, Oct 28, 2024 at 3:52 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 29, 2024 at 6:33 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > [..]
> > > > > By the way, I recently had an idea: if we can conduct the zeromap check
> > > > > earlier - for example - before allocating swap slots and pageout(), could
> > > > > we completely eliminate swap slot occupation and allocation/release
> > > > > for zeromap data? For example, we could use a special swap
> > > > > entry value in the PTE to indicate zero content and directly fill it with
> > > > > zeros when swapping back. We've observed that swap slot allocation and
> > > > > freeing can consume a lot of CPU and slow down functions like
> > > > > zap_pte_range and swap-in. If we can entirely skip these steps, it
> > > > > could improve performance. However, I'm uncertain about the benefits we
> > > > > would gain if we only have 1-2% zeromap data.
> > > >
> > > > If I remember correctly, this was one of the ideas floated around in the
> > > > initial version of the zeromap series, but it was evaluated as a lot more
> > > > complicated to do than what the current zeromap code looks like. But I
> > > > think it's definitely worth looking into!
> >
> > Yup, I did suggest this on the first version:
> > https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/
> >
> > , and Usama took a stab at implementing it in the second version:
> > https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/
> >
> > David and Shakeel pointed out a few problems. I think they are
> > fixable, but the complexity/benefit tradeoff was getting unclear at
> > that point.
> >
> > If we can make it work without too much complexity, that would be
> > great of course.
> >
> > >
> > > Sorry for the noise. I didn't review the initial discussion. But my feeling
> > > is that it might be valuable considering the report from Zhiguo:
> > >
> > > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/
> > >
> > > In fact, our recent benchmark also indicates that swap freeing can account
> > > for a significant portion of the time spent in do_swap_page().
> >
> > As Shakeel mentioned in a reply to Usama's patch mentioned above, we
> > would need to check the contents of the page after it's unmapped. So
> > likely we need to allocate a swap slot, walk the rmap and unmap, check
> > contents, walk the rmap again and update the PTEs, free the swap slot.
> >
>
> So the issue is that we can't check the content before allocating slots and
> unmapping during reclamation? If we find the content is zero, can we skip
> all slot operations and go directly to rmap/unmap by using a special PTE?

We need to unmap first before checking the content, otherwise the
content can change right after we check it.
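
A concrete interleaving of the race (hypothetical timeline):

	/*
	 * reclaim (check before unmap)         application
	 * ----------------------------         -----------
	 * reads page: all zeroes
	 *                                      *p = 0xff; (PTE still writable)
	 * unmaps, records "zero" marker
	 *   -> the 0xff store is silently lost
	 */

Unmapping first closes that window: once the PTE is gone, the write faults
and waits, so the content check that follows is stable.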
Barry Song Oct. 28, 2024, 11:03 p.m. UTC | #23
On Tue, Oct 29, 2024 at 6:54 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Oct 28, 2024 at 3:52 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 6:33 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > [..]
> > > > > > By the way, I recently had an idea: if we can conduct the zeromap check
> > > > > > earlier - for example - before allocating swap slots and pageout(), could
> > > > > > we completely eliminate swap slot occupation and allocation/release
> > > > > > for zeromap data? For example, we could use a special swap
> > > > > > entry value in the PTE to indicate zero content and directly fill it with
> > > > > > zeros when swapping back. We've observed that swap slot allocation and
> > > > > > freeing can consume a lot of CPU and slow down functions like
> > > > > > zap_pte_range and swap-in. If we can entirely skip these steps, it
> > > > > > could improve performance. However, I'm uncertain about the benefits we
> > > > > > would gain if we only have 1-2% zeromap data.
> > > > >
> > > > > If I remember correctly, this was one of the ideas floated around in the
> > > > > initial version of the zeromap series, but it was evaluated as a lot more
> > > > > complicated to do than what the current zeromap code looks like. But I
> > > > > think it's definitely worth looking into!
> > >
> > > Yup, I did suggest this on the first version:
> > > https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/
> > >
> > > , and Usama took a stab at implementing it in the second version:
> > > https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/
> > >
> > > David and Shakeel pointed out a few problems. I think they are
> > > fixable, but the complexity/benefit tradeoff was getting unclear at
> > > that point.
> > >
> > > If we can make it work without too much complexity, that would be
> > > great of course.
> > >
> > > >
> > > > Sorry for the noise. I didn't review the initial discussion. But my feeling
> > > > is that it might be valuable considering the report from Zhiguo:
> > > >
> > > > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/
> > > >
> > > > In fact, our recent benchmark also indicates that swap freeing can account
> > > > for a significant portion of the time spent in do_swap_page().
> > >
> > > As Shakeel mentioned in a reply to Usama's patch mentioned above, we
> > > would need to check the contents of the page after it's unmapped. So
> > > likely we need to allocate a swap slot, walk the rmap and unmap, check
> > > contents, walk the rmap again and update the PTEs, free the swap slot.
> > >
> >
> > So the issue is that we can't check the content before allocating slots and
> > unmapping during reclamation? If we find the content is zero, can we skip
> > all slot operations and go directly to rmap/unmap by using a special PTE?
>
> We need to unmap first before checking the content, otherwise the
> content can change right after we check it.

Well, do we have a way to abort the unmap if we find pte_dirty, and to ensure
the folio is still mapped after try_to_unmap_one()? Then we could activate it
again after try_to_unmap.

It might just be noise. Let me take some more time to think about it. :-)
Nhat Pham Oct. 29, 2024, 5:46 p.m. UTC | #24
On Mon, Oct 28, 2024 at 4:03 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Oct 29, 2024 at 6:54 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Oct 28, 2024 at 3:52 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 6:33 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > [..]
> > > > > > > By the way, I recently had an idea: if we can conduct the zeromap check
> > > > > > > earlier - for example - before allocating swap slots and pageout(), could
> > > > > > > we completely eliminate swap slot occupation and allocation/release
> > > > > > > for zeromap data? For example, we could use a special swap
> > > > > > > entry value in the PTE to indicate zero content and directly fill it with
> > > > > > > zeros when swapping back. We've observed that swap slot allocation and
> > > > > > > freeing can consume a lot of CPU and slow down functions like
> > > > > > > zap_pte_range and swap-in. If we can entirely skip these steps, it
> > > > > > > could improve performance. However, I'm uncertain about the benefits we
> > > > > > > would gain if we only have 1-2% zeromap data.
> > > > > >
> > > > > > If I remember correctly, this was one of the ideas floated around in the
> > > > > > initial version of the zeromap series, but it was evaluated as a lot more
> > > > > > complicated to do than what the current zeromap code looks like. But I
> > > > > > think it's definitely worth looking into!
> > > >
> > > > Yup, I did suggest this on the first version:
> > > > https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/
> > > >
> > > > , and Usama took a stab at implementing it in the second version:
> > > > https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/
> > > >
> > > > David and Shakeel pointed out a few problems. I think they are
> > > > fixable, but the complexity/benefit tradeoff was getting unclear at
> > > > that point.
> > > >
> > > > If we can make it work without too much complexity, that would be
> > > > great of course.
> > > >
> > > > >
> > > > > Sorry for the noise. I didn't review the initial discussion. But my feeling
> > > > > is that it might be valuable considering the report from Zhiguo:
> > > > >
> > > > > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/
> > > > >
> > > > > In fact, our recent benchmark also indicates that swap freeing can account
> > > > > for a significant portion of the time spent in do_swap_page().
> > > >
> > > > As Shakeel mentioned in a reply to Usama's patch mentioned above, we
> > > > would need to check the contents of the page after it's unmapped. So
> > > > likely we need to allocate a swap slot, walk the rmap and unmap, check
> > > > contents, walk the rmap again and update the PTEs, free the swap slot.
> > > >
> > >
> > > So the issue is that we can't check the content before allocating slots and
> > > unmapping during reclamation? If we find the content is zero, can we skip
> > > all slot operations and go directly to rmap/unmap by using a special PTE?
> >
> > We need to unmap first before checking the content, otherwise the
> > content can change right after we check it.
>
> Well, do we have a way to abort the unmap if we find pte_dirty, and to ensure
> the folio is still mapped after try_to_unmap_one()? Then we could activate it
> again after try_to_unmap.
>
> It might just be noise. Let me take some more time to think about it. :-)

FWIW, the swap abstraction layer Yosry proposed last year (and I'm
working on right now) will allow you to store these zeromapped swap
entries without requiring any swap slots allocated on the swapfile.
It's basically the same thing as swap/zswap decoupling.

Not stopping you guys from optimizing it, since all I have right now
is a (most certainly buggy) prototype, plus there might be benefits if we
can get around the swap subsystem altogether for these zeromapped
entries. Just letting you know there's a backup plan :)
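
One way to picture that indirection (purely illustrative; not the actual
prototype's data structures):

	/* a virtual swap slot that may need no swapfile offset at all */
	struct vswap_desc {
		enum { VSWAP_ZERO, VSWAP_ZSWAP, VSWAP_SWAPFILE } where;
		union {
			struct zswap_entry *zswap;  /* compressed copy */
			swp_entry_t slot;           /* real swapfile slot */
		};                                  /* VSWAP_ZERO needs neither */
	};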
Yosry Ahmed Oct. 29, 2024, 5:55 p.m. UTC | #25
> FWIW, the swap abstraction layer Yosry proposed last year (and I'm
> working on right now) will allow you to store these zeromapped swap
> entries without requiring any swap slots allocated on the swapfile.
> It's basically the same thing as swap/zswap decoupling.

I don't know if I said this before, but thank you for keeping this
alive. Much appreciated :)

>
> Not stopping you guys from optimizing it, since all I have right now
> is a (most certainly buggy) prototype + there might be benefits if we
> can get around the swap subsystem altogether for these zero mapped
> entries. Just letting you know there's a backup plan :)
Nhat Pham Oct. 30, 2024, 11:46 p.m. UTC | #26
On Tue, Oct 29, 2024 at 10:55 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > FWIW, the swap abstraction layer Yosry proposed last year (and I'm
> > working on right now) will allow you to store these zeromapped swap
> > entries without requiring any swap slots allocated on the swapfile.
> > It's basically the same thing as swap/zswap decoupling.
>
> I don't know if I said this before, but thank you for keeping this
> alive. Much appreciated :)
>

Of course, it's my pleasure :) I'll be sure to cc you when I send the
prototype out as an RFC - your input is much appreciated, as always :)

Patch

diff --git a/mm/page_io.c b/mm/page_io.c
index 5d9b6e6cf96c..90c5ea870038 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -226,6 +226,19 @@  static void swap_zeromap_folio_clear(struct folio *folio)
 	}
 }
 
+static inline void count_swpout_vm_event(struct folio *folio)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (unlikely(folio_test_pmd_mappable(folio))) {
+		count_memcg_folio_events(folio, THP_SWPOUT, 1);
+		count_vm_event(THP_SWPOUT);
+	}
+#endif
+	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
+	count_memcg_folio_events(folio, PSWPOUT, folio_nr_pages(folio));
+	count_vm_events(PSWPOUT, folio_nr_pages(folio));
+}
+
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.
@@ -258,6 +271,7 @@  int swap_writepage(struct page *page, struct writeback_control *wbc)
 	 */
 	if (is_folio_zero_filled(folio)) {
 		swap_zeromap_folio_set(folio);
+		count_swpout_vm_event(folio);
 		folio_unlock(folio);
 		return 0;
 	} else {
@@ -282,19 +296,6 @@  int swap_writepage(struct page *page, struct writeback_control *wbc)
 	return 0;
 }
 
-static inline void count_swpout_vm_event(struct folio *folio)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (unlikely(folio_test_pmd_mappable(folio))) {
-		count_memcg_folio_events(folio, THP_SWPOUT, 1);
-		count_vm_event(THP_SWPOUT);
-	}
-#endif
-	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
-	count_memcg_folio_events(folio, PSWPOUT, folio_nr_pages(folio));
-	count_vm_events(PSWPOUT, folio_nr_pages(folio));
-}
-
 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
 static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
 {
@@ -621,6 +622,9 @@  void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 	delayacct_swapin_start();
 
 	if (swap_read_folio_zeromap(folio)) {
+		count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
+		count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
+		count_vm_events(PSWPIN, folio_nr_pages(folio));
 		folio_unlock(folio);
 		goto finish;
 	} else if (zswap_load(folio)) {