
[v2] mm: count zeromap read and set for swapout and swapin

Message ID 20241102101240.35072-1-21cnbao@gmail.com (mailing list archive)
State New
Series [v2] mm: count zeromap read and set for swapout and swapin

Commit Message

Barry Song Nov. 2, 2024, 10:12 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

When the proportion of folios from the zero map is small, missing their
accounting may not significantly impact profiling. However, it’s easy
to construct a scenario where this becomes an issue—for example,
allocating 1 GB of memory, writing zeros from userspace, followed by
MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
and swap-in counts seem to vanish into a black hole, potentially
causing semantic ambiguity.
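
A minimal userspace sketch of that scenario (illustration only: error
handling is omitted, and MADV_PAGEOUT is assumed to be exposed by the
installed headers, i.e. Linux 5.4+) looks roughly like this:

  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t sz = 1UL << 30;                        /* 1 GiB */
          char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          volatile char sink = 0;

          memset(p, 0, sz);                             /* zero-filled pages */
          madvise(p, sz, MADV_PAGEOUT);                 /* reclaim them to swap */
          for (size_t i = 0; i < sz; i += sysconf(_SC_PAGESIZE))
                  sink += p[i];                         /* fault them back in */
          munmap(p, sz);
          return 0;
  }

An anonymous mmap() is used instead of malloc() only so that the range is
page-aligned for madvise().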

We have two ways to address this:

1. Add a separate counter specifically for the zero map.
2. Continue using the current accounting, treating the zero map like
a normal backend. (This aligns with the current behavior of zRAM
when supporting same-page fills at the device level.)

This patch adopts option 1 because the pswpin/pswpout counters only
apply to I/O done directly to the backend device (as noted by
Nhat Pham).

We can find these counters in /proc/vmstat (counters for the whole
system) and in a memcg's memory.stat (counters for the memcg of interest).

For example:

$ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
swpin_zero 1648
swpout_zero 33536

$ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
swpin_zero 3905
swpout_zero 3985
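
For tooling that prefers not to shell out to grep, a minimal sketch
(illustration only) that scans /proc/vmstat for the two counters added
by this patch:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char name[64];
          unsigned long long val;
          FILE *f = fopen("/proc/vmstat", "r");

          if (!f)
                  return 1;
          /* every /proc/vmstat line is "<name> <value>" */
          while (fscanf(f, "%63s %llu", name, &val) == 2)
                  if (!strcmp(name, "swpin_zero") ||
                      !strcmp(name, "swpout_zero"))
                          printf("%s %llu\n", name, val);
          fclose(f);
          return 0;
  }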

Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 -v2:
 * add separate counters rather than using pswpin/out; thanks
 for the comments from Usama, David, Yosry and Nhat;
 * Usama also suggested a new counter like swapped_zero; I
 prefer to keep that as a separate enhancement patch rather than
 a hotfix, and will probably handle it later on.

 Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
 include/linux/vm_event_item.h           |  2 ++
 mm/memcontrol.c                         |  4 ++++
 mm/page_io.c                            | 16 ++++++++++++++++
 mm/vmstat.c                             |  2 ++
 5 files changed, 34 insertions(+)

Comments

Usama Arif Nov. 2, 2024, 12:32 p.m. UTC | #1
On 02/11/2024 10:12, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> When the proportion of folios from the zero map is small, missing their
> accounting may not significantly impact profiling. However, it’s easy
> to construct a scenario where this becomes an issue—for example,
> allocating 1 GB of memory, writing zeros from userspace, followed by
> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
> and swap-in counts seem to vanish into a black hole, potentially
> causing semantic ambiguity.
> 
> We have two ways to address this:
> 
> 1. Add a separate counter specifically for the zero map.
> 2. Continue using the current accounting, treating the zero map like
> a normal backend. (This aligns with the current behavior of zRAM
> when supporting same-page fills at the device level.)
> 
> This patch adopts option 1 as pswpin/pswpout counters are that they
> only apply to IO done directly to the backend device (as noted by
> Nhat Pham).
> 
> We can find these counters from /proc/vmstat (counters for the whole
> system) and memcg's memory.stat (counters for the interested memcg).
> 
> For example:
> 
> $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
> swpin_zero 1648
> swpout_zero 33536
> 
> $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
> swpin_zero 3905
> swpout_zero 3985
> 
> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
I don't think it's a hotfix (or even a fix). It was discussed in the initial
series to add these as a follow-up, and Joshua was going to do this soon.
It's not fixing any bug in the initial series.

> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: Chengming Zhou <chengming.zhou@linux.dev>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Chris Li <chrisl@kernel.org>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Kairui Song <kasong@tencent.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  -v2:
>  * add separate counters rather than using pswpin/out; thanks
>  for the comments from Usama, David, Yosry and Nhat;
>  * Usama also suggested a new counter like swapped_zero, I
>  prefer that one be separated as an enhancement patch not
>  a hotfix. will probably handle it later on.
> 
I don't think either of them would be a hotfix.

>  Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
>  include/linux/vm_event_item.h           |  2 ++
>  mm/memcontrol.c                         |  4 ++++
>  mm/page_io.c                            | 16 ++++++++++++++++
>  mm/vmstat.c                             |  2 ++
>  5 files changed, 34 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index db3799f1483e..984eb3c9d05b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1599,6 +1599,16 @@ The following nested keys are defined.
>  	  pglazyfreed (npn)
>  		Amount of reclaimed lazyfree pages
>  
> +	  swpin_zero
> +		Number of pages moved into memory with zero content, meaning no
> +		copy exists in the backend swapfile, allowing swap-in to avoid
> +		I/O read overhead.
> +
> +	  swpout_zero
> +		Number of pages moved out of memory with zero content, meaning no
> +		copy is needed in the backend swapfile, allowing swap-out to avoid
> +		I/O write overhead.
> +

Maybe zero-filled pages might be a better term in both. 

>  	  zswpin
>  		Number of pages moved in to memory from zswap.
>  
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index aed952d04132..f70d0958095c 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_SWAP
>  		SWAP_RA,
>  		SWAP_RA_HIT,
> +		SWPIN_ZERO,
> +		SWPOUT_ZERO,
>  #ifdef CONFIG_KSM
>  		KSM_SWPIN_COPY,
>  #endif
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5e44d6e7591e..7b3503d12aaf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -441,6 +441,10 @@ static const unsigned int memcg_vm_event_stat[] = {
>  	PGDEACTIVATE,
>  	PGLAZYFREE,
>  	PGLAZYFREED,
> +#ifdef CONFIG_SWAP
> +	SWPIN_ZERO,
> +	SWPOUT_ZERO,
> +#endif
>  #ifdef CONFIG_ZSWAP
>  	ZSWPIN,
>  	ZSWPOUT,
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 5d9b6e6cf96c..4b4ea8e49cf6 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio)
>  
>  static void swap_zeromap_folio_set(struct folio *folio)
>  {
> +	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
>  	struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +	int nr_pages = folio_nr_pages(folio);
>  	swp_entry_t entry;
>  	unsigned int i;
>  
> @@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio)
>  		entry = page_swap_entry(folio_page(folio, i));
>  		set_bit(swp_offset(entry), sis->zeromap);
>  	}
> +
> +	count_vm_events(SWPOUT_ZERO, nr_pages);
> +	if (objcg) {
> +		count_objcg_events(objcg, SWPOUT_ZERO, nr_pages);
> +		obj_cgroup_put(objcg);
> +	}
>  }
>  
>  static void swap_zeromap_folio_clear(struct folio *folio)
> @@ -507,6 +515,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
>  static bool swap_read_folio_zeromap(struct folio *folio)
>  {
>  	int nr_pages = folio_nr_pages(folio);
> +	struct obj_cgroup *objcg;
>  	bool is_zeromap;
>  
>  	/*
> @@ -521,6 +530,13 @@ static bool swap_read_folio_zeromap(struct folio *folio)
>  	if (!is_zeromap)
>  		return false;
>  
> +	objcg = get_obj_cgroup_from_folio(folio);
> +	count_vm_events(SWPIN_ZERO, nr_pages);
> +	if (objcg) {
> +		count_objcg_events(objcg, SWPIN_ZERO, nr_pages);
> +		obj_cgroup_put(objcg);
> +	}
> +
>  	folio_zero_range(folio, 0, folio_size(folio));
>  	folio_mark_uptodate(folio);
>  	return true;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 22a294556b58..c8ef7352f9ed 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1418,6 +1418,8 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_SWAP
>  	"swap_ra",
>  	"swap_ra_hit",
> +	"swpin_zero",
> +	"swpout_zero",
>  #ifdef CONFIG_KSM
>  	"ksm_swpin_copy",
>  #endif
Barry Song Nov. 2, 2024, 12:59 p.m. UTC | #2
On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 02/11/2024 10:12, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > When the proportion of folios from the zero map is small, missing their
> > accounting may not significantly impact profiling. However, it’s easy
> > to construct a scenario where this becomes an issue—for example,
> > allocating 1 GB of memory, writing zeros from userspace, followed by
> > MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
> > and swap-in counts seem to vanish into a black hole, potentially
> > causing semantic ambiguity.
> >
> > We have two ways to address this:
> >
> > 1. Add a separate counter specifically for the zero map.
> > 2. Continue using the current accounting, treating the zero map like
> > a normal backend. (This aligns with the current behavior of zRAM
> > when supporting same-page fills at the device level.)
> >
> > This patch adopts option 1 as pswpin/pswpout counters are that they
> > only apply to IO done directly to the backend device (as noted by
> > Nhat Pham).
> >
> > We can find these counters from /proc/vmstat (counters for the whole
> > system) and memcg's memory.stat (counters for the interested memcg).
> >
> > For example:
> >
> > $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
> > swpin_zero 1648
> > swpout_zero 33536
> >
> > $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
> > swpin_zero 3905
> > swpout_zero 3985
> >
> > Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
> I don't think its a hotfix (or even a fix). It was discussed in the initial
> series to add these as a follow up and Joshua was going to do this soon.
> Its not fixing any bug in the initial series.

I would prefer that all kernel versions with zeromap include this
counter; otherwise,
it could be confusing to determine where swap-in and swap-out have occurred,
as shown by the small program below:

p =malloc(1g);
write p to zero
madvise_pageout
read p;

Previously, there was 1GB of swap-in and swap-out activity reported, but
now nothing is shown.

I don't mean to suggest that there's a bug in the zeromap code; rather,
having this counter would help clear up any confusion.

I didn't realize Joshua was handling it. Is he still planning to? If
so, I can leave it
with Joshua if that was the plan :-)

>
> > Cc: Usama Arif <usamaarif642@gmail.com>
> > Cc: Chengming Zhou <chengming.zhou@linux.dev>
> > Cc: Yosry Ahmed <yosryahmed@google.com>
> > Cc: Nhat Pham <nphamcs@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Cc: Shakeel Butt <shakeel.butt@linux.dev>
> > Cc: Andi Kleen <ak@linux.intel.com>
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: Chris Li <chrisl@kernel.org>
> > Cc: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Kairui Song <kasong@tencent.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  -v2:
> >  * add separate counters rather than using pswpin/out; thanks
> >  for the comments from Usama, David, Yosry and Nhat;
> >  * Usama also suggested a new counter like swapped_zero, I
> >  prefer that one be separated as an enhancement patch not
> >  a hotfix. will probably handle it later on.
> >
> I dont think either of them would be a hotfix.

As mentioned above, this isn't about fixing a bug; it's simply to ensure
that swap-related metrics don't disappear.

>
> >  Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
> >  include/linux/vm_event_item.h           |  2 ++
> >  mm/memcontrol.c                         |  4 ++++
> >  mm/page_io.c                            | 16 ++++++++++++++++
> >  mm/vmstat.c                             |  2 ++
> >  5 files changed, 34 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index db3799f1483e..984eb3c9d05b 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1599,6 +1599,16 @@ The following nested keys are defined.
> >         pglazyfreed (npn)
> >               Amount of reclaimed lazyfree pages
> >
> > +       swpin_zero
> > +             Number of pages moved into memory with zero content, meaning no
> > +             copy exists in the backend swapfile, allowing swap-in to avoid
> > +             I/O read overhead.
> > +
> > +       swpout_zero
> > +             Number of pages moved out of memory with zero content, meaning no
> > +             copy is needed in the backend swapfile, allowing swap-out to avoid
> > +             I/O write overhead.
> > +
>
> Maybe zero-filled pages might be a better term in both.

Do you mean dropping "with zero content" and replacing it with
"Number of zero-filled pages moved out of memory"? I'm fine
with the change.

>
> >         zswpin
> >               Number of pages moved in to memory from zswap.
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index aed952d04132..f70d0958095c 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  #ifdef CONFIG_SWAP
> >               SWAP_RA,
> >               SWAP_RA_HIT,
> > +             SWPIN_ZERO,
> > +             SWPOUT_ZERO,
> >  #ifdef CONFIG_KSM
> >               KSM_SWPIN_COPY,
> >  #endif
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 5e44d6e7591e..7b3503d12aaf 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -441,6 +441,10 @@ static const unsigned int memcg_vm_event_stat[] = {
> >       PGDEACTIVATE,
> >       PGLAZYFREE,
> >       PGLAZYFREED,
> > +#ifdef CONFIG_SWAP
> > +     SWPIN_ZERO,
> > +     SWPOUT_ZERO,
> > +#endif
> >  #ifdef CONFIG_ZSWAP
> >       ZSWPIN,
> >       ZSWPOUT,
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index 5d9b6e6cf96c..4b4ea8e49cf6 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio)
> >
> >  static void swap_zeromap_folio_set(struct folio *folio)
> >  {
> > +     struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
> >       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> > +     int nr_pages = folio_nr_pages(folio);
> >       swp_entry_t entry;
> >       unsigned int i;
> >
> > @@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio)
> >               entry = page_swap_entry(folio_page(folio, i));
> >               set_bit(swp_offset(entry), sis->zeromap);
> >       }
> > +
> > +     count_vm_events(SWPOUT_ZERO, nr_pages);
> > +     if (objcg) {
> > +             count_objcg_events(objcg, SWPOUT_ZERO, nr_pages);
> > +             obj_cgroup_put(objcg);
> > +     }
> >  }
> >
> >  static void swap_zeromap_folio_clear(struct folio *folio)
> > @@ -507,6 +515,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
> >  static bool swap_read_folio_zeromap(struct folio *folio)
> >  {
> >       int nr_pages = folio_nr_pages(folio);
> > +     struct obj_cgroup *objcg;
> >       bool is_zeromap;
> >
> >       /*
> > @@ -521,6 +530,13 @@ static bool swap_read_folio_zeromap(struct folio *folio)
> >       if (!is_zeromap)
> >               return false;
> >
> > +     objcg = get_obj_cgroup_from_folio(folio);
> > +     count_vm_events(SWPIN_ZERO, nr_pages);
> > +     if (objcg) {
> > +             count_objcg_events(objcg, SWPIN_ZERO, nr_pages);
> > +             obj_cgroup_put(objcg);
> > +     }
> > +
> >       folio_zero_range(folio, 0, folio_size(folio));
> >       folio_mark_uptodate(folio);
> >       return true;
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 22a294556b58..c8ef7352f9ed 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1418,6 +1418,8 @@ const char * const vmstat_text[] = {
> >  #ifdef CONFIG_SWAP
> >       "swap_ra",
> >       "swap_ra_hit",
> > +     "swpin_zero",
> > +     "swpout_zero",
> >  #ifdef CONFIG_KSM
> >       "ksm_swpin_copy",
> >  #endif
>

Thanks
Barry
Usama Arif Nov. 2, 2024, 2:43 p.m. UTC | #3
On 02/11/2024 12:59, Barry Song wrote:
> On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 02/11/2024 10:12, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> When the proportion of folios from the zero map is small, missing their
>>> accounting may not significantly impact profiling. However, it’s easy
>>> to construct a scenario where this becomes an issue—for example,
>>> allocating 1 GB of memory, writing zeros from userspace, followed by
>>> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
>>> and swap-in counts seem to vanish into a black hole, potentially
>>> causing semantic ambiguity.
>>>
>>> We have two ways to address this:
>>>
>>> 1. Add a separate counter specifically for the zero map.
>>> 2. Continue using the current accounting, treating the zero map like
>>> a normal backend. (This aligns with the current behavior of zRAM
>>> when supporting same-page fills at the device level.)
>>>
>>> This patch adopts option 1 as pswpin/pswpout counters are that they
>>> only apply to IO done directly to the backend device (as noted by
>>> Nhat Pham).
>>>
>>> We can find these counters from /proc/vmstat (counters for the whole
>>> system) and memcg's memory.stat (counters for the interested memcg).
>>>
>>> For example:
>>>
>>> $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
>>> swpin_zero 1648
>>> swpout_zero 33536
>>>
>>> $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
>>> swpin_zero 3905
>>> swpout_zero 3985
>>>
>>> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
>> I don't think its a hotfix (or even a fix). It was discussed in the initial
>> series to add these as a follow up and Joshua was going to do this soon.
>> Its not fixing any bug in the initial series.
> 
> I would prefer that all kernel versions with zeromap include this
> counter; otherwise,
> it could be confusing to determine where swap-in and swap-out have occurred,
> as shown by the small program below:
> 
> p =malloc(1g);
> write p to zero
> madvise_pageout
> read p;
> 
> Previously, there was 1GB of swap-in and swap-out activity reported, but
> now nothing is shown.
> 
> I don't mean to suggest that there's a bug in the zeromap code; rather,
> having this counter would help clear up any confusion.
> 
> I didn't realize Joshua was handling it. Is he still planning to? If
> so, I can leave it
> with Joshua if that was the plan :-)
> 

Please do continue with this patch; I think he was going to look at the
swapped_zero version that we discussed earlier anyway. I'll let Joshua
comment on it.

>>
>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>> Cc: Chengming Zhou <chengming.zhou@linux.dev>
>>> Cc: Yosry Ahmed <yosryahmed@google.com>
>>> Cc: Nhat Pham <nphamcs@gmail.com>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>>> Cc: Shakeel Butt <shakeel.butt@linux.dev>
>>> Cc: Andi Kleen <ak@linux.intel.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Chris Li <chrisl@kernel.org>
>>> Cc: "Huang, Ying" <ying.huang@intel.com>
>>> Cc: Kairui Song <kasong@tencent.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>  -v2:
>>>  * add separate counters rather than using pswpin/out; thanks
>>>  for the comments from Usama, David, Yosry and Nhat;
>>>  * Usama also suggested a new counter like swapped_zero, I
>>>  prefer that one be separated as an enhancement patch not
>>>  a hotfix. will probably handle it later on.
>>>
>> I dont think either of them would be a hotfix.
> 
> As mentioned above, this isn't about fixing a bug; it's simply to ensure
> that swap-related metrics don't disappear.
> 
>>
>>>  Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
>>>  include/linux/vm_event_item.h           |  2 ++
>>>  mm/memcontrol.c                         |  4 ++++
>>>  mm/page_io.c                            | 16 ++++++++++++++++
>>>  mm/vmstat.c                             |  2 ++
>>>  5 files changed, 34 insertions(+)
>>>
>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>> index db3799f1483e..984eb3c9d05b 100644
>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined.
>>>         pglazyfreed (npn)
>>>               Amount of reclaimed lazyfree pages
>>>
>>> +       swpin_zero
>>> +             Number of pages moved into memory with zero content, meaning no
>>> +             copy exists in the backend swapfile, allowing swap-in to avoid
>>> +             I/O read overhead.
>>> +
>>> +       swpout_zero
>>> +             Number of pages moved out of memory with zero content, meaning no
>>> +             copy is needed in the backend swapfile, allowing swap-out to avoid
>>> +             I/O write overhead.
>>> +
>>
>> Maybe zero-filled pages might be a better term in both.
> 
> Do you mean dropping "with zero content" and replacing it by
> Number of zero-filled pages moved out of memory ? I'm fine
> with the change.

Yes, mainly because if you swap out memory that was memset to 0, it's still
content, just zero-filled.

Thanks,
Usama
> 
>>
>>>         zswpin
>>>               Number of pages moved in to memory from zswap.
>>>
>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>> index aed952d04132..f70d0958095c 100644
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -134,6 +134,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>  #ifdef CONFIG_SWAP
>>>               SWAP_RA,
>>>               SWAP_RA_HIT,
>>> +             SWPIN_ZERO,
>>> +             SWPOUT_ZERO,
>>>  #ifdef CONFIG_KSM
>>>               KSM_SWPIN_COPY,
>>>  #endif
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 5e44d6e7591e..7b3503d12aaf 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -441,6 +441,10 @@ static const unsigned int memcg_vm_event_stat[] = {
>>>       PGDEACTIVATE,
>>>       PGLAZYFREE,
>>>       PGLAZYFREED,
>>> +#ifdef CONFIG_SWAP
>>> +     SWPIN_ZERO,
>>> +     SWPOUT_ZERO,
>>> +#endif
>>>  #ifdef CONFIG_ZSWAP
>>>       ZSWPIN,
>>>       ZSWPOUT,
>>> diff --git a/mm/page_io.c b/mm/page_io.c
>>> index 5d9b6e6cf96c..4b4ea8e49cf6 100644
>>> --- a/mm/page_io.c
>>> +++ b/mm/page_io.c
>>> @@ -204,7 +204,9 @@ static bool is_folio_zero_filled(struct folio *folio)
>>>
>>>  static void swap_zeromap_folio_set(struct folio *folio)
>>>  {
>>> +     struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
>>>       struct swap_info_struct *sis = swp_swap_info(folio->swap);
>>> +     int nr_pages = folio_nr_pages(folio);
>>>       swp_entry_t entry;
>>>       unsigned int i;
>>>
>>> @@ -212,6 +214,12 @@ static void swap_zeromap_folio_set(struct folio *folio)
>>>               entry = page_swap_entry(folio_page(folio, i));
>>>               set_bit(swp_offset(entry), sis->zeromap);
>>>       }
>>> +
>>> +     count_vm_events(SWPOUT_ZERO, nr_pages);
>>> +     if (objcg) {
>>> +             count_objcg_events(objcg, SWPOUT_ZERO, nr_pages);
>>> +             obj_cgroup_put(objcg);
>>> +     }
>>>  }
>>>
>>>  static void swap_zeromap_folio_clear(struct folio *folio)
>>> @@ -507,6 +515,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
>>>  static bool swap_read_folio_zeromap(struct folio *folio)
>>>  {
>>>       int nr_pages = folio_nr_pages(folio);
>>> +     struct obj_cgroup *objcg;
>>>       bool is_zeromap;
>>>
>>>       /*
>>> @@ -521,6 +530,13 @@ static bool swap_read_folio_zeromap(struct folio *folio)
>>>       if (!is_zeromap)
>>>               return false;
>>>
>>> +     objcg = get_obj_cgroup_from_folio(folio);
>>> +     count_vm_events(SWPIN_ZERO, nr_pages);
>>> +     if (objcg) {
>>> +             count_objcg_events(objcg, SWPIN_ZERO, nr_pages);
>>> +             obj_cgroup_put(objcg);
>>> +     }
>>> +
>>>       folio_zero_range(folio, 0, folio_size(folio));
>>>       folio_mark_uptodate(folio);
>>>       return true;
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 22a294556b58..c8ef7352f9ed 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -1418,6 +1418,8 @@ const char * const vmstat_text[] = {
>>>  #ifdef CONFIG_SWAP
>>>       "swap_ra",
>>>       "swap_ra_hit",
>>> +     "swpin_zero",
>>> +     "swpout_zero",
>>>  #ifdef CONFIG_KSM
>>>       "ksm_swpin_copy",
>>>  #endif
>>
> 
> Thanks
> Barry
David Hildenbrand Nov. 4, 2024, 12:32 p.m. UTC | #4
On 02.11.24 13:59, Barry Song wrote:
> On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 02/11/2024 10:12, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> When the proportion of folios from the zero map is small, missing their
>>> accounting may not significantly impact profiling. However, it’s easy
>>> to construct a scenario where this becomes an issue—for example,
>>> allocating 1 GB of memory, writing zeros from userspace, followed by
>>> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
>>> and swap-in counts seem to vanish into a black hole, potentially
>>> causing semantic ambiguity.
>>>
>>> We have two ways to address this:
>>>
>>> 1. Add a separate counter specifically for the zero map.
>>> 2. Continue using the current accounting, treating the zero map like
>>> a normal backend. (This aligns with the current behavior of zRAM
>>> when supporting same-page fills at the device level.)
>>>
>>> This patch adopts option 1 as pswpin/pswpout counters are that they
>>> only apply to IO done directly to the backend device (as noted by
>>> Nhat Pham).
>>>
>>> We can find these counters from /proc/vmstat (counters for the whole
>>> system) and memcg's memory.stat (counters for the interested memcg).
>>>
>>> For example:
>>>
>>> $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
>>> swpin_zero 1648
>>> swpout_zero 33536
>>>
>>> $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
>>> swpin_zero 3905
>>> swpout_zero 3985
>>>
>>> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
>> I don't think its a hotfix (or even a fix). It was discussed in the initial
>> series to add these as a follow up and Joshua was going to do this soon.
>> Its not fixing any bug in the initial series.
> 
> I would prefer that all kernel versions with zeromap include this
> counter; otherwise,
> it could be confusing to determine where swap-in and swap-out have occurred,
> as shown by the small program below:
> 
> p =malloc(1g);
> write p to zero
> madvise_pageout
> read p;
> 
> Previously, there was 1GB of swap-in and swap-out activity reported, but
> now nothing is shown.
> 
> I don't mean to suggest that there's a bug in the zeromap code; rather,
> having this counter would help clear up any confusion.
> 
> I didn't realize Joshua was handling it. Is he still planning to? If
> so, I can leave it
> with Joshua if that was the plan :-)
> 
>>
>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>> Cc: Chengming Zhou <chengming.zhou@linux.dev>
>>> Cc: Yosry Ahmed <yosryahmed@google.com>
>>> Cc: Nhat Pham <nphamcs@gmail.com>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>>> Cc: Shakeel Butt <shakeel.butt@linux.dev>
>>> Cc: Andi Kleen <ak@linux.intel.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Chris Li <chrisl@kernel.org>
>>> Cc: "Huang, Ying" <ying.huang@intel.com>
>>> Cc: Kairui Song <kasong@tencent.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>   -v2:
>>>   * add separate counters rather than using pswpin/out; thanks
>>>   for the comments from Usama, David, Yosry and Nhat;
>>>   * Usama also suggested a new counter like swapped_zero, I
>>>   prefer that one be separated as an enhancement patch not
>>>   a hotfix. will probably handle it later on.
>>>
>> I dont think either of them would be a hotfix.
> 
> As mentioned above, this isn't about fixing a bug; it's simply to ensure
> that swap-related metrics don't disappear.

Documentation/process/submitting-patches.rst:

"A Fixes: tag indicates that the patch fixes an issue in a previous 
commit. It is used to make it easy to determine where a bug originated, 
which can help review a bug fix."

If there is no BUG, I'm afraid you are abusing that tag.

Also, I don't really understand the problem? We added an optimization, 
great. Who will be complaining about that?

Above you write "it could be confusing to determine where swap-in and 
swap-out have occurred" -- when is that confusion supposed to happen in 
practice? How will the confused individuals know that they must take a 
look at that new metric, even if it is in place?

I think we should just add the new stats and call it a day.
David Hildenbrand Nov. 4, 2024, 12:42 p.m. UTC | #5
On 02.11.24 11:12, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> When the proportion of folios from the zero map is small, missing their
> accounting may not significantly impact profiling. However, it’s easy
> to construct a scenario where this becomes an issue—for example,
> allocating 1 GB of memory, writing zeros from userspace, followed by
> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
> and swap-in counts seem to vanish into a black hole, potentially
> causing semantic ambiguity.
> 
> We have two ways to address this:
> 
> 1. Add a separate counter specifically for the zero map.
> 2. Continue using the current accounting, treating the zero map like
> a normal backend. (This aligns with the current behavior of zRAM
> when supporting same-page fills at the device level.)
> 
> This patch adopts option 1 as pswpin/pswpout counters are that they
> only apply to IO done directly to the backend device (as noted by
> Nhat Pham).
> 
> We can find these counters from /proc/vmstat (counters for the whole
> system) and memcg's memory.stat (counters for the interested memcg).
> 
> For example:
> 
> $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
> swpin_zero 1648
> swpout_zero 33536
> 
> $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
> swpin_zero 3905
> swpout_zero 3985
> 
> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: Chengming Zhou <chengming.zhou@linux.dev>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Chris Li <chrisl@kernel.org>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Kairui Song <kasong@tencent.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   -v2:
>   * add separate counters rather than using pswpin/out; thanks
>   for the comments from Usama, David, Yosry and Nhat;
>   * Usama also suggested a new counter like swapped_zero, I
>   prefer that one be separated as an enhancement patch not
>   a hotfix. will probably handle it later on.
> 
>   Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
>   include/linux/vm_event_item.h           |  2 ++
>   mm/memcontrol.c                         |  4 ++++
>   mm/page_io.c                            | 16 ++++++++++++++++
>   mm/vmstat.c                             |  2 ++
>   5 files changed, 34 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index db3799f1483e..984eb3c9d05b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1599,6 +1599,16 @@ The following nested keys are defined.
>   	  pglazyfreed (npn)
>   		Amount of reclaimed lazyfree pages
>   
> +	  swpin_zero
> +		Number of pages moved into memory with zero content, meaning no
> +		copy exists in the backend swapfile, allowing swap-in to avoid
> +		I/O read overhead.
> +
> +	  swpout_zero
> +		Number of pages moved out of memory with zero content, meaning no
> +		copy is needed in the backend swapfile, allowing swap-out to avoid
> +		I/O write overhead.

Hm, can make it a bit clearer that this is a pure optimization and refer 
to the other counters?

swpin_zero
	Portion of "pswpin" pages for which I/O was optimized out
	because the page content was detected to be zero during swapout.

swpout_zero
	Portion of "pswpout" pages for which I/O was optimized out
	because the page content was detected to be zero.
Joshua Hahn Nov. 4, 2024, 4:24 p.m. UTC | #6
On 02/11/2024 14:43:07, Usama Arif <usamaarif642@gmail.com> wrote:

> On 02/11/2024 12:59, Barry Song wrote:
>> On Sat, Nov 2, 2024 at 8:32 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>
>>> On 02/11/2024 10:12, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> When the proportion of folios from the zero map is small, missing their
>>>> accounting may not significantly impact profiling. However, it’s easy
>>>> to construct a scenario where this becomes an issue—for example,
>>>> allocating 1 GB of memory, writing zeros from userspace, followed by
>>>> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
>>>> and swap-in counts seem to vanish into a black hole, potentially
>>>> causing semantic ambiguity.
>>>>
>>>> This patch adopts option 1 as pswpin/pswpout counters are that they
>>>> only apply to IO done directly to the backend device (as noted by
>>>> Nhat Pham).
>>>>
>>>> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
>>> I don't think its a hotfix (or even a fix). It was discussed in the initial
>>> series to add these as a follow up and Joshua was going to do this soon.
>>> Its not fixing any bug in the initial series.
>>
>> I didn't realize Joshua was handling it. Is he still planning to? If
>> so, I can leave it
>> with Joshua if that was the plan :-)
>>
>
> Please do continue with this patch, I think he was going to look at the
> swapped_zero version that we discussed earlier anyways. Will let Joshua
> comment on it.

Hi Usama and Barry,

First of all, I am sorry for not participating in the previous conversation
about this; it is my fault for the lack of communication on my end about the
status of zero_swapped (name pending). Sorry for the confusion!

I am hoping to pick this up in a few days (I have been working on a few
patches in different subsystems, but I will wrap up the work on these
very soon).

As far as I can tell, zero_swapped and swp{in,out}_zero seem to be
orthogonal, and as Nhat pointed out [1], I think there is a need for both.

Thank you for this patch Barry, and thank you Usama for keeping me in
the loop! Have a great day!
Joshua

[1] https://lore.kernel.org/linux-mm/882008b6-13e0-41d8-91fa-f26c585120d8@gmail.com/T/#m7c5017bc97d56843242d3e006cd7e1f0fd0f1a38
Johannes Weiner Nov. 4, 2024, 4:34 p.m. UTC | #7
On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote:
> On 02.11.24 11:12, Barry Song wrote:
> > @@ -1599,6 +1599,16 @@ The following nested keys are defined.
> >   	  pglazyfreed (npn)
> >   		Amount of reclaimed lazyfree pages
> >   
> > +	  swpin_zero
> > +		Number of pages moved into memory with zero content, meaning no
> > +		copy exists in the backend swapfile, allowing swap-in to avoid
> > +		I/O read overhead.
> > +
> > +	  swpout_zero
> > +		Number of pages moved out of memory with zero content, meaning no
> > +		copy is needed in the backend swapfile, allowing swap-out to avoid
> > +		I/O write overhead.
> 
> Hm, can make it a bit clearer that this is a pure optimization and refer 
> to the other counters?
> 
> swpin_zero
> 	Portion of "pswpin" pages for which I/O was optimized out
> 	because the page content was detected to be zero during swapout.

AFAICS the zeropages currently don't show up in pswpin/pswpout, so
these are independent counters, not subsets.

I'm leaning towards Barry's side on the fixes tag. When zswap handled
the same-filled pages, we would count them in zswpin/out. From a user
POV, especially one using zswap, the behavior didn't change, but the
counts giving insight into this (potentially significant) VM activity
disappeared. This is arguably a regression.

> swpout_zero
> 	Portion of "pswout" pages for which I/O was optimized out
> 	because the page content was detected to be zero.

Are we sure we want to commit to the "zero" in the name here? Until
very recently, zswap optimized all same-filled pages. It's possible
somebody might want to bring that back down the line.

In reference to the above, I'd actually prefer putting them back into
zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this
is arguably just an implementation detail; from a user POV this is
still just (a form of) compression in lieu of IO to the swap backend.

IMO there is no need for coming up with a separate category. Just add
them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them?
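
A rough sketch of that alternative, assuming ZSWPIN/ZSWPOUT were made
available without CONFIG_ZSWAP, would have swap_zeromap_folio_set() bump
the existing zswap events instead of introducing new ones:

          count_vm_events(ZSWPOUT, nr_pages);
          if (objcg) {
                  count_objcg_events(objcg, ZSWPOUT, nr_pages);
                  obj_cgroup_put(objcg);
          }

with swap_read_folio_zeromap() counting ZSWPIN the same way.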
David Hildenbrand Nov. 4, 2024, 5:10 p.m. UTC | #8
On 04.11.24 17:34, Johannes Weiner wrote:
> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote:
>> On 02.11.24 11:12, Barry Song wrote:
>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined.
>>>    	  pglazyfreed (npn)
>>>    		Amount of reclaimed lazyfree pages
>>>    
>>> +	  swpin_zero
>>> +		Number of pages moved into memory with zero content, meaning no
>>> +		copy exists in the backend swapfile, allowing swap-in to avoid
>>> +		I/O read overhead.
>>> +
>>> +	  swpout_zero
>>> +		Number of pages moved out of memory with zero content, meaning no
>>> +		copy is needed in the backend swapfile, allowing swap-out to avoid
>>> +		I/O write overhead.
>>
>> Hm, can make it a bit clearer that this is a pure optimization and refer
>> to the other counters?
>>
>> swpin_zero
>> 	Portion of "pswpin" pages for which I/O was optimized out
>> 	because the page content was detected to be zero during swapout.
> 
> AFAICS the zeropages currently don't show up in pswpin/pswpout, so
> these are independent counters, not subsets.

Ah, now I understand the problem. The whole "move out of memory" "move 
into memory" here is quite confusing TBH. We're not moving anything, 
we're optimizing out the move completely ... yes, you could call it 
compression (below).

> 
> I'm leaning towards Barry's side on the fixes tag.

I think the documentation on when to use the Fixes: tag is pretty clear.

Introducing new counters can hardly be considered a bugfix. Failing to
adjust counters that *existing tools* already know and use might be,
IMO (below).

>  When zswap handled
> the same-filled pages, we would count them in zswpin/out. From a user
> POV, especially one using zswap, the behavior didn't change, but the
> counts giving insight into this (potentially significant) VM activity
> disappeared. This is arguably a regression.
 > >> swpout_zero
>> 	Portion of "pswout" pages for which I/O was optimized out
>> 	because the page content was detected to be zero.
> 
> Are we sure we want to commit to the "zero" in the name here? Until
> very recently, zswap optimized all same-filled pages. It's possible
> somebody might want to bring that back down the line.

Agreed.

> 
> In reference to the above, I'd actually prefer putting them back into
> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this
> is arguably just an implementation detail; from a user POV this is
> still just (a form of) compression in lieu of IO to the swap backend.
> 
> IMO there is no need for coming up with a separate category. Just add
> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them?

Would work for me, although it might be a bit confusing if no zswap is 
configured. Maybe just needs to be documented.
Usama Arif Nov. 4, 2024, 6:48 p.m. UTC | #9
On 04/11/2024 17:10, David Hildenbrand wrote:
> On 04.11.24 17:34, Johannes Weiner wrote:
>> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote:
>>> On 02.11.24 11:12, Barry Song wrote:
>>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined.
>>>>          pglazyfreed (npn)
>>>>            Amount of reclaimed lazyfree pages
>>>>    +      swpin_zero
>>>> +        Number of pages moved into memory with zero content, meaning no
>>>> +        copy exists in the backend swapfile, allowing swap-in to avoid
>>>> +        I/O read overhead.
>>>> +
>>>> +      swpout_zero
>>>> +        Number of pages moved out of memory with zero content, meaning no
>>>> +        copy is needed in the backend swapfile, allowing swap-out to avoid
>>>> +        I/O write overhead.
>>>
>>> Hm, can make it a bit clearer that this is a pure optimization and refer
>>> to the other counters?
>>>
>>> swpin_zero
>>>     Portion of "pswpin" pages for which I/O was optimized out
>>>     because the page content was detected to be zero during swapout.
>>
>> AFAICS the zeropages currently don't show up in pswpin/pswpout, so
>> these are independent counters, not subsets.
> 
> Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below).
> 
>>
>> I'm leaning towards Barry's side on the fixes tag.
> 
> I think the documentation when to use the Fixes: tag is pretty clear.
> 
> Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be  IMO (below).
> 
>>  When zswap handled
>> the same-filled pages, we would count them in zswpin/out. From a user
>> POV, especially one using zswap, the behavior didn't change, but the
>> counts giving insight into this (potentially significant) VM activity
>> disappeared. This is arguably a regression.
>> >> swpout_zero
>>>     Portion of "pswout" pages for which I/O was optimized out
>>>     because the page content was detected to be zero.
>>
>> Are we sure we want to commit to the "zero" in the name here? Until
>> very recently, zswap optimized all same-filled pages. It's possible
>> somebody might want to bring that back down the line.
> 
> Agreed.
> 
>>
>> In reference to the above, I'd actually prefer putting them back into
>> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this
>> is arguably just an implementation detail; from a user POV this is
>> still just (a form of) compression in lieu of IO to the swap backend.
>>
>> IMO there is no need for coming up with a separate category. Just add
>> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them?
> 

Hmm, I actually don't like the idea of using zswpin/zswpout. Isn't it a
bit confusing if zswap is disabled but zswap counters are incrementing?

Also, it means that when zswap is enabled, you won't be able to distinguish
between zswap and zeropage optimization.

> Would work for me, although it might be a bit confusing if no zswap is configured. Maybe just needs to be documented.
>
David Hildenbrand Nov. 4, 2024, 8:56 p.m. UTC | #10
On 04.11.24 19:48, Usama Arif wrote:
> 
> 
> On 04/11/2024 17:10, David Hildenbrand wrote:
>> On 04.11.24 17:34, Johannes Weiner wrote:
>>> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote:
>>>> On 02.11.24 11:12, Barry Song wrote:
>>>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined.
>>>>>           pglazyfreed (npn)
>>>>>             Amount of reclaimed lazyfree pages
>>>>>     +      swpin_zero
>>>>> +        Number of pages moved into memory with zero content, meaning no
>>>>> +        copy exists in the backend swapfile, allowing swap-in to avoid
>>>>> +        I/O read overhead.
>>>>> +
>>>>> +      swpout_zero
>>>>> +        Number of pages moved out of memory with zero content, meaning no
>>>>> +        copy is needed in the backend swapfile, allowing swap-out to avoid
>>>>> +        I/O write overhead.
>>>>
>>>> Hm, can make it a bit clearer that this is a pure optimization and refer
>>>> to the other counters?
>>>>
>>>> swpin_zero
>>>>      Portion of "pswpin" pages for which I/O was optimized out
>>>>      because the page content was detected to be zero during swapout.
>>>
>>> AFAICS the zeropages currently don't show up in pswpin/pswpout, so
>>> these are independent counters, not subsets.
>>
>> Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below).
>>
>>>
>>> I'm leaning towards Barry's side on the fixes tag.
>>
>> I think the documentation when to use the Fixes: tag is pretty clear.
>>
>> Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be  IMO (below).
>>
>>>   When zswap handled
>>> the same-filled pages, we would count them in zswpin/out. From a user
>>> POV, especially one using zswap, the behavior didn't change, but the
>>> counts giving insight into this (potentially significant) VM activity
>>> disappeared. This is arguably a regression.
>>>>> swpout_zero
>>>>      Portion of "pswout" pages for which I/O was optimized out
>>>>      because the page content was detected to be zero.
>>>
>>> Are we sure we want to commit to the "zero" in the name here? Until
>>> very recently, zswap optimized all same-filled pages. It's possible
>>> somebody might want to bring that back down the line.
>>
>> Agreed.
>>
>>>
>>> In reference to the above, I'd actually prefer putting them back into
>>> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this
>>> is arguably just an implementation detail; from a user POV this is
>>> still just (a form of) compression in lieu of IO to the swap backend.
>>>
>>> IMO there is no need for coming up with a separate category. Just add
>>> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them?
>>
> 
> hmm, I actually don't like the idea of using zswpin/zswpout. Its a
> bit confusing if zswap is disabled and zswap counters are incrementing?
> 
> Also, it means that when zswap is enabled, you won't be able to distinguish
> between zswap and zeropage optimization.

Does it matter? Because in the past the same would have happened, no 
(back when this was done in zswap code)?
Usama Arif Nov. 4, 2024, 9:24 p.m. UTC | #11
On 04/11/2024 20:56, David Hildenbrand wrote:
> On 04.11.24 19:48, Usama Arif wrote:
>>
>>
>> On 04/11/2024 17:10, David Hildenbrand wrote:
>>> On 04.11.24 17:34, Johannes Weiner wrote:
>>>> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote:
>>>>> On 02.11.24 11:12, Barry Song wrote:
>>>>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined.
>>>>>>           pglazyfreed (npn)
>>>>>>             Amount of reclaimed lazyfree pages
>>>>>>     +      swpin_zero
>>>>>> +        Number of pages moved into memory with zero content, meaning no
>>>>>> +        copy exists in the backend swapfile, allowing swap-in to avoid
>>>>>> +        I/O read overhead.
>>>>>> +
>>>>>> +      swpout_zero
>>>>>> +        Number of pages moved out of memory with zero content, meaning no
>>>>>> +        copy is needed in the backend swapfile, allowing swap-out to avoid
>>>>>> +        I/O write overhead.
>>>>>
>>>>> Hm, can make it a bit clearer that this is a pure optimization and refer
>>>>> to the other counters?
>>>>>
>>>>> swpin_zero
>>>>>      Portion of "pswpin" pages for which I/O was optimized out
>>>>>      because the page content was detected to be zero during swapout.
>>>>
>>>> AFAICS the zeropages currently don't show up in pswpin/pswpout, so
>>>> these are independent counters, not subsets.
>>>
>>> Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below).
>>>
>>>>
>>>> I'm leaning towards Barry's side on the fixes tag.
>>>
>>> I think the documentation when to use the Fixes: tag is pretty clear.
>>>
>>> Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be  IMO (below).
>>>
>>>>   When zswap handled
>>>> the same-filled pages, we would count them in zswpin/out. From a user
>>>> POV, especially one using zswap, the behavior didn't change, but the
>>>> counts giving insight into this (potentially significant) VM activity
>>>> disappeared. This is arguably a regression.
>>>>>> swpout_zero
>>>>>      Portion of "pswout" pages for which I/O was optimized out
>>>>>      because the page content was detected to be zero.
>>>>
>>>> Are we sure we want to commit to the "zero" in the name here? Until
>>>> very recently, zswap optimized all same-filled pages. It's possible
>>>> somebody might want to bring that back down the line.
>>>
>>> Agreed.
>>>
>>>>
>>>> In reference to the above, I'd actually prefer putting them back into
>>>> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this
>>>> is arguably just an implementation detail; from a user POV this is
>>>> still just (a form of) compression in lieu of IO to the swap backend.
>>>>
>>>> IMO there is no need for coming up with a separate category. Just add
>>>> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them?
>>>
>>
>> hmm, I actually don't like the idea of using zswpin/zswpout. Its a
>> bit confusing if zswap is disabled and zswap counters are incrementing?
>>
>> Also, it means that when zswap is enabled, you won't be able to distinguish
>> between zswap and zeropage optimization.
> 
> Does it matter? Because in the past the same would have happened, no (back when this was done in zswap code)?
> 

When it was in the zswap code, there was a zswap_same_filled_pages stat as well
to see how many zero-filled pages were part of zswap. (Not the same as a
counter, but you could still get a good idea of same-filled page usage.)

The other thing is that it affects zram as well.

Maybe we could have a hybrid approach?
I.e. have the zswpin/zswpout counters incremented for zero-filled pages as suggested,
but then also have a zero_swapped stat that tells how much of the zeromap is
currently set (similar to zswapped).
Barry Song Nov. 5, 2024, 1:28 a.m. UTC | #12
On Tue, Nov 5, 2024 at 10:24 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 04/11/2024 20:56, David Hildenbrand wrote:
> > On 04.11.24 19:48, Usama Arif wrote:
> >>
> >>
> >> On 04/11/2024 17:10, David Hildenbrand wrote:
> >>> On 04.11.24 17:34, Johannes Weiner wrote:
> >>>> On Mon, Nov 04, 2024 at 01:42:08PM +0100, David Hildenbrand wrote:
> >>>>> On 02.11.24 11:12, Barry Song wrote:
> >>>>>> @@ -1599,6 +1599,16 @@ The following nested keys are defined.
> >>>>>>           pglazyfreed (npn)
> >>>>>>             Amount of reclaimed lazyfree pages
> >>>>>>     +      swpin_zero
> >>>>>> +        Number of pages moved into memory with zero content, meaning no
> >>>>>> +        copy exists in the backend swapfile, allowing swap-in to avoid
> >>>>>> +        I/O read overhead.
> >>>>>> +
> >>>>>> +      swpout_zero
> >>>>>> +        Number of pages moved out of memory with zero content, meaning no
> >>>>>> +        copy is needed in the backend swapfile, allowing swap-out to avoid
> >>>>>> +        I/O write overhead.
> >>>>>
> >>>>> Hm, can make it a bit clearer that this is a pure optimization and refer
> >>>>> to the other counters?
> >>>>>
> >>>>> swpin_zero
> >>>>>      Portion of "pswpin" pages for which I/O was optimized out
> >>>>>      because the page content was detected to be zero during swapout.
> >>>>
> >>>> AFAICS the zeropages currently don't show up in pswpin/pswpout, so
> >>>> these are independent counters, not subsets.
> >>>
> >>> Ah. now I understand the problem. The whole "move out of memory" "move into memory" here is quite confusing TBH. We're not moving anything, we're optimizing out the move completely ... yes, you could call it compression (below).
> >>>
> >>>>
> >>>> I'm leaning towards Barry's side on the fixes tag.
> >>>
> >>> I think the documentation when to use the Fixes: tag is pretty clear.
> >>>
> >>> Introducing new counters can hardly be considered a bugfix. Missing to adjust some counters that *existing tools* would know/use might be  IMO (below).
> >>>
> >>>>   When zswap handled
> >>>> the same-filled pages, we would count them in zswpin/out. From a user
> >>>> POV, especially one using zswap, the behavior didn't change, but the
> >>>> counts giving insight into this (potentially significant) VM activity
> >>>> disappeared. This is arguably a regression.
> >>>>>> swpout_zero
> >>>>>      Portion of "pswout" pages for which I/O was optimized out
> >>>>>      because the page content was detected to be zero.
> >>>>
> >>>> Are we sure we want to commit to the "zero" in the name here? Until
> >>>> very recently, zswap optimized all same-filled pages. It's possible
> >>>> somebody might want to bring that back down the line.
> >>>
> >>> Agreed.
> >>>
> >>>>
> >>>> In reference to the above, I'd actually prefer putting them back into
> >>>> zswpin/zswpout. Sure, they're not handled by zswap.c proper, but this
> >>>> is arguably just an implementation detail; from a user POV this is
> >>>> still just (a form of) compression in lieu of IO to the swap backend.
> >>>>
> >>>> IMO there is no need for coming up with a separate category. Just add
> >>>> them to zswpin/zswpout and remove the CONFIG_ZSWAP guards from them?
> >>>
> >>
> >> hmm, I actually don't like the idea of using zswpin/zswpout. Its a
> >> bit confusing if zswap is disabled and zswap counters are incrementing?
> >>
> >> Also, it means that when zswap is enabled, you won't be able to distinguish
> >> between zswap and zeropage optimization.
> >
> > Does it matter? Because in the past the same would have happened, no (back when this was done in zswap code)?
> >
>
> When it was in zswap code, there was zswap_same_filled_pages stat as well to see
> how many zero-filled pages were part of zswap. (Not the same as counter, but you
> could still get a good idea about same filled page usage).
>
> The other thing is it affects zram as well..
>
> Maybe We could have a hybrid approach?
> i.e. have the zswpin/zswpout counter incremented at zero filled pages as suggested,
> but then also have a zero_swapped stat that tells how much of the zeromap is
> currently set (similar to zswapped).

I still think we should keep zswap and zeromap separate. On a system
without zswap, zero-page swap-in and swap-out are included in pswpin and
pswpout counts.

Although zram has same_page_filled, it's still treated as a block device
after the swap layer.

Thanks
Barry
Andrew Morton Nov. 5, 2024, 3:40 a.m. UTC | #13
On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote:

> > As mentioned above, this isn't about fixing a bug; it's simply to ensure
> > that swap-related metrics don't disappear.
> 
> Documentation/process/submitting-patches.rst:
> 
> "A Fixes: tag indicates that the patch fixes an issue in a previous 
> commit. It is used to make it easy to determine where a bug originated, 
> which can help review a bug fix."
> 
> If there is no BUG, I'm afraid you are abusing that tag.

I think the abuse is reasonable.  We have no Should-be-included-with:.

0ca0c24e3211 is only in 6.12-rcX so this is the time to make
userspace-visible tweaks, so the 6.12 interfaces are the same as the
6.13+ interfaces (which is what I think is happening here?)

And including the Fixes in this patch might be useful to someone who is
backporting 0ca0c24e3211 into some earlier kernel for their own
purposes.
David Hildenbrand Nov. 5, 2024, 8:23 a.m. UTC | #14
On 05.11.24 04:40, Andrew Morton wrote:
> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote:
> 
>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure
>>> that swap-related metrics don't disappear.
>>
>> Documentation/process/submitting-patches.rst:
>>
>> "A Fixes: tag indicates that the patch fixes an issue in a previous
>> commit. It is used to make it easy to determine where a bug originated,
>> which can help review a bug fix."
>>
>> If there is no BUG, I'm afraid you are abusing that tag.
> 
> I think the abuse is reasonable.  We have no Should-be-included-with:.

A "Belongs-to:" might make sense, for this kind of stuff that is still 
only in an RFC. Or we update the doc to explicitly spell out this 
special case of using "Fixes" to sort out something into the RC.

Because if this would be already in a released kernel, it would get a 
bit trickier: stable rules explicitly spell out "fix a real bug".

> 
> 0ca0c24e3211 is only in 6.12-rcX so this is the time to make
> userspace-visible tweaks, so the 6.12 interfaces are the same as the
> 6.13+ interfaces (which is what I think is happening here?)
 > > And including the Fixes in this patch might be useful to someone who is
> backporting 0ca0c24e3211 into some earlier kernel for their own
> purposes.

Just to be clear: adding new counters would hardly be fixing existing 
tools that perform calculations based on existing counters. So we are 
already changing the "userspace-visible" portion in some way, and I have 
no idea what in vmstat we consider "stable".

But I still don't think it's all that big of a deal except in some 
handcrafted scenarios hardly anybody cares about; the cover letter is 
also pretty clear on that.

So I'll shut up now and let people figure out the naming first, and if a 
new counter is required at all :)
David Hildenbrand Nov. 5, 2024, 8:24 a.m. UTC | #15
On 05.11.24 09:23, David Hildenbrand wrote:
> On 05.11.24 04:40, Andrew Morton wrote:
>> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote:
>>
>>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure
>>>> that swap-related metrics don't disappear.
>>>
>>> Documentation/process/submitting-patches.rst:
>>>
>>> "A Fixes: tag indicates that the patch fixes an issue in a previous
>>> commit. It is used to make it easy to determine where a bug originated,
>>> which can help review a bug fix."
>>>
>>> If there is no BUG, I'm afraid you are abusing that tag.
>>
>> I think the abuse is reasonable.  We have no Should-be-included-with:.
> 
> A "Belongs-to:" might make sense, for this kind of stuff that is still
> only in an RFC.

s/RFC/RC/
Barry Song Nov. 5, 2024, 9:15 a.m. UTC | #16
On Tue, Nov 5, 2024 at 9:23 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.11.24 04:40, Andrew Morton wrote:
> > On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote:
> >
> >>> As mentioned above, this isn't about fixing a bug; it's simply to ensure
> >>> that swap-related metrics don't disappear.
> >>
> >> Documentation/process/submitting-patches.rst:
> >>
> >> "A Fixes: tag indicates that the patch fixes an issue in a previous
> >> commit. It is used to make it easy to determine where a bug originated,
> >> which can help review a bug fix."
> >>
> >> If there is no BUG, I'm afraid you are abusing that tag.
> >
> > I think the abuse is reasonable.  We have no Should-be-included-with:.
>
> A "Belongs-to:" might make sense, for this kind of stuff that is still
> only in an RFC. Or we update the doc to explicitly spell out this
> special case of using "Fixes" to sort out something into the RC.
>
> Because if this would be already in a released kernel, it would get a
> bit trickier: stable rules explicitly spell out "fix a real bug".
>
> >
> > 0ca0c24e3211 is only in 6.12-rcX so this is the time to make
> > userspace-visible tweaks, so the 6.12 interfaces are the same as the
> > 6.13+ interfaces (which is what I think is happening here?)
>  > > And including the Fixes in this patch might be useful to someone who is
> > backporting 0ca0c24e3211 into some earlier kernel for their own
> > purposes.
>
> Just to be clear: adding new counters would hardly be fixing existing
> tools that perform calculations based on existing counters. So we are
> already changing the "userspace-visible" portion in some way, and I have
> no idea what in vmstat we consider "stable".
>
> But I still don't think it's all that big of a deal except in some
> handcrafted scenarios hardly anybody cares about; the cover letter is
> also pretty clear on that.

I may have been mistaken in the cover letter. According to the zswap data
Usama provided for servers, zero-filled pages accounted for about 1%, so it
really doesn't matter too much. But I just checked with Hailong from our
team; he has data on same-page-filled usage in Android apps, which on
average shows 3-4% same-page-filled pages, with over 85% of those being
zero-filled. Some apps even reach 6-7% zero-filled pages. We previously
used these counters to profile optimizations, but with zeromap now serving
as the frontend for swap files, those counters will disappear entirely from
both zRAM and pswpin/pswpout metrics, as folios are filtered earlier.

Hailong essentially has a table that looks like the one below, which could
be collected from the existing counters:

com.xxx.app    5%    same-page-filled    88% zero
com.yyy.app    6%    same-page-filled    85% zero
com.zzz.map    6.7%  same-page-filled    88% zero
....
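
Once swpout_zero lands, a share like the last column could be recomputed
straight from /proc/vmstat: as noted above, zeromap pages never show up in
pswpin/pswpout, so the two counters are disjoint and can simply be added.
A rough userspace sketch, not part of this patch, is below; the per-app
breakdown would need per-memcg counters and is left out here.

/*
 * Rough sketch, not part of the patch: zero-filled share of all swap-outs,
 * computed from /proc/vmstat.  Since zeromap pages are filtered before the
 * backend, pswpout and swpout_zero do not overlap and can simply be added.
 */
#include <stdio.h>
#include <string.h>

static long vmstat_read(const char *key)
{
	char name[64];
	long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", name, &val) == 2) {
		if (!strcmp(name, key)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	long pswpout = vmstat_read("pswpout");
	long zero = vmstat_read("swpout_zero");

	if (pswpout < 0 || zero < 0 || pswpout + zero == 0) {
		fprintf(stderr, "counters unavailable\n");
		return 1;
	}
	printf("zero-filled share of swap-outs: %.1f%%\n",
	       100.0 * zero / (pswpout + zero));
	return 0;
}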

If it doesn't make it into 6.12, anyone on 6.12 will be unable to track
zero-filled pages unless they backport this patch from a newer kernel
version.

Whether it's marked as 'Belongs-to:' or 'Fixes:', I'd prefer we aim to land
it in 6.12 :-)

>
> So I'll shut up now and let people figure out the naming first, and if a
> new counter is required at all :)
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
Usama Arif Nov. 5, 2024, 10:44 a.m. UTC | #17
On 05/11/2024 09:15, Barry Song wrote:
> On Tue, Nov 5, 2024 at 9:23 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 05.11.24 04:40, Andrew Morton wrote:
>>> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote:
>>>
>>>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure
>>>>> that swap-related metrics don't disappear.
>>>>
>>>> Documentation/process/submitting-patches.rst:
>>>>
>>>> "A Fixes: tag indicates that the patch fixes an issue in a previous
>>>> commit. It is used to make it easy to determine where a bug originated,
>>>> which can help review a bug fix."
>>>>
>>>> If there is no BUG, I'm afraid you are abusing that tag.
>>>
>>> I think the abuse is reasonable.  We have no Should-be-included-with:.
>>
>> A "Belongs-to:" might make sense, for this kind of stuff that is still
>> only in an RFC. Or we update the doc to explicitly spell out this
>> special case of using "Fixes" to sort out something into the RC.
>>
>> Because if this would be already in a released kernel, it would get a
>> bit trickier: stable rules explicitly spell out "fix a real bug".
>>
>>>
>>> 0ca0c24e3211 is only in 6.12-rcX so this is the time to make
>>> userspace-visible tweaks, so the 6.12 interfaces are the same as the
>>> 6.13+ interfaces (which is what I think is happening here?)
>>  > > And including the Fixes in this patch might be useful to someone who is
>>> backporting 0ca0c24e3211 into some earlier kernel for their own
>>> purposes.
>>
>> Just to be clear: adding new counters would hardly be fixing existing
>> tools that perform calculations based on existing counters. So we are
>> already changing the "userspace-visible" portion in some way, and I have
>> no idea what in vmstat we consider "stable".
>>
>> But I still don't think it's all that big of a deal except in some
>> handcrafted scenarios hardly anybody cares about; the cover letter is
>> also pretty clear on that.
> 
> I may have been mistaken in the cover letter. According to the zswap data Usama
> provided for servers, zero-filled pages accounted for about 1%.

10% not 1% (https://lore.kernel.org/all/20240612124750.2220726-1-usamaarif642@gmail.com/).

> So
> really doesn't
> matter too much, but I just checked with Hailong from our team—he has data
> on same-page-filled usage in Android apps, which on average show 3-4%
> same-page-filled, with over 85% being zero-filled. Some apps even reach
> 6-7% zero-filled pages. We previously used these counters to profile
> optimizations, but with zeromap now serving as the frontend for swap files,
> those counters will disappear entirely from both zRAM and pswpin/pswpout
> metrics, as folios are filtered earlier.
> 
This is what I meant in https://lore.kernel.org/all/79deed1a-9b0e-42e0-be2f-f0c3ef5fee11@gmail.com/
when I said it affects zram as well!

I am happy with the current version of the patch, just need the change
in Documentation/admin-guide/cgroup-v2.rst.

Thanks,
Usama

> Hailong essentially has a table that looks like the below which could be
> collected from the existing counters:
> 
> com.xxx.app     5% same-page-filled.    88% zero
> com.yyy.app     6% same-page-filled.     85% zero
> com.zzz.map   6.7 same-page-filled.       88% zero
> ....
> 
> Anyone on 6.12 will be unable to track zero-filled pages unless they
> backport this patch from a newer kernel version if it doesn’t make it
> into 6.12.
> 
> Whether it's marked as 'Belongs-to:' or 'Fixes:', I'd prefer we aim to
> land it in
> 6.12 :-)
> 
>>
>> So I'll shut up now and let people figure out the naming first, and if a
>> new counter is required at all :)
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> Thanks
> Barry
Barry Song Nov. 5, 2024, 10:57 a.m. UTC | #18
On Tue, Nov 5, 2024 at 11:44 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 05/11/2024 09:15, Barry Song wrote:
> > On Tue, Nov 5, 2024 at 9:23 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 05.11.24 04:40, Andrew Morton wrote:
> >>> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure
> >>>>> that swap-related metrics don't disappear.
> >>>>
> >>>> Documentation/process/submitting-patches.rst:
> >>>>
> >>>> "A Fixes: tag indicates that the patch fixes an issue in a previous
> >>>> commit. It is used to make it easy to determine where a bug originated,
> >>>> which can help review a bug fix."
> >>>>
> >>>> If there is no BUG, I'm afraid you are abusing that tag.
> >>>
> >>> I think the abuse is reasonable.  We have no Should-be-included-with:.
> >>
> >> A "Belongs-to:" might make sense, for this kind of stuff that is still
> >> only in an RFC. Or we update the doc to explicitly spell out this
> >> special case of using "Fixes" to sort out something into the RC.
> >>
> >> Because if this would be already in a released kernel, it would get a
> >> bit trickier: stable rules explicitly spell out "fix a real bug".
> >>
> >>>
> >>> 0ca0c24e3211 is only in 6.12-rcX so this is the time to make
> >>> userspace-visible tweaks, so the 6.12 interfaces are the same as the
> >>> 6.13+ interfaces (which is what I think is happening here?)
> >>  > > And including the Fixes in this patch might be useful to someone who is
> >>> backporting 0ca0c24e3211 into some earlier kernel for their own
> >>> purposes.
> >>
> >> Just to be clear: adding new counters would hardly be fixing existing
> >> tools that perform calculations based on existing counters. So we are
> >> already changing the "userspace-visible" portion in some way, and I have
> >> no idea what in vmstat we consider "stable".
> >>
> >> But I still don't think it's all that big of a deal except in some
> >> handcrafted scenarios hardly anybody cares about; the cover letter is
> >> also pretty clear on that.
> >
> > I may have been mistaken in the cover letter. According to the zswap data Usama
> > provided for servers, zero-filled pages accounted for about 1%.
>
> 10% not 1% (https://lore.kernel.org/all/20240612124750.2220726-1-usamaarif642@gmail.com/).
>

Sorry, my memory must have faded; I mixed up the 1% non-zero same-page
data and the 10% same-page data. Getting old :-)

> > So
> > really doesn't
> > matter too much, but I just checked with Hailong from our team—he has data
> > on same-page-filled usage in Android apps, which on average show 3-4%
> > same-page-filled, with over 85% being zero-filled. Some apps even reach
> > 6-7% zero-filled pages. We previously used these counters to profile
> > optimizations, but with zeromap now serving as the frontend for swap files,
> > those counters will disappear entirely from both zRAM and pswpin/pswpout
> > metrics, as folios are filtered earlier.
> >
> This is what I meant in https://lore.kernel.org/all/79deed1a-9b0e-42e0-be2f-f0c3ef5fee11@gmail.com/
> when I said it affects zram as well!
>
> I am happy with the current version of the patch, just need the change
> in Documentation/admin-guide/cgroup-v2.rst.

I will update the document and send version 3 tomorrow, incorporating both
your comments on "zero-filled" and David's suggestion regarding "move out of
memory".

>
> Thanks,
> Usama
>
> > Hailong essentially has a table that looks like the below which could be
> > collected from the existing counters:
> >
> > com.xxx.app     5% same-page-filled.    88% zero
> > com.yyy.app     6% same-page-filled.     85% zero
> > com.zzz.map   6.7 same-page-filled.       88% zero
> > ....
> >
> > Anyone on 6.12 will be unable to track zero-filled pages unless they
> > backport this patch from a newer kernel version if it doesn’t make it
> > into 6.12.
> >
> > Whether it's marked as 'Belongs-to:' or 'Fixes:', I'd prefer we aim to
> > land it in
> > 6.12 :-)
> >
> >>
> >> So I'll shut up now and let people figure out the naming first, and if a
> >> new counter is required at all :)
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >>
> >

Thanks
Barry
David Hildenbrand Nov. 5, 2024, 11:09 a.m. UTC | #19
On 05.11.24 10:15, Barry Song wrote:
> On Tue, Nov 5, 2024 at 9:23 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 05.11.24 04:40, Andrew Morton wrote:
>>> On Mon, 4 Nov 2024 13:32:55 +0100 David Hildenbrand <david@redhat.com> wrote:
>>>
>>>>> As mentioned above, this isn't about fixing a bug; it's simply to ensure
>>>>> that swap-related metrics don't disappear.
>>>>
>>>> Documentation/process/submitting-patches.rst:
>>>>
>>>> "A Fixes: tag indicates that the patch fixes an issue in a previous
>>>> commit. It is used to make it easy to determine where a bug originated,
>>>> which can help review a bug fix."
>>>>
>>>> If there is no BUG, I'm afraid you are abusing that tag.
>>>
>>> I think the abuse is reasonable.  We have no Should-be-included-with:.
>>
>> A "Belongs-to:" might make sense, for this kind of stuff that is still
>> only in an RFC. Or we update the doc to explicitly spell out this
>> special case of using "Fixes" to sort out something into the RC.
>>
>> Because if this would be already in a released kernel, it would get a
>> bit trickier: stable rules explicitly spell out "fix a real bug".
>>
>>>
>>> 0ca0c24e3211 is only in 6.12-rcX so this is the time to make
>>> userspace-visible tweaks, so the 6.12 interfaces are the same as the
>>> 6.13+ interfaces (which is what I think is happening here?)
>>   > > And including the Fixes in this patch might be useful to someone who is
>>> backporting 0ca0c24e3211 into some earlier kernel for their own
>>> purposes.
>>
>> Just to be clear: adding new counters would hardly be fixing existing
>> tools that perform calculations based on existing counters. So we are
>> already changing the "userspace-visible" portion in some way, and I have
>> no idea what in vmstat we consider "stable".
>>
>> But I still don't think it's all that big of a deal except in some
>> handcrafted scenarios hardly anybody cares about; the cover letter is
>> also pretty clear on that.
> 
> I may have been mistaken in the cover letter. According to the zswap data Usama
> provided for servers, zero-filled pages accounted for about 1%. So
> really doesn't
> matter too much, but I just checked with Hailong from our team—he has data
> on same-page-filled usage in Android apps, which on average show 3-4%
> same-page-filled, with over 85% being zero-filled. Some apps even reach
> 6-7% zero-filled pages. We previously used these counters to profile
> optimizations, but with zeromap now serving as the frontend for swap files,
> those counters will disappear entirely from both zRAM and pswpin/pswpout
> metrics, as folios are filtered earlier.
> 
> Hailong essentially has a table that looks like the below which could be
> collected from the existing counters:
> 
> com.xxx.app     5% same-page-filled.    88% zero
> com.yyy.app     6% same-page-filled.     85% zero
> com.zzz.map   6.7 same-page-filled.       88% zero
> ....
> 
> Anyone on 6.12 will be unable to track zero-filled pages unless they
> backport this patch from a newer kernel version if it doesn’t make it
> into 6.12.
> 
> Whether it's marked as 'Belongs-to:' or 'Fixes:', I'd prefer we aim to
> land it in
> 6.12 :-)

Thanks for sharing these numbers. Just for the record, I think the 
optimization is quite powerful. What I am not convinced of is that the 
missing stat is a big deal.

If I upgrade my car and it suddenly consumes 5% less fuel/energy, I (a) 
either won't notice at all or (b) am happy and don't care why; I won't 
start digging through the updated manual looking for some hidden gauges 
that might explain why :)

But I might be missing something important, and I do understand why it 
can be helpful to have it in black and white how much this optimization 
gives you. So feel free to move forward with this as you like; having it 
in 6.12 sounds very reasonable to me.
Nhat Pham Nov. 5, 2024, 7:35 p.m. UTC | #20
On Sat, Nov 2, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> When the proportion of folios from the zero map is small, missing their
> accounting may not significantly impact profiling. However, it’s easy
> to construct a scenario where this becomes an issue—for example,
> allocating 1 GB of memory, writing zeros from userspace, followed by
> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
> and swap-in counts seem to vanish into a black hole, potentially
> causing semantic ambiguity.
>
> We have two ways to address this:
>
> 1. Add a separate counter specifically for the zero map.
> 2. Continue using the current accounting, treating the zero map like
> a normal backend. (This aligns with the current behavior of zRAM
> when supporting same-page fills at the device level.)
>
> This patch adopts option 1 as pswpin/pswpout counters are that they
> only apply to IO done directly to the backend device (as noted by
> Nhat Pham).
>
> We can find these counters from /proc/vmstat (counters for the whole
> system) and memcg's memory.stat (counters for the interested memcg).
>
> For example:
>
> $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
> swpin_zero 1648
> swpout_zero 33536
>
> $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
> swpin_zero 3905
> swpout_zero 3985
>

LGTM FWIW, so I'll leave my review tag here:

Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Too many emails in this thread, but my opinions are:

1. A fix tag is appropriate. It's not a kernel bug per se, but it's
incredibly confusing, and can potentially throw off user space agents
who rely on the rate of change of these counters as signals.

2. I do think we should use a separate set of counters for this
optimization. No strong opinions regarding combining this with the
zswap counters, but it can get confusing for users when they
enable/disable zswap.

If we are to combine, I'd be much more comfortable if we have a
generic name, like the one David suggested in v1 ("swpin_skip" /
"swpout_skip"). This would still require some API change tho, so not
sure if this is the best approach? :)

It would also be appropriate if we bring back the same-filled
optimization (which should be doable in the swap ID world, but I
digress).
diff mbox series

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index db3799f1483e..984eb3c9d05b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1599,6 +1599,16 @@  The following nested keys are defined.
 	  pglazyfreed (npn)
 		Amount of reclaimed lazyfree pages
 
+	  swpin_zero
+		Number of pages moved into memory with zero content, meaning no
+		copy exists in the backend swapfile, allowing swap-in to avoid
+		I/O read overhead.
+
+	  swpout_zero
+		Number of pages moved out of memory with zero content, meaning no
+		copy is needed in the backend swapfile, allowing swap-out to avoid
+		I/O write overhead.
+
 	  zswpin
 		Number of pages moved in to memory from zswap.
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aed952d04132..f70d0958095c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -134,6 +134,8 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+		SWPIN_ZERO,
+		SWPOUT_ZERO,
 #ifdef CONFIG_KSM
 		KSM_SWPIN_COPY,
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e44d6e7591e..7b3503d12aaf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -441,6 +441,10 @@  static const unsigned int memcg_vm_event_stat[] = {
 	PGDEACTIVATE,
 	PGLAZYFREE,
 	PGLAZYFREED,
+#ifdef CONFIG_SWAP
+	SWPIN_ZERO,
+	SWPOUT_ZERO,
+#endif
 #ifdef CONFIG_ZSWAP
 	ZSWPIN,
 	ZSWPOUT,
diff --git a/mm/page_io.c b/mm/page_io.c
index 5d9b6e6cf96c..4b4ea8e49cf6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -204,7 +204,9 @@  static bool is_folio_zero_filled(struct folio *folio)
 
 static void swap_zeromap_folio_set(struct folio *folio)
 {
+	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
 	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	int nr_pages = folio_nr_pages(folio);
 	swp_entry_t entry;
 	unsigned int i;
 
@@ -212,6 +214,12 @@  static void swap_zeromap_folio_set(struct folio *folio)
 		entry = page_swap_entry(folio_page(folio, i));
 		set_bit(swp_offset(entry), sis->zeromap);
 	}
+
+	count_vm_events(SWPOUT_ZERO, nr_pages);
+	if (objcg) {
+		count_objcg_events(objcg, SWPOUT_ZERO, nr_pages);
+		obj_cgroup_put(objcg);
+	}
 }
 
 static void swap_zeromap_folio_clear(struct folio *folio)
@@ -507,6 +515,7 @@  static void sio_read_complete(struct kiocb *iocb, long ret)
 static bool swap_read_folio_zeromap(struct folio *folio)
 {
 	int nr_pages = folio_nr_pages(folio);
+	struct obj_cgroup *objcg;
 	bool is_zeromap;
 
 	/*
@@ -521,6 +530,13 @@  static bool swap_read_folio_zeromap(struct folio *folio)
 	if (!is_zeromap)
 		return false;
 
+	objcg = get_obj_cgroup_from_folio(folio);
+	count_vm_events(SWPIN_ZERO, nr_pages);
+	if (objcg) {
+		count_objcg_events(objcg, SWPIN_ZERO, nr_pages);
+		obj_cgroup_put(objcg);
+	}
+
 	folio_zero_range(folio, 0, folio_size(folio));
 	folio_mark_uptodate(folio);
 	return true;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 22a294556b58..c8ef7352f9ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1418,6 +1418,8 @@  const char * const vmstat_text[] = {
 #ifdef CONFIG_SWAP
 	"swap_ra",
 	"swap_ra_hit",
+	"swpin_zero",
+	"swpout_zero",
 #ifdef CONFIG_KSM
 	"ksm_swpin_copy",
 #endif
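
Not part of the patch, but for anyone who wants to watch the new counters
move: a minimal userspace sketch below dirties an anonymous mapping with
zeros, asks the kernel to page it out, and faults it back in. It assumes an
active swap device, a kernel carrying this patch, and that MADV_PAGEOUT is
available (a fallback define is included for older headers). On such a
kernel, swpout_zero and swpin_zero in /proc/vmstat should grow by roughly
the mapping size in pages while pswpin/pswpout stay flat.

/*
 * Rough test sketch, not part of the patch: exercise the zeromap swap-out
 * and swap-in paths so that swpout_zero/swpin_zero in /proc/vmstat move.
 * Assumes an active swap device; dirty anonymous pages cannot be paged
 * out without one.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* fallback for older userspace headers */
#endif

#define SZ (64UL << 20)		/* 64 MB of zero-filled anonymous memory */

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned long i;
	volatile char sink = 0;
	char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Write zeros so every page is dirty but still zero-filled */
	memset(p, 0, SZ);

	/* Reclaim the range; zero pages should bump swpout_zero, not pswpout */
	if (madvise(p, SZ, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	/* Fault everything back in; this should bump swpin_zero */
	for (i = 0; i < SZ; i += page)
		sink += p[i];

	munmap(p, SZ);
	return sink;	/* keep the reads from being optimized away */
}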