diff mbox series

[RFC,v1,2/4] mm: vmstat: Per mTHP-size zswap_store vmstat event counters.

Message ID 20240814062830.26833-3-kanchana.p.sridhar@intel.com (mailing list archive)
State New
Headers show
Series mm: ZSWAP swap-out of mTHP folios | expand

Commit Message

Sridhar, Kanchana P Aug. 14, 2024, 6:28 a.m. UTC
Added vmstat event counters per mTHP-size that can be used to account
for folios of different sizes being successfully stored in ZSWAP.

For this RFC, it is not clear if these zswpout counters should instead
be added as part of the existing mTHP stats in
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats.

The following is also a viable option, should it make better sense,
for instance, as:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout.

If so, we would be able to distinguish between mTHP zswap and
non-zswap swapouts through:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout

and

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout

respectively.

Comments would be appreciated as to which approach is preferable.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/vm_event_item.h | 15 +++++++++++++++
 mm/vmstat.c                   | 15 +++++++++++++++
 2 files changed, 30 insertions(+)

Comments

Barry Song Aug. 14, 2024, 7:48 a.m. UTC | #1
On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added vmstat event counters per mTHP-size that can be used to account
> for folios of different sizes being successfully stored in ZSWAP.
>
> For this RFC, it is not clear if these zswpout counters should instead
> be added as part of the existing mTHP stats in
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats.
>
> The following is also a viable option, should it make better sense,
> for instance, as:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout.
>
> If so, we would be able to distinguish between mTHP zswap and
> non-zswap swapouts through:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> and
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
>
> respectively.
>
> Comments would be appreciated as to which approach is preferable.

Even though swapout might go through zswap, from the perspective of
the mm core, it shouldn't be aware of that. Shouldn't zswpout be part
of swpout? Why are they separate? no matter if a mTHP has been
put in zswap, it has been swapped-out to mm-core? No?


>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  include/linux/vm_event_item.h | 15 +++++++++++++++
>  mm/vmstat.c                   | 15 +++++++++++++++
>  2 files changed, 30 insertions(+)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 747943bc8cc2..2451bcfcf05c 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 THP_ZERO_PAGE_ALLOC,
>                 THP_ZERO_PAGE_ALLOC_FAILED,
>                 THP_SWPOUT,
> +#ifdef CONFIG_ZSWAP
> +               ZSWPOUT_PMD_THP_FOLIO,
> +#endif
>                 THP_SWPOUT_FALLBACK,
>  #endif
>  #ifdef CONFIG_MEMORY_BALLOON
> @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 ZSWPIN,
>                 ZSWPOUT,
>                 ZSWPWB,
> +               ZSWPOUT_4KB_FOLIO,
> +#ifdef CONFIG_THP_SWAP
> +               mTHP_ZSWPOUT_8kB,
> +               mTHP_ZSWPOUT_16kB,
> +               mTHP_ZSWPOUT_32kB,
> +               mTHP_ZSWPOUT_64kB,
> +               mTHP_ZSWPOUT_128kB,
> +               mTHP_ZSWPOUT_256kB,
> +               mTHP_ZSWPOUT_512kB,
> +               mTHP_ZSWPOUT_1024kB,
> +               mTHP_ZSWPOUT_2048kB,
> +#endif

This implementation hardcodes assumptions about the page size being 4KB,
but page sizes can vary, and so can the THP orders?

>  #endif
>  #ifdef CONFIG_X86
>                 DIRECT_MAP_LEVEL2_SPLIT,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8507c497218b..0e66c8b0c486 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = {
>         "thp_zero_page_alloc",
>         "thp_zero_page_alloc_failed",
>         "thp_swpout",
> +#ifdef CONFIG_ZSWAP
> +       "zswpout_pmd_thp_folio",
> +#endif
>         "thp_swpout_fallback",
>  #endif
>  #ifdef CONFIG_MEMORY_BALLOON
> @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = {
>         "zswpin",
>         "zswpout",
>         "zswpwb",
> +       "zswpout_4kb_folio",
> +#ifdef CONFIG_THP_SWAP
> +       "mthp_zswpout_8kb",
> +       "mthp_zswpout_16kb",
> +       "mthp_zswpout_32kb",
> +       "mthp_zswpout_64kb",
> +       "mthp_zswpout_128kb",
> +       "mthp_zswpout_256kb",
> +       "mthp_zswpout_512kb",
> +       "mthp_zswpout_1024kb",
> +       "mthp_zswpout_2048kb",
> +#endif

The issue here is that the number of THP orders
can vary across different platforms.

>  #endif
>  #ifdef CONFIG_X86
>         "direct_map_level2_splits",
> --
> 2.27.0
>

Thanks
Barry
Sridhar, Kanchana P Aug. 14, 2024, 5:40 p.m. UTC | #2
Hi Barry,

> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Wednesday, August 14, 2024 12:49 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store
> vmstat event counters.
> 
> On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added vmstat event counters per mTHP-size that can be used to account
> > for folios of different sizes being successfully stored in ZSWAP.
> >
> > For this RFC, it is not clear if these zswpout counters should instead
> > be added as part of the existing mTHP stats in
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats.
> >
> > The following is also a viable option, should it make better sense,
> > for instance, as:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout.
> >
> > If so, we would be able to distinguish between mTHP zswap and
> > non-zswap swapouts through:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > and
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
> >
> > respectively.
> >
> > Comments would be appreciated as to which approach is preferable.
> 
> Even though swapout might go through zswap, from the perspective of
> the mm core, it shouldn't be aware of that. Shouldn't zswpout be part
> of swpout? Why are they separate? no matter if a mTHP has been
> put in zswap, it has been swapped-out to mm-core? No?

Thanks for the code review comments. This is a good point. I was keeping in
mind the convention used by existing vmstat event counters that distinguish
zswpout/zswpin from pswpout/pswpin events.

If we want to keep the distinction in mTHP swapouts, would adding a
separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok?

In any case, it looks like all that would be needed is a call to
count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the
general case.

I will make this change in v2, depending on whether or not the
separation of zswpout vs. non-zswap swpout is recommended for
mTHP.

> 
> 
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  include/linux/vm_event_item.h | 15 +++++++++++++++
> >  mm/vmstat.c                   | 15 +++++++++++++++
> >  2 files changed, 30 insertions(+)
> >
> > diff --git a/include/linux/vm_event_item.h
> b/include/linux/vm_event_item.h
> > index 747943bc8cc2..2451bcfcf05c 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT,
> PSWPIN, PSWPOUT,
> >                 THP_ZERO_PAGE_ALLOC,
> >                 THP_ZERO_PAGE_ALLOC_FAILED,
> >                 THP_SWPOUT,
> > +#ifdef CONFIG_ZSWAP
> > +               ZSWPOUT_PMD_THP_FOLIO,
> > +#endif
> >                 THP_SWPOUT_FALLBACK,
> >  #endif
> >  #ifdef CONFIG_MEMORY_BALLOON
> > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT,
> PSWPIN, PSWPOUT,
> >                 ZSWPIN,
> >                 ZSWPOUT,
> >                 ZSWPWB,
> > +               ZSWPOUT_4KB_FOLIO,
> > +#ifdef CONFIG_THP_SWAP
> > +               mTHP_ZSWPOUT_8kB,
> > +               mTHP_ZSWPOUT_16kB,
> > +               mTHP_ZSWPOUT_32kB,
> > +               mTHP_ZSWPOUT_64kB,
> > +               mTHP_ZSWPOUT_128kB,
> > +               mTHP_ZSWPOUT_256kB,
> > +               mTHP_ZSWPOUT_512kB,
> > +               mTHP_ZSWPOUT_1024kB,
> > +               mTHP_ZSWPOUT_2048kB,
> > +#endif
> 
> This implementation hardcodes assumptions about the page size being 4KB,
> but page sizes can vary, and so can the THP orders?

Agreed, will address in v2.

> 
> >  #endif
> >  #ifdef CONFIG_X86
> >                 DIRECT_MAP_LEVEL2_SPLIT,
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 8507c497218b..0e66c8b0c486 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = {
> >         "thp_zero_page_alloc",
> >         "thp_zero_page_alloc_failed",
> >         "thp_swpout",
> > +#ifdef CONFIG_ZSWAP
> > +       "zswpout_pmd_thp_folio",
> > +#endif
> >         "thp_swpout_fallback",
> >  #endif
> >  #ifdef CONFIG_MEMORY_BALLOON
> > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = {
> >         "zswpin",
> >         "zswpout",
> >         "zswpwb",
> > +       "zswpout_4kb_folio",
> > +#ifdef CONFIG_THP_SWAP
> > +       "mthp_zswpout_8kb",
> > +       "mthp_zswpout_16kb",
> > +       "mthp_zswpout_32kb",
> > +       "mthp_zswpout_64kb",
> > +       "mthp_zswpout_128kb",
> > +       "mthp_zswpout_256kb",
> > +       "mthp_zswpout_512kb",
> > +       "mthp_zswpout_1024kb",
> > +       "mthp_zswpout_2048kb",
> > +#endif
> 
> The issue here is that the number of THP orders
> can vary across different platforms.

Agreed, will address in v2.

Thanks,
Kanchana

> 
> >  #endif
> >  #ifdef CONFIG_X86
> >         "direct_map_level2_splits",
> > --
> > 2.27.0
> >
> 
> Thanks
> Barry
Barry Song Aug. 14, 2024, 11:24 p.m. UTC | #3
On Thu, Aug 15, 2024 at 5:40 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Barry,
>
> > -----Original Message-----
> > From: Barry Song <21cnbao@gmail.com>
> > Sent: Wednesday, August 14, 2024 12:49 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store
> > vmstat event counters.
> >
> > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > Added vmstat event counters per mTHP-size that can be used to account
> > > for folios of different sizes being successfully stored in ZSWAP.
> > >
> > > For this RFC, it is not clear if these zswpout counters should instead
> > > be added as part of the existing mTHP stats in
> > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats.
> > >
> > > The following is also a viable option, should it make better sense,
> > > for instance, as:
> > >
> > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout.
> > >
> > > If so, we would be able to distinguish between mTHP zswap and
> > > non-zswap swapouts through:
> > >
> > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> > >
> > > and
> > >
> > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
> > >
> > > respectively.
> > >
> > > Comments would be appreciated as to which approach is preferable.
> >
> > Even though swapout might go through zswap, from the perspective of
> > the mm core, it shouldn't be aware of that. Shouldn't zswpout be part
> > of swpout? Why are they separate? no matter if a mTHP has been
> > put in zswap, it has been swapped-out to mm-core? No?
>
> Thanks for the code review comments. This is a good point. I was keeping in
> mind the convention used by existing vmstat event counters that distinguish
> zswpout/zswpin from pswpout/pswpin events.
>
> If we want to keep the distinction in mTHP swapouts, would adding a
> separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok?
>

I'm not entirely sure how important the zswpout counter is. To me, it doesn't
seem as critical as swpout and swpout_fallback, which are more useful for
system profiling. zswapout feels more like an internal detail related to
how the swap-out process is handled? If this is the case, we might not
need this per-size counter.

Otherwise, I believe sysfs is a better place to avoid all the chaos in vmstat
to handle various orders and sizes. So the question is, per-size zswpout
counter is really important or just for debugging purposes?

> In any case, it looks like all that would be needed is a call to
> count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the
> general case.
>
> I will make this change in v2, depending on whether or not the
> separation of zswpout vs. non-zswap swpout is recommended for
> mTHP.
>
> >
> >
> > >
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > ---
> > >  include/linux/vm_event_item.h | 15 +++++++++++++++
> > >  mm/vmstat.c                   | 15 +++++++++++++++
> > >  2 files changed, 30 insertions(+)
> > >
> > > diff --git a/include/linux/vm_event_item.h
> > b/include/linux/vm_event_item.h
> > > index 747943bc8cc2..2451bcfcf05c 100644
> > > --- a/include/linux/vm_event_item.h
> > > +++ b/include/linux/vm_event_item.h
> > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT,
> > PSWPIN, PSWPOUT,
> > >                 THP_ZERO_PAGE_ALLOC,
> > >                 THP_ZERO_PAGE_ALLOC_FAILED,
> > >                 THP_SWPOUT,
> > > +#ifdef CONFIG_ZSWAP
> > > +               ZSWPOUT_PMD_THP_FOLIO,
> > > +#endif
> > >                 THP_SWPOUT_FALLBACK,
> > >  #endif
> > >  #ifdef CONFIG_MEMORY_BALLOON
> > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT,
> > PSWPIN, PSWPOUT,
> > >                 ZSWPIN,
> > >                 ZSWPOUT,
> > >                 ZSWPWB,
> > > +               ZSWPOUT_4KB_FOLIO,
> > > +#ifdef CONFIG_THP_SWAP
> > > +               mTHP_ZSWPOUT_8kB,
> > > +               mTHP_ZSWPOUT_16kB,
> > > +               mTHP_ZSWPOUT_32kB,
> > > +               mTHP_ZSWPOUT_64kB,
> > > +               mTHP_ZSWPOUT_128kB,
> > > +               mTHP_ZSWPOUT_256kB,
> > > +               mTHP_ZSWPOUT_512kB,
> > > +               mTHP_ZSWPOUT_1024kB,
> > > +               mTHP_ZSWPOUT_2048kB,
> > > +#endif
> >
> > This implementation hardcodes assumptions about the page size being 4KB,
> > but page sizes can vary, and so can the THP orders?
>
> Agreed, will address in v2.
>
> >
> > >  #endif
> > >  #ifdef CONFIG_X86
> > >                 DIRECT_MAP_LEVEL2_SPLIT,
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 8507c497218b..0e66c8b0c486 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = {
> > >         "thp_zero_page_alloc",
> > >         "thp_zero_page_alloc_failed",
> > >         "thp_swpout",
> > > +#ifdef CONFIG_ZSWAP
> > > +       "zswpout_pmd_thp_folio",
> > > +#endif
> > >         "thp_swpout_fallback",
> > >  #endif
> > >  #ifdef CONFIG_MEMORY_BALLOON
> > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = {
> > >         "zswpin",
> > >         "zswpout",
> > >         "zswpwb",
> > > +       "zswpout_4kb_folio",
> > > +#ifdef CONFIG_THP_SWAP
> > > +       "mthp_zswpout_8kb",
> > > +       "mthp_zswpout_16kb",
> > > +       "mthp_zswpout_32kb",
> > > +       "mthp_zswpout_64kb",
> > > +       "mthp_zswpout_128kb",
> > > +       "mthp_zswpout_256kb",
> > > +       "mthp_zswpout_512kb",
> > > +       "mthp_zswpout_1024kb",
> > > +       "mthp_zswpout_2048kb",
> > > +#endif
> >
> > The issue here is that the number of THP orders
> > can vary across different platforms.
>
> Agreed, will address in v2.
>
> Thanks,
> Kanchana
>
> >
> > >  #endif
> > >  #ifdef CONFIG_X86
> > >         "direct_map_level2_splits",
> > > --
> > > 2.27.0
> > >

Thanks
Barry
Sridhar, Kanchana P Aug. 15, 2024, 1:37 a.m. UTC | #4
Hi Barry,

> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Wednesday, August 14, 2024 4:25 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store
> vmstat event counters.
> 
> On Thu, Aug 15, 2024 at 5:40 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Barry,
> >
> > > -----Original Message-----
> > > From: Barry Song <21cnbao@gmail.com>
> > > Sent: Wednesday, August 14, 2024 12:49 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> > > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> akpm@linux-
> > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [RFC PATCH v1 2/4] mm: vmstat: Per mTHP-size zswap_store
> > > vmstat event counters.
> > >
> > > On Wed, Aug 14, 2024 at 6:28 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > >
> > > > Added vmstat event counters per mTHP-size that can be used to account
> > > > for folios of different sizes being successfully stored in ZSWAP.
> > > >
> > > > For this RFC, it is not clear if these zswpout counters should instead
> > > > be added as part of the existing mTHP stats in
> > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats.
> > > >
> > > > The following is also a viable option, should it make better sense,
> > > > for instance, as:
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout.
> > > >
> > > > If so, we would be able to distinguish between mTHP zswap and
> > > > non-zswap swapouts through:
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> > > >
> > > > and
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
> > > >
> > > > respectively.
> > > >
> > > > Comments would be appreciated as to which approach is preferable.
> > >
> > > Even though swapout might go through zswap, from the perspective of
> > > the mm core, it shouldn't be aware of that. Shouldn't zswpout be part
> > > of swpout? Why are they separate? no matter if a mTHP has been
> > > put in zswap, it has been swapped-out to mm-core? No?
> >
> > Thanks for the code review comments. This is a good point. I was keeping in
> > mind the convention used by existing vmstat event counters that distinguish
> > zswpout/zswpin from pswpout/pswpin events.
> >
> > If we want to keep the distinction in mTHP swapouts, would adding a
> > separate MTHP_STAT_ZSWPOUT to "enum mthp_stat_item" be Ok?
> >
> 
> I'm not entirely sure how important the zswpout counter is. To me, it doesn't
> seem as critical as swpout and swpout_fallback, which are more useful for
> system profiling. zswapout feels more like an internal detail related to
> how the swap-out process is handled? If this is the case, we might not
> need this per-size counter.
> 
> Otherwise, I believe sysfs is a better place to avoid all the chaos in vmstat
> to handle various orders and sizes. So the question is, per-size zswpout
> counter is really important or just for debugging purposes?

I agree, sysfs would be a cleaner mTHP stats accounting solution, given the
existing mTHP swpout stats under the per-order
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout.

I personally find distinct zswap vs. bdev/fs swapout accounting useful
for debugging, and for overall reclaim path characterization.
For instance, the impact of different zswap compressors' compress latency
on zswpout activity for a given workload. Is a slowdown in compress latency
causing active/hot memory to be reclaimed and immediately faulted in?
Does better zswap compress efficiency co-relate to more cold memory
as mTHP to be reclaimed? How does the reclaim path efficiency
improvement resulting from improving zswap_store mTHP performance
co-relate with ZSWAP utilization and memory savings? I have found these
counters useful in understanding some of these characteristics.

I also believe it helps to account for the number of mTHP being stored in
different compress tiers. For e.g. how many mTHP were stored in zswap vs.
being rejected and stored in the backing swap device. This could help say
in provisioning zswap memory, and knowing the impact of zswap compress
path latency on scaling.

Another interesting characteristic that mTHP zswpout accounting could help
understand would be compressor incompressibility and/or zpool fragmentation;
and being able to better co-relate the zswap/reject_* sysfs counters with
mTHP [z]swpout stats.

Look forward to inputs from yourself and others on the direction and next steps.

Thanks,
Kanchana

> 
> > In any case, it looks like all that would be needed is a call to
> > count_mthp_stat(folio_order(folio), MTHP_STAT_[Z]SWPOUT) in the
> > general case.
> >
> > I will make this change in v2, depending on whether or not the
> > separation of zswpout vs. non-zswap swpout is recommended for
> > mTHP.
> >
> > >
> > >
> > > >
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > > ---
> > > >  include/linux/vm_event_item.h | 15 +++++++++++++++
> > > >  mm/vmstat.c                   | 15 +++++++++++++++
> > > >  2 files changed, 30 insertions(+)
> > > >
> > > > diff --git a/include/linux/vm_event_item.h
> > > b/include/linux/vm_event_item.h
> > > > index 747943bc8cc2..2451bcfcf05c 100644
> > > > --- a/include/linux/vm_event_item.h
> > > > +++ b/include/linux/vm_event_item.h
> > > > @@ -114,6 +114,9 @@ enum vm_event_item { PGPGIN, PGPGOUT,
> > > PSWPIN, PSWPOUT,
> > > >                 THP_ZERO_PAGE_ALLOC,
> > > >                 THP_ZERO_PAGE_ALLOC_FAILED,
> > > >                 THP_SWPOUT,
> > > > +#ifdef CONFIG_ZSWAP
> > > > +               ZSWPOUT_PMD_THP_FOLIO,
> > > > +#endif
> > > >                 THP_SWPOUT_FALLBACK,
> > > >  #endif
> > > >  #ifdef CONFIG_MEMORY_BALLOON
> > > > @@ -143,6 +146,18 @@ enum vm_event_item { PGPGIN, PGPGOUT,
> > > PSWPIN, PSWPOUT,
> > > >                 ZSWPIN,
> > > >                 ZSWPOUT,
> > > >                 ZSWPWB,
> > > > +               ZSWPOUT_4KB_FOLIO,
> > > > +#ifdef CONFIG_THP_SWAP
> > > > +               mTHP_ZSWPOUT_8kB,
> > > > +               mTHP_ZSWPOUT_16kB,
> > > > +               mTHP_ZSWPOUT_32kB,
> > > > +               mTHP_ZSWPOUT_64kB,
> > > > +               mTHP_ZSWPOUT_128kB,
> > > > +               mTHP_ZSWPOUT_256kB,
> > > > +               mTHP_ZSWPOUT_512kB,
> > > > +               mTHP_ZSWPOUT_1024kB,
> > > > +               mTHP_ZSWPOUT_2048kB,
> > > > +#endif
> > >
> > > This implementation hardcodes assumptions about the page size being
> 4KB,
> > > but page sizes can vary, and so can the THP orders?
> >
> > Agreed, will address in v2.
> >
> > >
> > > >  #endif
> > > >  #ifdef CONFIG_X86
> > > >                 DIRECT_MAP_LEVEL2_SPLIT,
> > > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > > index 8507c497218b..0e66c8b0c486 100644
> > > > --- a/mm/vmstat.c
> > > > +++ b/mm/vmstat.c
> > > > @@ -1375,6 +1375,9 @@ const char * const vmstat_text[] = {
> > > >         "thp_zero_page_alloc",
> > > >         "thp_zero_page_alloc_failed",
> > > >         "thp_swpout",
> > > > +#ifdef CONFIG_ZSWAP
> > > > +       "zswpout_pmd_thp_folio",
> > > > +#endif
> > > >         "thp_swpout_fallback",
> > > >  #endif
> > > >  #ifdef CONFIG_MEMORY_BALLOON
> > > > @@ -1405,6 +1408,18 @@ const char * const vmstat_text[] = {
> > > >         "zswpin",
> > > >         "zswpout",
> > > >         "zswpwb",
> > > > +       "zswpout_4kb_folio",
> > > > +#ifdef CONFIG_THP_SWAP
> > > > +       "mthp_zswpout_8kb",
> > > > +       "mthp_zswpout_16kb",
> > > > +       "mthp_zswpout_32kb",
> > > > +       "mthp_zswpout_64kb",
> > > > +       "mthp_zswpout_128kb",
> > > > +       "mthp_zswpout_256kb",
> > > > +       "mthp_zswpout_512kb",
> > > > +       "mthp_zswpout_1024kb",
> > > > +       "mthp_zswpout_2048kb",
> > > > +#endif
> > >
> > > The issue here is that the number of THP orders
> > > can vary across different platforms.
> >
> > Agreed, will address in v2.
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > >  #endif
> > > >  #ifdef CONFIG_X86
> > > >         "direct_map_level2_splits",
> > > > --
> > > > 2.27.0
> > > >
> 
> Thanks
> Barry
diff mbox series

Patch

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943bc8cc2..2451bcfcf05c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -114,6 +114,9 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
+#ifdef CONFIG_ZSWAP
+		ZSWPOUT_PMD_THP_FOLIO,
+#endif
 		THP_SWPOUT_FALLBACK,
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
@@ -143,6 +146,18 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		ZSWPIN,
 		ZSWPOUT,
 		ZSWPWB,
+		ZSWPOUT_4KB_FOLIO,
+#ifdef CONFIG_THP_SWAP
+		mTHP_ZSWPOUT_8kB,
+		mTHP_ZSWPOUT_16kB,
+		mTHP_ZSWPOUT_32kB,
+		mTHP_ZSWPOUT_64kB,
+		mTHP_ZSWPOUT_128kB,
+		mTHP_ZSWPOUT_256kB,
+		mTHP_ZSWPOUT_512kB,
+		mTHP_ZSWPOUT_1024kB,
+		mTHP_ZSWPOUT_2048kB,
+#endif
 #endif
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8507c497218b..0e66c8b0c486 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1375,6 +1375,9 @@  const char * const vmstat_text[] = {
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 	"thp_swpout",
+#ifdef CONFIG_ZSWAP
+	"zswpout_pmd_thp_folio",
+#endif
 	"thp_swpout_fallback",
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
@@ -1405,6 +1408,18 @@  const char * const vmstat_text[] = {
 	"zswpin",
 	"zswpout",
 	"zswpwb",
+	"zswpout_4kb_folio",
+#ifdef CONFIG_THP_SWAP
+	"mthp_zswpout_8kb",
+	"mthp_zswpout_16kb",
+	"mthp_zswpout_32kb",
+	"mthp_zswpout_64kb",
+	"mthp_zswpout_128kb",
+	"mthp_zswpout_256kb",
+	"mthp_zswpout_512kb",
+	"mthp_zswpout_1024kb",
+	"mthp_zswpout_2048kb",
+#endif
 #endif
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",