
arm64: enable THP_SWAP for arm64

Message ID 20220524071403.128644-1-21cnbao@gmail.com (mailing list archive)
State New
Series: arm64: enable THP_SWAP for arm64

Commit Message

Barry Song May 24, 2022, 7:14 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

THP_SWAP has been proven to improve swap throughput significantly
on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay
splitting THP after swapped out").
As long as arm64 uses a 4KB page size, it is quite similar to x86_64
in having 2MB PMD THPs, so we should get a similar improvement.
For other page sizes such as 16KB and 64KB, the PMD might be too
large, and negative side effects such as IO latency might become a
problem. Thus, we can only safely enable the counterpart of x86_64.

Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

Comments

Catalin Marinas May 24, 2022, 8:12 a.m. UTC | #1
On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> THP_SWAP has been proved to improve the swap throughput significantly
> on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay
> splitting THP after swapped out").
> As long as arm64 uses 4K page size, it is quite similar with x86_64
> by having 2MB PMD THP. So we are going to get similar improvement.
> For other page sizes such as 16KB and 64KB, PMD might be too large.
> Negative side effects such as IO latency might be a problem. Thus,
> we can only safely enable the counterpart of X86_64.
> 
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  arch/arm64/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index d550f5acfaf3..8e3771c56fbf 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -98,6 +98,7 @@ config ARM64
>  	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>  	select ARCH_WANT_LD_ORPHAN_WARN
>  	select ARCH_WANTS_NO_INSTR
> +	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES

I'm not opposed to this but I think it would break pages mapped with
PROT_MTE. We have an assumption in mte_sync_tags() that compound pages
are not swapped out (or in). With MTE, we store the tags in a slab
object (128-bytes per swapped page) and restore them when pages are
swapped in. At some point we may teach the core swap code about such
metadata but in the meantime that was the easiest way.
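The tag bookkeeping Catalin describes can be sketched as follows. This is a standalone userspace model, not the kernel's actual code: the names `save_tags`/`restore_tags` and the flat `tag_store` array are illustrative stand-ins (the kernel keys its storage by swap entry), but the 128-byte figure follows from MTE's layout: one 4-bit tag per 16-byte granule, so a 4KB page has 256 granules, i.e. 128 bytes of packed tags.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE   4096
#define MTE_GRANULE 16                            /* one 4-bit tag per 16 bytes */
#define TAG_BYTES   (PAGE_SIZE / MTE_GRANULE / 2) /* 256 tags x 4 bits = 128 bytes */
#define MAX_SLOTS   1024

/* Illustrative stand-in for the kernel's per-swap-entry tag storage. */
static unsigned char *tag_store[MAX_SLOTS];

/* On swap-out: copy the page's packed tags into a 128-byte object. */
static void save_tags(unsigned long swap_entry, const unsigned char *tags)
{
	unsigned char *buf = malloc(TAG_BYTES);
	assert(buf);
	memcpy(buf, tags, TAG_BYTES);
	tag_store[swap_entry] = buf;
}

/* On swap-in: restore tags for this swap entry; returns 1 if any were saved. */
static int restore_tags(unsigned long swap_entry, unsigned char *tags)
{
	unsigned char *buf = tag_store[swap_entry];

	if (!buf)
		return 0;
	memcpy(tags, buf, TAG_BYTES);
	tag_store[swap_entry] = NULL;
	free(buf);
	return 1;
}
```

The key point for this thread: the machinery handles exactly one base page's worth of tags per swap entry, which is why a compound page swapped out whole would fall outside its assumptions.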
Barry Song May 24, 2022, 10:05 a.m. UTC | #2
On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > THP_SWAP has been proved to improve the swap throughput significantly
> > on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay
> > splitting THP after swapped out").
> > As long as arm64 uses 4K page size, it is quite similar with x86_64
> > by having 2MB PMD THP. So we are going to get similar improvement.
> > For other page sizes such as 16KB and 64KB, PMD might be too large.
> > Negative side effects such as IO latency might be a problem. Thus,
> > we can only safely enable the counterpart of X86_64.
> >
> > Cc: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Shaohua Li <shli@kernel.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  arch/arm64/Kconfig | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index d550f5acfaf3..8e3771c56fbf 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -98,6 +98,7 @@ config ARM64
> >       select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> >       select ARCH_WANT_LD_ORPHAN_WARN
> >       select ARCH_WANTS_NO_INSTR
> > +     select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>
> I'm not opposed to this but I think it would break pages mapped with
> PROT_MTE. We have an assumption in mte_sync_tags() that compound pages
> are not swapped out (or in). With MTE, we store the tags in a slab

I assume you mean mte_sync_tags() requires that THP not be swapped out as a
whole; even without THP_SWP, THP is still swapped out after being split. MTE
doesn't stop THP from being swapped out as a number of split pages, does it?

> object (128-bytes per swapped page) and restore them when pages are
> swapped in. At some point we may teach the core swap code about such
> metadata but in the meantime that was the easiest way.
>

If my previous assumption is true, the easiest way to enable THP_SWP for now
might be to always let mm fall back to the splitting path on MTE hardware.
For now, I care more about THP_SWP, as none of my hardware has MTE.

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 45c358538f13..d55a2a3e41a9 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -44,6 +44,8 @@
        __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+#define arch_thp_swp_supported !system_supports_mte
+
 /*
  * Outside of a few very special situations (e.g. hibernation), we always
  * use broadcast TLB invalidation instructions, therefore a spurious page
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2999190adc22..064b6b03df9e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio,
        return split_huge_page_to_list(&folio->page, list);
 }

+/*
+ * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
+ * limitations in the implementation like arm64 MTE can override this to
+ * false
+ */
+#ifndef arch_thp_swp_supported
+static inline bool arch_thp_swp_supported(void)
+{
+       return true;
+}
+#endif
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 2b5531840583..dde685836328 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page)
        entry.val = 0;

        if (PageTransHuge(page)) {
-               if (IS_ENABLED(CONFIG_THP_SWAP))
+               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
                        get_swap_pages(1, &entry, HPAGE_PMD_NR);
                goto out;
        }

> --
> Catalin

Thanks
Barry
Barry Song May 24, 2022, 11:15 a.m. UTC | #3
On Tue, May 24, 2022 at 10:05 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > THP_SWAP has been proved to improve the swap throughput significantly
> > > on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay
> > > splitting THP after swapped out").
> > > As long as arm64 uses 4K page size, it is quite similar with x86_64
> > > by having 2MB PMD THP. So we are going to get similar improvement.
> > > For other page sizes such as 16KB and 64KB, PMD might be too large.
> > > Negative side effects such as IO latency might be a problem. Thus,
> > > we can only safely enable the counterpart of X86_64.
> > >
> > > Cc: "Huang, Ying" <ying.huang@intel.com>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Shaohua Li <shli@kernel.org>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  arch/arm64/Kconfig | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index d550f5acfaf3..8e3771c56fbf 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -98,6 +98,7 @@ config ARM64
> > >       select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> > >       select ARCH_WANT_LD_ORPHAN_WARN
> > >       select ARCH_WANTS_NO_INSTR
> > > +     select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> >
> > I'm not opposed to this but I think it would break pages mapped with
> > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages
> > are not swapped out (or in). With MTE, we store the tags in a slab
>
> I assume you mean mte_sync_tags() require that THP is not swapped as a whole,
> as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop
> THP from swapping through a couple of splitted pages, does it?
>
> > object (128-bytes per swapped page) and restore them when pages are
> > swapped in. At some point we may teach the core swap code about such
> > metadata but in the meantime that was the easiest way.
> >
>
> If my previous assumption is true,  the easiest way to enable THP_SWP
> for this moment
> might be always letting mm fallback to the splitting way for MTE
> hardware. For this
> moment, I care about THP_SWP more as none of my hardware has MTE.
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 45c358538f13..d55a2a3e41a9 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -44,6 +44,8 @@
>         __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +#define arch_thp_swp_supported !system_supports_mte
> +
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2999190adc22..064b6b03df9e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio,
>         return split_huge_page_to_list(&folio->page, list);
>  }
>
> +/*
> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> + * limitations in the implementation like arm64 MTE can override this to
> + * false
> + */
> +#ifndef arch_thp_swp_supported
> +static inline bool arch_thp_swp_supported(void)
> +{
> +       return true;
> +}
> +#endif
> +
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 2b5531840583..dde685836328 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page)
>         entry.val = 0;
>
>         if (PageTransHuge(page)) {
> -               if (IS_ENABLED(CONFIG_THP_SWAP))
> +               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
>                         get_swap_pages(1, &entry, HPAGE_PMD_NR);
>                 goto out;
>         }
>

Could I actually go further and only split MTE-tagged pages?

For mm core:

+/*
+ * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
+ * limitations in the implementation like arm64 MTE can override this to
+ * false
+ */
+#ifndef arch_thp_swp_supported
+static inline bool arch_thp_swp_supported(struct page *page)
+{
+       return true;
+}
+#endif
+

For arm64:
+#define arch_thp_swp_supported(page) !test_bit(PG_mte_tagged, &page->flags)

But I don't have MTE hardware to test on. So to me, totally disabling THP_SWP
is safer.

thoughts?
> > --
> > Catalin
>
> Thanks
> Barry
Catalin Marinas May 24, 2022, 7:14 p.m. UTC | #4
On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote:
> On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index d550f5acfaf3..8e3771c56fbf 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -98,6 +98,7 @@ config ARM64
> > >       select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> > >       select ARCH_WANT_LD_ORPHAN_WARN
> > >       select ARCH_WANTS_NO_INSTR
> > > +     select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> >
> > I'm not opposed to this but I think it would break pages mapped with
> > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages
> > are not swapped out (or in). With MTE, we store the tags in a slab
> 
> I assume you mean mte_sync_tags() require that THP is not swapped as a whole,
> as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop
> THP from swapping through a couple of splitted pages, does it?

That's correct; split THP pages are swapped out/in just fine.

> > object (128-bytes per swapped page) and restore them when pages are
> > swapped in. At some point we may teach the core swap code about such
> > metadata but in the meantime that was the easiest way.
> 
> If my previous assumption is true,  the easiest way to enable THP_SWP
> for this moment might be always letting mm fallback to the splitting
> way for MTE hardware. For this moment, I care about THP_SWP more as
> none of my hardware has MTE.
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 45c358538f13..d55a2a3e41a9 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -44,6 +44,8 @@
>         __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> 
> +#define arch_thp_swp_supported !system_supports_mte
> +
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2999190adc22..064b6b03df9e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio,
>         return split_huge_page_to_list(&folio->page, list);
>  }
> 
> +/*
> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> + * limitations in the implementation like arm64 MTE can override this to
> + * false
> + */
> +#ifndef arch_thp_swp_supported
> +static inline bool arch_thp_swp_supported(void)
> +{
> +       return true;
> +}
> +#endif
> +
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 2b5531840583..dde685836328 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page)
>         entry.val = 0;
> 
>         if (PageTransHuge(page)) {
> -               if (IS_ENABLED(CONFIG_THP_SWAP))
> +               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
>                         get_swap_pages(1, &entry, HPAGE_PMD_NR);
>                 goto out;

I think this should work and with your other proposal it would be
limited to MTE pages:

#define arch_thp_swp_supported(page)	(!test_bit(PG_mte_tagged, &page->flags))

Are THP pages loaded from swap as a whole or are they split? IIRC the
splitting still happens but after the swapping out finishes. Even if
they are loaded as 4K pages, we still have the mte_save_tags() that only
understands small pages currently, so rejecting THP pages is probably
best.
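The per-page check endorsed above boils down to a single flag test. A minimal userspace sketch of the idea (the `struct page`, `test_bit` helper, and bit index are simplified stand-ins for the kernel's versions, chosen only to make the model self-contained):

```c
#include <assert.h>
#include <stdbool.h>

#define PG_mte_tagged 18UL            /* illustrative bit index, not the kernel's */

struct page { unsigned long flags; }; /* stand-in for the kernel's struct page */

static bool test_bit(unsigned long nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/*
 * arm64 override: refuse whole-THP swap for MTE-tagged pages, so the
 * core mm falls back to splitting them before swap-out.
 */
static bool arch_thp_swp_supported(struct page *page)
{
	return !test_bit(PG_mte_tagged, &page->flags);
}
```

Untagged THPs would then still take the fast whole-THP swap path; only pages mapped with PROT_MTE would pay the splitting cost.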
Barry Song May 25, 2022, 11:10 a.m. UTC | #5
On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote:
> > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index d550f5acfaf3..8e3771c56fbf 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -98,6 +98,7 @@ config ARM64
> > > >       select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> > > >       select ARCH_WANT_LD_ORPHAN_WARN
> > > >       select ARCH_WANTS_NO_INSTR
> > > > +     select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> > >
> > > I'm not opposed to this but I think it would break pages mapped with
> > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages
> > > are not swapped out (or in). With MTE, we store the tags in a slab
> >
> > I assume you mean mte_sync_tags() require that THP is not swapped as a whole,
> > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop
> > THP from swapping through a couple of splitted pages, does it?
>
> That's correct, split THP page are swapped out/in just fine.
>
> > > object (128-bytes per swapped page) and restore them when pages are
> > > swapped in. At some point we may teach the core swap code about such
> > > metadata but in the meantime that was the easiest way.
> >
> > If my previous assumption is true,  the easiest way to enable THP_SWP
> > for this moment might be always letting mm fallback to the splitting
> > way for MTE hardware. For this moment, I care about THP_SWP more as
> > none of my hardware has MTE.
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 45c358538f13..d55a2a3e41a9 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -44,6 +44,8 @@
> >         __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +#define arch_thp_swp_supported !system_supports_mte
> > +
> >  /*
> >   * Outside of a few very special situations (e.g. hibernation), we always
> >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 2999190adc22..064b6b03df9e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio,
> >         return split_huge_page_to_list(&folio->page, list);
> >  }
> >
> > +/*
> > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > + * limitations in the implementation like arm64 MTE can override this to
> > + * false
> > + */
> > +#ifndef arch_thp_swp_supported
> > +static inline bool arch_thp_swp_supported(void)
> > +{
> > +       return true;
> > +}
> > +#endif
> > +
> >  #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 2b5531840583..dde685836328 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page)
> >         entry.val = 0;
> >
> >         if (PageTransHuge(page)) {
> > -               if (IS_ENABLED(CONFIG_THP_SWAP))
> > +               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> >                         get_swap_pages(1, &entry, HPAGE_PMD_NR);
> >                 goto out;
>
> I think this should work and with your other proposal it would be
> limited to MTE pages:
>
> #define arch_thp_swp_supported(page)    (!test_bit(PG_mte_tagged, &page->flags))
>
> Are THP pages loaded from swap as a whole or are they split? IIRC the

I can confirm a THP is written out as a whole, through this path:
[   90.622863]  __swap_writepage+0xe8/0x580
[   90.622881]  swap_writepage+0x44/0xf8
[   90.622891]  pageout+0xe0/0x2a8
[   90.622906]  shrink_page_list+0x9dc/0xde0
[   90.622917]  shrink_inactive_list+0x1ec/0x3c8
[   90.622928]  shrink_lruvec+0x3dc/0x628
[   90.622939]  shrink_node+0x37c/0x6a0
[   90.622950]  balance_pgdat+0x354/0x668
[   90.622961]  kswapd+0x1e0/0x3c0
[   90.622972]  kthread+0x110/0x120

But I have never got a backtrace in which a THP is loaded as a whole, though
the code seems to have this path:
int swap_readpage(struct page *page, bool synchronous)
{
        ...
        bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
        bio->bi_iter.bi_sector = swap_page_sector(page);
        bio->bi_end_io = end_swap_bio_read;
        bio_add_page(bio, page, thp_size(page), 0);
        ...
        submit_bio(bio);
}


> splitting still happens but after the swapping out finishes. Even if
> they are loaded as 4K pages, we still have the mte_save_tags() that only
> understands small pages currently, so rejecting THP pages is probably
> best.

As I don't have MTE hardware to do a valid test and go any further, I will
totally disable THP_SWP on hardware with MTE for now in patch v2.

>
> --
> Catalin

Thanks
Barry
Catalin Marinas May 25, 2022, 4:54 p.m. UTC | #6
On Wed, May 25, 2022 at 11:10:41PM +1200, Barry Song wrote:
> On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > I think this should work and with your other proposal it would be
> > limited to MTE pages:
> >
> > #define arch_thp_swp_supported(page)    (!test_bit(PG_mte_tagged, &page->flags))
> >
> > Are THP pages loaded from swap as a whole or are they split? IIRC the
> 
> i can confirm thp is written as a whole through:
> [   90.622863]  __swap_writepage+0xe8/0x580
> [   90.622881]  swap_writepage+0x44/0xf8
> [   90.622891]  pageout+0xe0/0x2a8
> [   90.622906]  shrink_page_list+0x9dc/0xde0
> [   90.622917]  shrink_inactive_list+0x1ec/0x3c8
> [   90.622928]  shrink_lruvec+0x3dc/0x628
> [   90.622939]  shrink_node+0x37c/0x6a0
> [   90.622950]  balance_pgdat+0x354/0x668
> [   90.622961]  kswapd+0x1e0/0x3c0
> [   90.622972]  kthread+0x110/0x120
> 
> but i have never got a backtrace in which thp is loaded as a whole though it
> seems the code has this path:
> int swap_readpage(struct page *page, bool synchronous)
> {
>         ...
>         bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
>         bio->bi_iter.bi_sector = swap_page_sector(page);
>         bio->bi_end_io = end_swap_bio_read;
>         bio_add_page(bio, page, thp_size(page), 0);
>         ...
>         submit_bio(bio);
> }
> 
> > splitting still happens but after the swapping out finishes. Even if
> > they are loaded as 4K pages, we still have the mte_save_tags() that only
> > understands small pages currently, so rejecting THP pages is probably
> > best.
> 
> as anyway i don't have a mte-hardware to do a valid test to go any
> further, so i will totally disable thp_swp for hardware having mte for
> this moment in patch v2.

It makes sense. If we decide to improve this for MTE, we'll change the
arch check.

Thanks.
Yang Shi May 25, 2022, 5:49 p.m. UTC | #7
On Wed, May 25, 2022 at 4:10 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote:
> > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > > index d550f5acfaf3..8e3771c56fbf 100644
> > > > > --- a/arch/arm64/Kconfig
> > > > > +++ b/arch/arm64/Kconfig
> > > > > @@ -98,6 +98,7 @@ config ARM64
> > > > >       select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> > > > >       select ARCH_WANT_LD_ORPHAN_WARN
> > > > >       select ARCH_WANTS_NO_INSTR
> > > > > +     select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> > > >
> > > > I'm not opposed to this but I think it would break pages mapped with
> > > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages
> > > > are not swapped out (or in). With MTE, we store the tags in a slab
> > >
> > > I assume you mean mte_sync_tags() require that THP is not swapped as a whole,
> > > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop
> > > THP from swapping through a couple of splitted pages, does it?
> >
> > That's correct, split THP page are swapped out/in just fine.
> >
> > > > object (128-bytes per swapped page) and restore them when pages are
> > > > swapped in. At some point we may teach the core swap code about such
> > > > metadata but in the meantime that was the easiest way.
> > >
> > > If my previous assumption is true,  the easiest way to enable THP_SWP
> > > for this moment might be always letting mm fallback to the splitting
> > > way for MTE hardware. For this moment, I care about THP_SWP more as
> > > none of my hardware has MTE.
> > >
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index 45c358538f13..d55a2a3e41a9 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -44,6 +44,8 @@
> > >         __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > >
> > > +#define arch_thp_swp_supported !system_supports_mte
> > > +
> > >  /*
> > >   * Outside of a few very special situations (e.g. hibernation), we always
> > >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 2999190adc22..064b6b03df9e 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio,
> > >         return split_huge_page_to_list(&folio->page, list);
> > >  }
> > >
> > > +/*
> > > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > > + * limitations in the implementation like arm64 MTE can override this to
> > > + * false
> > > + */
> > > +#ifndef arch_thp_swp_supported
> > > +static inline bool arch_thp_swp_supported(void)
> > > +{
> > > +       return true;
> > > +}
> > > +#endif
> > > +
> > >  #endif /* _LINUX_HUGE_MM_H */
> > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > > index 2b5531840583..dde685836328 100644
> > > --- a/mm/swap_slots.c
> > > +++ b/mm/swap_slots.c
> > > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page)
> > >         entry.val = 0;
> > >
> > >         if (PageTransHuge(page)) {
> > > -               if (IS_ENABLED(CONFIG_THP_SWAP))
> > > +               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > >                         get_swap_pages(1, &entry, HPAGE_PMD_NR);
> > >                 goto out;
> >
> > I think this should work and with your other proposal it would be
> > limited to MTE pages:
> >
> > #define arch_thp_swp_supported(page)    (!test_bit(PG_mte_tagged, &page->flags))
> >
> > Are THP pages loaded from swap as a whole or are they split? IIRC the
>
> i can confirm thp is written as a whole through:
> [   90.622863]  __swap_writepage+0xe8/0x580
> [   90.622881]  swap_writepage+0x44/0xf8
> [   90.622891]  pageout+0xe0/0x2a8
> [   90.622906]  shrink_page_list+0x9dc/0xde0
> [   90.622917]  shrink_inactive_list+0x1ec/0x3c8
> [   90.622928]  shrink_lruvec+0x3dc/0x628
> [   90.622939]  shrink_node+0x37c/0x6a0
> [   90.622950]  balance_pgdat+0x354/0x668
> [   90.622961]  kswapd+0x1e0/0x3c0
> [   90.622972]  kthread+0x110/0x120
>
> but i have never got a backtrace in which thp is loaded as a whole though it
> seems the code has this path:

THP can be swapped out as a whole, but is never swapped in as a THP; just
single base pages (4K on x86) are swapped in.
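The asymmetry described here shows up at swap-slot allocation time: swapping out a THP reserves HPAGE_PMD_NR (512 with 4KB base pages) contiguous slots in one go, while each later swap-in fault consumes a single slot. A toy allocator illustrating this (a simplified model, not mm/swapfile.c's cluster code; the naturally aligned scan is an assumption of the sketch):

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_NR 512  /* 2MB THP / 4KB base pages */
#define NR_SLOTS     4096

static bool slot_used[NR_SLOTS];

/*
 * Find nr_pages contiguous free slots at a naturally aligned offset,
 * mark them used, and return the first index, or -1 if none fit.
 */
static long get_swap_pages(int nr_pages)
{
	for (long base = 0; base + nr_pages <= NR_SLOTS; base += nr_pages) {
		bool all_free = true;

		for (long i = base; i < base + nr_pages; i++) {
			if (slot_used[i]) {
				all_free = false;
				break;
			}
		}
		if (all_free) {
			for (long i = base; i < base + nr_pages; i++)
				slot_used[i] = true;
			return base;
		}
	}
	return -1;
}
```

Swap-out of one THP claims a whole 512-slot cluster; each fault on swap-in then only touches one of those slots, which is why the load side never needs to reassemble a THP.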

> int swap_readpage(struct page *page, bool synchronous)
> {
>         ...
>         bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL);
>         bio->bi_iter.bi_sector = swap_page_sector(page);
>         bio->bi_end_io = end_swap_bio_read;
>         bio_add_page(bio, page, thp_size(page), 0);
>         ...
>         submit_bio(bio);
> }
>
>
> > splitting still happens but after the swapping out finishes. Even if
> > they are loaded as 4K pages, we still have the mte_save_tags() that only
> > understands small pages currently, so rejecting THP pages is probably
> > best.
>
> as anyway i don't have a mte-hardware to do a valid test to go any
> further, so i will totally disable thp_swp for hardware having mte for
> this moment in patch v2.
>
> >
> > --
> > Catalin
>
> Thanks
> Barry
>
Anshuman Khandual May 26, 2022, 8:13 a.m. UTC | #8
On 5/24/22 16:45, Barry Song wrote:
> On Tue, May 24, 2022 at 10:05 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>>>
>>> On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> THP_SWAP has been proved to improve the swap throughput significantly
>>>> on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay
>>>> splitting THP after swapped out").
>>>> As long as arm64 uses 4K page size, it is quite similar with x86_64
>>>> by having 2MB PMD THP. So we are going to get similar improvement.
>>>> For other page sizes such as 16KB and 64KB, PMD might be too large.
>>>> Negative side effects such as IO latency might be a problem. Thus,
>>>> we can only safely enable the counterpart of X86_64.
>>>>
>>>> Cc: "Huang, Ying" <ying.huang@intel.com>
>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Shaohua Li <shli@kernel.org>
>>>> Cc: Rik van Riel <riel@redhat.com>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>> ---
>>>>  arch/arm64/Kconfig | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index d550f5acfaf3..8e3771c56fbf 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -98,6 +98,7 @@ config ARM64
>>>>       select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>>>>       select ARCH_WANT_LD_ORPHAN_WARN
>>>>       select ARCH_WANTS_NO_INSTR
>>>> +     select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>>>
>>> I'm not opposed to this but I think it would break pages mapped with
>>> PROT_MTE. We have an assumption in mte_sync_tags() that compound pages
>>> are not swapped out (or in). With MTE, we store the tags in a slab
>>
>> I assume you mean mte_sync_tags() require that THP is not swapped as a whole,
>> as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop
>> THP from swapping through a couple of splitted pages, does it?
>>
>>> object (128-bytes per swapped page) and restore them when pages are
>>> swapped in. At some point we may teach the core swap code about such
>>> metadata but in the meantime that was the easiest way.
>>>
>>
>> If my previous assumption is true,  the easiest way to enable THP_SWP
>> for this moment
>> might be always letting mm fallback to the splitting way for MTE
>> hardware. For this
>> moment, I care about THP_SWP more as none of my hardware has MTE.
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 45c358538f13..d55a2a3e41a9 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -44,6 +44,8 @@
>>         __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> +#define arch_thp_swp_supported !system_supports_mte
>> +
>>  /*
>>   * Outside of a few very special situations (e.g. hibernation), we always
>>   * use broadcast TLB invalidation instructions, therefore a spurious page
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2999190adc22..064b6b03df9e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio,
>>         return split_huge_page_to_list(&folio->page, list);
>>  }
>>
>> +/*
>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>> + * limitations in the implementation like arm64 MTE can override this to
>> + * false
>> + */
>> +#ifndef arch_thp_swp_supported
>> +static inline bool arch_thp_swp_supported(void)
>> +{
>> +       return true;
>> +}
>> +#endif
>> +
>>  #endif /* _LINUX_HUGE_MM_H */
>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>> index 2b5531840583..dde685836328 100644
>> --- a/mm/swap_slots.c
>> +++ b/mm/swap_slots.c
>> @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page)
>>         entry.val = 0;
>>
>>         if (PageTransHuge(page)) {
>> -               if (IS_ENABLED(CONFIG_THP_SWAP))
>> +               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
>>                         get_swap_pages(1, &entry, HPAGE_PMD_NR);
>>                 goto out;
>>         }
>>
> 
> Am I actually able to go further and only split MTE-tagged pages?
> 
> For mm core:
> 
> +/*
> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> + * limitations in the implementation like arm64 MTE can override this to
> + * false
> + */
> +#ifndef arch_thp_swp_supported
> +static inline bool arch_thp_swp_supported(struct page *page)
> +{
> +       return true;
> +}
> +#endif
> +
> 
> For arm64:
> +#define arch_thp_swp_supported(page) !test_bit(PG_mte_tagged, &page->flags)

Though not entirely sure, a per-page arch_thp_swp_supported() callback
seems a bit risky. What if there are scenarios or time windows in which
PG_mte_tagged is cleared on an otherwise MTE-tagged page? I guess having
arch_thp_swp_supported() just return false on a system with MTE support
is the better option.

> 
> But I don't have MTE hardware to test. So to me, totally disabling THP_SWP
> is safer.
> 
> thoughts?
>>> --
>>> Catalin
>>
>> Thanks
>> Barry
>
Barry Song May 26, 2022, 9:19 a.m. UTC | #9
On Thu, May 26, 2022 at 5:49 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, May 25, 2022 at 4:10 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > >
> > > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote:
> > > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> > > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > > > index d550f5acfaf3..8e3771c56fbf 100644
> > > > > > --- a/arch/arm64/Kconfig
> > > > > > +++ b/arch/arm64/Kconfig
> > > > > > @@ -98,6 +98,7 @@ config ARM64
> > > > > >       select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> > > > > >       select ARCH_WANT_LD_ORPHAN_WARN
> > > > > >       select ARCH_WANTS_NO_INSTR
> > > > > > +     select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> > > > >
> > > > > I'm not opposed to this but I think it would break pages mapped with
> > > > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages
> > > > > are not swapped out (or in). With MTE, we store the tags in a slab
> > > >
> > > > I assume you mean mte_sync_tags() requires that THP is not swapped as a whole,
> > > > as without THP_SWP, THP is still swapped after being split. MTE doesn't stop
> > > > THP from being swapped as a number of split pages, does it?
> > >
> > > That's correct, split THP pages are swapped out/in just fine.
> > >
> > > > > object (128-bytes per swapped page) and restore them when pages are
> > > > > swapped in. At some point we may teach the core swap code about such
> > > > > metadata but in the meantime that was the easiest way.
> > > >
> > > > If my previous assumption is true, the easiest way to enable THP_SWP
> > > > for now might be to always let mm fall back to the splitting path on
> > > > MTE hardware. For now, I care about THP_SWP more as none of my
> > > > hardware has MTE.
> > > >
> > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > > index 45c358538f13..d55a2a3e41a9 100644
> > > > --- a/arch/arm64/include/asm/pgtable.h
> > > > +++ b/arch/arm64/include/asm/pgtable.h
> > > > @@ -44,6 +44,8 @@
> > > >         __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > > >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > > >
> > > > +#define arch_thp_swp_supported !system_supports_mte
> > > > +
> > > >  /*
> > > >   * Outside of a few very special situations (e.g. hibernation), we always
> > > >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > index 2999190adc22..064b6b03df9e 100644
> > > > --- a/include/linux/huge_mm.h
> > > > +++ b/include/linux/huge_mm.h
> > > > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio,
> > > >         return split_huge_page_to_list(&folio->page, list);
> > > >  }
> > > >
> > > > +/*
> > > > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > > > + * limitations in the implementation like arm64 MTE can override this to
> > > > + * false
> > > > + */
> > > > +#ifndef arch_thp_swp_supported
> > > > +static inline bool arch_thp_swp_supported(void)
> > > > +{
> > > > +       return true;
> > > > +}
> > > > +#endif
> > > > +
> > > >  #endif /* _LINUX_HUGE_MM_H */
> > > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > > > index 2b5531840583..dde685836328 100644
> > > > --- a/mm/swap_slots.c
> > > > +++ b/mm/swap_slots.c
> > > > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page)
> > > >         entry.val = 0;
> > > >
> > > >         if (PageTransHuge(page)) {
> > > > -               if (IS_ENABLED(CONFIG_THP_SWAP))
> > > > +               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > > >                         get_swap_pages(1, &entry, HPAGE_PMD_NR);
> > > >                 goto out;
> > >
> > > I think this should work and with your other proposal it would be
> > > limited to MTE pages:
> > >
> > > #define arch_thp_swp_supported(page)    (!test_bit(PG_mte_tagged, &page->flags))
> > >
> > > Are THP pages loaded from swap as a whole or are they split? IIRC the
> >
> > I can confirm a THP is written out as a whole, through:
> > [   90.622863]  __swap_writepage+0xe8/0x580
> > [   90.622881]  swap_writepage+0x44/0xf8
> > [   90.622891]  pageout+0xe0/0x2a8
> > [   90.622906]  shrink_page_list+0x9dc/0xde0
> > [   90.622917]  shrink_inactive_list+0x1ec/0x3c8
> > [   90.622928]  shrink_lruvec+0x3dc/0x628
> > [   90.622939]  shrink_node+0x37c/0x6a0
> > [   90.622950]  balance_pgdat+0x354/0x668
> > [   90.622961]  kswapd+0x1e0/0x3c0
> > [   90.622972]  kthread+0x110/0x120
> >
> > but I have never gotten a backtrace in which a THP is loaded as a whole,
> > though it seems the code has this path:
>
> A THP can be swapped out as a whole, but it is never swapped in as a THP.
> Just the single base page (4K on x86) is swapped in.

Yep. It seems swapin_readahead() never reads in a THP, or even split
pages, for this 2MB THP.

The number of pages to be read ahead is determined either by
/proc/sys/vm/page-cluster if /sys/kernel/mm/swap/vma_ra_enabled is false,
or by the vma read-ahead algorithm if /sys/kernel/mm/swap/vma_ra_enabled
is true, and the number is usually quite small.

Am I missing any case in which 2MB can be swapped in as a whole, either as
split pages or as a THP?

Thanks
Barry
Yang Shi May 26, 2022, 5:02 p.m. UTC | #10
On Thu, May 26, 2022 at 2:19 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, May 26, 2022 at 5:49 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, May 25, 2022 at 4:10 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > >
> > > > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote:
> > > > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> > > > > > [...]
> >
> > A THP can be swapped out as a whole, but it is never swapped in as a THP.
> > Just the single base page (4K on x86) is swapped in.
>
> Yep. It seems swapin_readahead() never reads in a THP, or even split
> pages, for this 2MB THP.
>
> The number of pages to be read ahead is determined either by
> /proc/sys/vm/page-cluster if /sys/kernel/mm/swap/vma_ra_enabled is false,
> or by the vma read-ahead algorithm if /sys/kernel/mm/swap/vma_ra_enabled
> is true, and the number is usually quite small.
>
> Am I missing any case in which 2MB can be swapped in as a whole, either as
> split pages or as a THP?

Even though readahead swaps in 2MB, they are 512 single base pages
rather than a THP. They may not be physically contiguous at all.

>
> Thanks
> Barry
Barry Song May 27, 2022, 7:29 a.m. UTC | #11
On Fri, May 27, 2022 at 5:03 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, May 26, 2022 at 2:19 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, May 26, 2022 at 5:49 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Wed, May 25, 2022 at 4:10 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > >
> > > > > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote:
> > > > > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote:
> > > > > > > [...]
> > > A THP can be swapped out as a whole, but it is never swapped in as a THP.
> > > Just the single base page (4K on x86) is swapped in.
> >
> > Yep. It seems swapin_readahead() never reads in a THP, or even split
> > pages, for this 2MB THP.
> >
> > The number of pages to be read ahead is determined either by
> > /proc/sys/vm/page-cluster if /sys/kernel/mm/swap/vma_ra_enabled is false,
> > or by the vma read-ahead algorithm if /sys/kernel/mm/swap/vma_ra_enabled
> > is true, and the number is usually quite small.
> >
> > Am I missing any case in which 2MB can be swapped in as a whole, either as
> > split pages or as a THP?
>
> Even though readahead swaps in 2MB, they are 512 single base pages
> rather than a THP. They may not be physically contiguous at all.

I actually haven't observed that readahead can swap in 2MB, either as a
THP or as 512 single base pages. Per my log, swapin_vma_readahead()
usually swaps in 2, 3, 4 or 8 pages.

but we do have a case in which we can swap in up to 2MB while doing
collapse:
static bool __collapse_huge_page_swapin(struct mm_struct *mm,
                                        struct vm_area_struct *vma,
                                        unsigned long haddr, pmd_t *pmd,
                                        int referenced)
{
        int swapped_in = 0;
        vm_fault_t ret = 0;
        unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);

        for (address = haddr; address < end; address += PAGE_SIZE) {
                struct vm_fault vmf = {
                        .vma = vma,
                        .address = address,
                        .pgoff = linear_page_index(vma, haddr),
                        .flags = FAULT_FLAG_ALLOW_RETRY,
                        .pmd = pmd,
                };

                vmf.pte = pte_offset_map(pmd, address);
                vmf.orig_pte = *vmf.pte;
                if (!is_swap_pte(vmf.orig_pte)) {
                        pte_unmap(vmf.pte);
                        continue;
                }
                swapped_in++;
                ret = do_swap_page(&vmf);

                ...}
        }

}

It seems Huang Ying once mentioned there was a plan to not split THP
throughout the whole process.

Thanks
Barry

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d550f5acfaf3..8e3771c56fbf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -98,6 +98,7 @@  config ARM64
 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_NO_INSTR
+	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARM_AMBA
 	select ARM_ARCH_TIMER