Message ID | 20220524071403.128644-1-21cnbao@gmail.com (mailing list archive) |
---|---|
State | New |
Series | arm64: enable THP_SWAP for arm64 |
On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > THP_SWAP has been proved to improve the swap throughput significantly > on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay > splitting THP after swapped out"). > As long as arm64 uses 4K page size, it is quite similar with x86_64 > by having 2MB PMD THP. So we are going to get similar improvement. > For other page sizes such as 16KB and 64KB, PMD might be too large. > Negative side effects such as IO latency might be a problem. Thus, > we can only safely enable the counterpart of X86_64. > > Cc: "Huang, Ying" <ying.huang@intel.com> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Hugh Dickins <hughd@google.com> > Cc: Shaohua Li <shli@kernel.org> > Cc: Rik van Riel <riel@redhat.com> > Cc: Andrea Arcangeli <aarcange@redhat.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > arch/arm64/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index d550f5acfaf3..8e3771c56fbf 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -98,6 +98,7 @@ config ARM64 > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > select ARCH_WANT_LD_ORPHAN_WARN > select ARCH_WANTS_NO_INSTR > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES I'm not opposed to this but I think it would break pages mapped with PROT_MTE. We have an assumption in mte_sync_tags() that compound pages are not swapped out (or in). With MTE, we store the tags in a slab object (128-bytes per swapped page) and restore them when pages are swapped in. At some point we may teach the core swap code about such metadata but in the meantime that was the easiest way.
On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > > > THP_SWAP has been proved to improve the swap throughput significantly > > on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay > > splitting THP after swapped out"). > > As long as arm64 uses 4K page size, it is quite similar with x86_64 > > by having 2MB PMD THP. So we are going to get similar improvement. > > For other page sizes such as 16KB and 64KB, PMD might be too large. > > Negative side effects such as IO latency might be a problem. Thus, > > we can only safely enable the counterpart of X86_64. > > > > Cc: "Huang, Ying" <ying.huang@intel.com> > > Cc: Minchan Kim <minchan@kernel.org> > > Cc: Johannes Weiner <hannes@cmpxchg.org> > > Cc: Hugh Dickins <hughd@google.com> > > Cc: Shaohua Li <shli@kernel.org> > > Cc: Rik van Riel <riel@redhat.com> > > Cc: Andrea Arcangeli <aarcange@redhat.com> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > --- > > arch/arm64/Kconfig | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > index d550f5acfaf3..8e3771c56fbf 100644 > > --- a/arch/arm64/Kconfig > > +++ b/arch/arm64/Kconfig > > @@ -98,6 +98,7 @@ config ARM64 > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > select ARCH_WANT_LD_ORPHAN_WARN > > select ARCH_WANTS_NO_INSTR > > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > I'm not opposed to this but I think it would break pages mapped with > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages > are not swapped out (or in). With MTE, we store the tags in a slab I assume you mean mte_sync_tags() require that THP is not swapped as a whole, as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop THP from swapping through a couple of splitted pages, does it? 
> object (128-bytes per swapped page) and restore them when pages are > swapped in. At some point we may teach the core swap code about such > metadata but in the meantime that was the easiest way. > If my previous assumption is true, the easiest way to enable THP_SWP for this moment might be always letting mm fallback to the splitting way for MTE hardware. For this moment, I care about THP_SWP more as none of my hardware has MTE. diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 45c358538f13..d55a2a3e41a9 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -44,6 +44,8 @@ __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#define arch_thp_swp_supported !system_supports_mte + /* * Outside of a few very special situations (e.g. hibernation), we always * use broadcast TLB invalidation instructions, therefore a spurious page diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2999190adc22..064b6b03df9e 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, return split_huge_page_to_list(&folio->page, list); } +/* + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to + * limitations in the implementation like arm64 MTE can override this to + * false + */ +#ifndef arch_thp_swp_supported +static inline bool arch_thp_swp_supported(void) +{ + return true; +} +#endif + #endif /* _LINUX_HUGE_MM_H */ diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 2b5531840583..dde685836328 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page) entry.val = 0; if (PageTransHuge(page)) { - if (IS_ENABLED(CONFIG_THP_SWAP)) + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) get_swap_pages(1, &entry, HPAGE_PMD_NR); goto out; } > -- > Catalin Thanks Barry
On Tue, May 24, 2022 at 10:05 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > > THP_SWAP has been proved to improve the swap throughput significantly > > > on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay > > > splitting THP after swapped out"). > > > As long as arm64 uses 4K page size, it is quite similar with x86_64 > > > by having 2MB PMD THP. So we are going to get similar improvement. > > > For other page sizes such as 16KB and 64KB, PMD might be too large. > > > Negative side effects such as IO latency might be a problem. Thus, > > > we can only safely enable the counterpart of X86_64. > > > > > > Cc: "Huang, Ying" <ying.huang@intel.com> > > > Cc: Minchan Kim <minchan@kernel.org> > > > Cc: Johannes Weiner <hannes@cmpxchg.org> > > > Cc: Hugh Dickins <hughd@google.com> > > > Cc: Shaohua Li <shli@kernel.org> > > > Cc: Rik van Riel <riel@redhat.com> > > > Cc: Andrea Arcangeli <aarcange@redhat.com> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > > --- > > > arch/arm64/Kconfig | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > > index d550f5acfaf3..8e3771c56fbf 100644 > > > --- a/arch/arm64/Kconfig > > > +++ b/arch/arm64/Kconfig > > > @@ -98,6 +98,7 @@ config ARM64 > > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > > select ARCH_WANT_LD_ORPHAN_WARN > > > select ARCH_WANTS_NO_INSTR > > > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > > > I'm not opposed to this but I think it would break pages mapped with > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages > > are not swapped out (or in). 
With MTE, we store the tags in a slab > > I assume you mean mte_sync_tags() require that THP is not swapped as a whole, > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop > THP from swapping through a couple of splitted pages, does it? > > > object (128-bytes per swapped page) and restore them when pages are > > swapped in. At some point we may teach the core swap code about such > > metadata but in the meantime that was the easiest way. > > > > If my previous assumption is true, the easiest way to enable THP_SWP > for this moment > might be always letting mm fallback to the splitting way for MTE > hardware. For this > moment, I care about THP_SWP more as none of my hardware has MTE. > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index 45c358538f13..d55a2a3e41a9 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -44,6 +44,8 @@ > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > +#define arch_thp_swp_supported !system_supports_mte > + > /* > * Outside of a few very special situations (e.g. 
hibernation), we always > * use broadcast TLB invalidation instructions, therefore a spurious page > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 2999190adc22..064b6b03df9e 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, > return split_huge_page_to_list(&folio->page, list); > } > > +/* > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to > + * limitations in the implementation like arm64 MTE can override this to > + * false > + */ > +#ifndef arch_thp_swp_supported > +static inline bool arch_thp_swp_supported(void) > +{ > + return true; > +} > +#endif > + > #endif /* _LINUX_HUGE_MM_H */ > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > index 2b5531840583..dde685836328 100644 > --- a/mm/swap_slots.c > +++ b/mm/swap_slots.c > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page) > entry.val = 0; > > if (PageTransHuge(page)) { > - if (IS_ENABLED(CONFIG_THP_SWAP)) > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > get_swap_pages(1, &entry, HPAGE_PMD_NR); > goto out; > } > Am I actually able to go further to only split MTE tagged pages? For mm core: +/* + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to + * limitations in the implementation like arm64 MTE can override this to + * false + */ +#ifndef arch_thp_swp_supported +static inline bool arch_thp_swp_supported(struct page *page) +{ + return true; +} +#endif + For arm64: +#define arch_thp_swp_supported(page) !test_bit(PG_mte_tagged, &page->flags) But I don't have MTE hardware to test. So to me, totally disabling THP_SWP is safer. thoughts? > > -- > > Catalin > > Thanks > Barry
On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote: > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > > index d550f5acfaf3..8e3771c56fbf 100644 > > > --- a/arch/arm64/Kconfig > > > +++ b/arch/arm64/Kconfig > > > @@ -98,6 +98,7 @@ config ARM64 > > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > > select ARCH_WANT_LD_ORPHAN_WARN > > > select ARCH_WANTS_NO_INSTR > > > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > > > I'm not opposed to this but I think it would break pages mapped with > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages > > are not swapped out (or in). With MTE, we store the tags in a slab > > I assume you mean mte_sync_tags() require that THP is not swapped as a whole, > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop > THP from swapping through a couple of splitted pages, does it? That's correct, split THP page are swapped out/in just fine. > > object (128-bytes per swapped page) and restore them when pages are > > swapped in. At some point we may teach the core swap code about such > > metadata but in the meantime that was the easiest way. > > If my previous assumption is true, the easiest way to enable THP_SWP > for this moment might be always letting mm fallback to the splitting > way for MTE hardware. For this moment, I care about THP_SWP more as > none of my hardware has MTE. 
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index 45c358538f13..d55a2a3e41a9 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -44,6 +44,8 @@ > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > +#define arch_thp_swp_supported !system_supports_mte > + > /* > * Outside of a few very special situations (e.g. hibernation), we always > * use broadcast TLB invalidation instructions, therefore a spurious page > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 2999190adc22..064b6b03df9e 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, > return split_huge_page_to_list(&folio->page, list); > } > > +/* > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to > + * limitations in the implementation like arm64 MTE can override this to > + * false > + */ > +#ifndef arch_thp_swp_supported > +static inline bool arch_thp_swp_supported(void) > +{ > + return true; > +} > +#endif > + > #endif /* _LINUX_HUGE_MM_H */ > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > index 2b5531840583..dde685836328 100644 > --- a/mm/swap_slots.c > +++ b/mm/swap_slots.c > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page) > entry.val = 0; > > if (PageTransHuge(page)) { > - if (IS_ENABLED(CONFIG_THP_SWAP)) > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > get_swap_pages(1, &entry, HPAGE_PMD_NR); > goto out; I think this should work and with your other proposal it would be limited to MTE pages: #define arch_thp_swp_supported(page) (!test_bit(PG_mte_tagged, &page->flags)) Are THP pages loaded from swap as a whole or are they split? IIRC the splitting still happens but after the swapping out finishes. 
Even if they are loaded as 4K pages, we still have mte_save_tags(), which currently only understands small pages, so rejecting THP pages is probably best.
On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote: > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > > > index d550f5acfaf3..8e3771c56fbf 100644 > > > > --- a/arch/arm64/Kconfig > > > > +++ b/arch/arm64/Kconfig > > > > @@ -98,6 +98,7 @@ config ARM64 > > > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > > > select ARCH_WANT_LD_ORPHAN_WARN > > > > select ARCH_WANTS_NO_INSTR > > > > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > > > > > I'm not opposed to this but I think it would break pages mapped with > > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages > > > are not swapped out (or in). With MTE, we store the tags in a slab > > > > I assume you mean mte_sync_tags() require that THP is not swapped as a whole, > > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop > > THP from swapping through a couple of splitted pages, does it? > > That's correct, split THP page are swapped out/in just fine. > > > > object (128-bytes per swapped page) and restore them when pages are > > > swapped in. At some point we may teach the core swap code about such > > > metadata but in the meantime that was the easiest way. > > > > If my previous assumption is true, the easiest way to enable THP_SWP > > for this moment might be always letting mm fallback to the splitting > > way for MTE hardware. For this moment, I care about THP_SWP more as > > none of my hardware has MTE. 
> > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > > index 45c358538f13..d55a2a3e41a9 100644 > > --- a/arch/arm64/include/asm/pgtable.h > > +++ b/arch/arm64/include/asm/pgtable.h > > @@ -44,6 +44,8 @@ > > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) > > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > > > +#define arch_thp_swp_supported !system_supports_mte > > + > > /* > > * Outside of a few very special situations (e.g. hibernation), we always > > * use broadcast TLB invalidation instructions, therefore a spurious page > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > index 2999190adc22..064b6b03df9e 100644 > > --- a/include/linux/huge_mm.h > > +++ b/include/linux/huge_mm.h > > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, > > return split_huge_page_to_list(&folio->page, list); > > } > > > > +/* > > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to > > + * limitations in the implementation like arm64 MTE can override this to > > + * false > > + */ > > +#ifndef arch_thp_swp_supported > > +static inline bool arch_thp_swp_supported(void) > > +{ > > + return true; > > +} > > +#endif > > + > > #endif /* _LINUX_HUGE_MM_H */ > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > > index 2b5531840583..dde685836328 100644 > > --- a/mm/swap_slots.c > > +++ b/mm/swap_slots.c > > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page) > > entry.val = 0; > > > > if (PageTransHuge(page)) { > > - if (IS_ENABLED(CONFIG_THP_SWAP)) > > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > > get_swap_pages(1, &entry, HPAGE_PMD_NR); > > goto out; > > I think this should work and with your other proposal it would be > limited to MTE pages: > > #define arch_thp_swp_supported(page) (!test_bit(PG_mte_tagged, &page->flags)) > > Are THP pages loaded from swap as a whole or are they split? 
IIRC the i can confirm thp is written as a whole through: [ 90.622863] __swap_writepage+0xe8/0x580 [ 90.622881] swap_writepage+0x44/0xf8 [ 90.622891] pageout+0xe0/0x2a8 [ 90.622906] shrink_page_list+0x9dc/0xde0 [ 90.622917] shrink_inactive_list+0x1ec/0x3c8 [ 90.622928] shrink_lruvec+0x3dc/0x628 [ 90.622939] shrink_node+0x37c/0x6a0 [ 90.622950] balance_pgdat+0x354/0x668 [ 90.622961] kswapd+0x1e0/0x3c0 [ 90.622972] kthread+0x110/0x120 but i have never got a backtrace in which thp is loaded as a whole though it seems the code has this path: int swap_readpage(struct page *page, bool synchronous) { ... bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL); bio->bi_iter.bi_sector = swap_page_sector(page); bio->bi_end_io = end_swap_bio_read; bio_add_page(bio, page, thp_size(page), 0); ... submit_bio(bio); } > splitting still happens but after the swapping out finishes. Even if > they are loaded as 4K pages, we still have the mte_save_tags() that only > understands small pages currently, so rejecting THP pages is probably > best. as anyway i don't have a mte-hardware to do a valid test to go any further, so i will totally disable thp_swp for hardware having mte for this moment in patch v2. > > -- > Catalin Thanks Barry
On Wed, May 25, 2022 at 11:10:41PM +1200, Barry Song wrote: > On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > I think this should work and with your other proposal it would be > > limited to MTE pages: > > > > #define arch_thp_swp_supported(page) (!test_bit(PG_mte_tagged, &page->flags)) > > > > Are THP pages loaded from swap as a whole or are they split? IIRC the > > i can confirm thp is written as a whole through: > [ 90.622863] __swap_writepage+0xe8/0x580 > [ 90.622881] swap_writepage+0x44/0xf8 > [ 90.622891] pageout+0xe0/0x2a8 > [ 90.622906] shrink_page_list+0x9dc/0xde0 > [ 90.622917] shrink_inactive_list+0x1ec/0x3c8 > [ 90.622928] shrink_lruvec+0x3dc/0x628 > [ 90.622939] shrink_node+0x37c/0x6a0 > [ 90.622950] balance_pgdat+0x354/0x668 > [ 90.622961] kswapd+0x1e0/0x3c0 > [ 90.622972] kthread+0x110/0x120 > > but i have never got a backtrace in which thp is loaded as a whole though it > seems the code has this path: > int swap_readpage(struct page *page, bool synchronous) > { > ... > bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL); > bio->bi_iter.bi_sector = swap_page_sector(page); > bio->bi_end_io = end_swap_bio_read; > bio_add_page(bio, page, thp_size(page), 0); > ... > submit_bio(bio); > } > > > splitting still happens but after the swapping out finishes. Even if > > they are loaded as 4K pages, we still have the mte_save_tags() that only > > understands small pages currently, so rejecting THP pages is probably > > best. > > as anyway i don't have a mte-hardware to do a valid test to go any > further, so i will totally disable thp_swp for hardware having mte for > this moment in patch v2. It makes sense. If we decide to improve this for MTE, we'll change the arch check. Thanks.
On Wed, May 25, 2022 at 4:10 AM Barry Song <21cnbao@gmail.com> wrote: > > On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote: > > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > > > > index d550f5acfaf3..8e3771c56fbf 100644 > > > > > --- a/arch/arm64/Kconfig > > > > > +++ b/arch/arm64/Kconfig > > > > > @@ -98,6 +98,7 @@ config ARM64 > > > > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > > > > select ARCH_WANT_LD_ORPHAN_WARN > > > > > select ARCH_WANTS_NO_INSTR > > > > > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > > > > > > > I'm not opposed to this but I think it would break pages mapped with > > > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages > > > > are not swapped out (or in). With MTE, we store the tags in a slab > > > > > > I assume you mean mte_sync_tags() require that THP is not swapped as a whole, > > > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop > > > THP from swapping through a couple of splitted pages, does it? > > > > That's correct, split THP page are swapped out/in just fine. > > > > > > object (128-bytes per swapped page) and restore them when pages are > > > > swapped in. At some point we may teach the core swap code about such > > > > metadata but in the meantime that was the easiest way. > > > > > > If my previous assumption is true, the easiest way to enable THP_SWP > > > for this moment might be always letting mm fallback to the splitting > > > way for MTE hardware. For this moment, I care about THP_SWP more as > > > none of my hardware has MTE. 
> > > > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > > > index 45c358538f13..d55a2a3e41a9 100644 > > > --- a/arch/arm64/include/asm/pgtable.h > > > +++ b/arch/arm64/include/asm/pgtable.h > > > @@ -44,6 +44,8 @@ > > > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) > > > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > > > > > +#define arch_thp_swp_supported !system_supports_mte > > > + > > > /* > > > * Outside of a few very special situations (e.g. hibernation), we always > > > * use broadcast TLB invalidation instructions, therefore a spurious page > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > > index 2999190adc22..064b6b03df9e 100644 > > > --- a/include/linux/huge_mm.h > > > +++ b/include/linux/huge_mm.h > > > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, > > > return split_huge_page_to_list(&folio->page, list); > > > } > > > > > > +/* > > > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to > > > + * limitations in the implementation like arm64 MTE can override this to > > > + * false > > > + */ > > > +#ifndef arch_thp_swp_supported > > > +static inline bool arch_thp_swp_supported(void) > > > +{ > > > + return true; > > > +} > > > +#endif > > > + > > > #endif /* _LINUX_HUGE_MM_H */ > > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > > > index 2b5531840583..dde685836328 100644 > > > --- a/mm/swap_slots.c > > > +++ b/mm/swap_slots.c > > > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page) > > > entry.val = 0; > > > > > > if (PageTransHuge(page)) { > > > - if (IS_ENABLED(CONFIG_THP_SWAP)) > > > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > > > get_swap_pages(1, &entry, HPAGE_PMD_NR); > > > goto out; > > > > I think this should work and with your other proposal it would be > > limited to MTE pages: > > > > #define arch_thp_swp_supported(page) (!test_bit(PG_mte_tagged, &page->flags)) > > > > Are THP pages 
loaded from swap as a whole or are they split? IIRC the > > i can confirm thp is written as a whole through: > [ 90.622863] __swap_writepage+0xe8/0x580 > [ 90.622881] swap_writepage+0x44/0xf8 > [ 90.622891] pageout+0xe0/0x2a8 > [ 90.622906] shrink_page_list+0x9dc/0xde0 > [ 90.622917] shrink_inactive_list+0x1ec/0x3c8 > [ 90.622928] shrink_lruvec+0x3dc/0x628 > [ 90.622939] shrink_node+0x37c/0x6a0 > [ 90.622950] balance_pgdat+0x354/0x668 > [ 90.622961] kswapd+0x1e0/0x3c0 > [ 90.622972] kthread+0x110/0x120 > > but i have never got a backtrace in which thp is loaded as a whole though it > seems the code has this path: THP could be swapped out in a whole, but never swapped in as THP. Just the single base page (4K on x86) is swapped in. > int swap_readpage(struct page *page, bool synchronous) > { > ... > bio = bio_alloc(sis->bdev, 1, REQ_OP_READ, GFP_KERNEL); > bio->bi_iter.bi_sector = swap_page_sector(page); > bio->bi_end_io = end_swap_bio_read; > bio_add_page(bio, page, thp_size(page), 0); > ... > submit_bio(bio); > } > > > > splitting still happens but after the swapping out finishes. Even if > > they are loaded as 4K pages, we still have the mte_save_tags() that only > > understands small pages currently, so rejecting THP pages is probably > > best. > > as anyway i don't have a mte-hardware to do a valid test to go any > further, so i will totally disable thp_swp for hardware having mte for > this moment in patch v2. > > > > > -- > > Catalin > > Thanks > Barry >
On 5/24/22 16:45, Barry Song wrote: > On Tue, May 24, 2022 at 10:05 PM Barry Song <21cnbao@gmail.com> wrote: >> >> On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: >>> >>> On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: >>>> From: Barry Song <v-songbaohua@oppo.com> >>>> >>>> THP_SWAP has been proved to improve the swap throughput significantly >>>> on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay >>>> splitting THP after swapped out"). >>>> As long as arm64 uses 4K page size, it is quite similar with x86_64 >>>> by having 2MB PMD THP. So we are going to get similar improvement. >>>> For other page sizes such as 16KB and 64KB, PMD might be too large. >>>> Negative side effects such as IO latency might be a problem. Thus, >>>> we can only safely enable the counterpart of X86_64. >>>> >>>> Cc: "Huang, Ying" <ying.huang@intel.com> >>>> Cc: Minchan Kim <minchan@kernel.org> >>>> Cc: Johannes Weiner <hannes@cmpxchg.org> >>>> Cc: Hugh Dickins <hughd@google.com> >>>> Cc: Shaohua Li <shli@kernel.org> >>>> Cc: Rik van Riel <riel@redhat.com> >>>> Cc: Andrea Arcangeli <aarcange@redhat.com> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>>> --- >>>> arch/arm64/Kconfig | 1 + >>>> 1 file changed, 1 insertion(+) >>>> >>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>>> index d550f5acfaf3..8e3771c56fbf 100644 >>>> --- a/arch/arm64/Kconfig >>>> +++ b/arch/arm64/Kconfig >>>> @@ -98,6 +98,7 @@ config ARM64 >>>> select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) >>>> select ARCH_WANT_LD_ORPHAN_WARN >>>> select ARCH_WANTS_NO_INSTR >>>> + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES >>> >>> I'm not opposed to this but I think it would break pages mapped with >>> PROT_MTE. We have an assumption in mte_sync_tags() that compound pages >>> are not swapped out (or in). 
With MTE, we store the tags in a slab >> >> I assume you mean mte_sync_tags() require that THP is not swapped as a whole, >> as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop >> THP from swapping through a couple of splitted pages, does it? >> >>> object (128-bytes per swapped page) and restore them when pages are >>> swapped in. At some point we may teach the core swap code about such >>> metadata but in the meantime that was the easiest way. >>> >> >> If my previous assumption is true, the easiest way to enable THP_SWP >> for this moment >> might be always letting mm fallback to the splitting way for MTE >> hardware. For this >> moment, I care about THP_SWP more as none of my hardware has MTE. >> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index 45c358538f13..d55a2a3e41a9 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -44,6 +44,8 @@ >> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) >> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> >> +#define arch_thp_swp_supported !system_supports_mte >> + >> /* >> * Outside of a few very special situations (e.g. 
hibernation), we always >> * use broadcast TLB invalidation instructions, therefore a spurious page >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >> index 2999190adc22..064b6b03df9e 100644 >> --- a/include/linux/huge_mm.h >> +++ b/include/linux/huge_mm.h >> @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, >> return split_huge_page_to_list(&folio->page, list); >> } >> >> +/* >> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to >> + * limitations in the implementation like arm64 MTE can override this to >> + * false >> + */ >> +#ifndef arch_thp_swp_supported >> +static inline bool arch_thp_swp_supported(void) >> +{ >> + return true; >> +} >> +#endif >> + >> #endif /* _LINUX_HUGE_MM_H */ >> diff --git a/mm/swap_slots.c b/mm/swap_slots.c >> index 2b5531840583..dde685836328 100644 >> --- a/mm/swap_slots.c >> +++ b/mm/swap_slots.c >> @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page) >> entry.val = 0; >> >> if (PageTransHuge(page)) { >> - if (IS_ENABLED(CONFIG_THP_SWAP)) >> + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) >> get_swap_pages(1, &entry, HPAGE_PMD_NR); >> goto out; >> } >> > > Am I actually able to go further to only split MTE tagged pages? > > For mm core: > > +/* > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to > + * limitations in the implementation like arm64 MTE can override this to > + * false > + */ > +#ifndef arch_thp_swp_supported > +static inline bool arch_thp_swp_supported(struct page *page) > +{ > + return true; > +} > +#endif > + > > For arm64: > +#define arch_thp_swp_supported(page) !test_bit(PG_mte_tagged, &page->flags) Although not entirely sure, but per page arch_thp_swp_supported() callback seems bit risky. What if there scenarios or time windows when PG_mte_tagged is cleared on an otherwise MTE tagged page ? I guess arch_thp_swp_supported() just returning false on a system with MTE support, is a better option. 
> > But I don't have MTE hardware to test. So to me, totally disabling THP_SWP > is safer. > > thoughts? >>> -- >>> Catalin >> >> Thanks >> Barry >
On Thu, May 26, 2022 at 5:49 AM Yang Shi <shy828301@gmail.com> wrote: > > On Wed, May 25, 2022 at 4:10 AM Barry Song <21cnbao@gmail.com> wrote: > > > > On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote: > > > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > > > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > > > > > index d550f5acfaf3..8e3771c56fbf 100644 > > > > > > --- a/arch/arm64/Kconfig > > > > > > +++ b/arch/arm64/Kconfig > > > > > > @@ -98,6 +98,7 @@ config ARM64 > > > > > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > > > > > select ARCH_WANT_LD_ORPHAN_WARN > > > > > > select ARCH_WANTS_NO_INSTR > > > > > > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > > > > > > > > > I'm not opposed to this but I think it would break pages mapped with > > > > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages > > > > > are not swapped out (or in). With MTE, we store the tags in a slab > > > > > > > > I assume you mean mte_sync_tags() require that THP is not swapped as a whole, > > > > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop > > > > THP from swapping through a couple of splitted pages, does it? > > > > > > That's correct, split THP page are swapped out/in just fine. > > > > > > > > object (128-bytes per swapped page) and restore them when pages are > > > > > swapped in. At some point we may teach the core swap code about such > > > > > metadata but in the meantime that was the easiest way. > > > > > > > > If my previous assumption is true, the easiest way to enable THP_SWP > > > > for this moment might be always letting mm fallback to the splitting > > > > way for MTE hardware. 
For this moment, I care about THP_SWP more as > > > > none of my hardware has MTE. > > > > > > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > > > > index 45c358538f13..d55a2a3e41a9 100644 > > > > --- a/arch/arm64/include/asm/pgtable.h > > > > +++ b/arch/arm64/include/asm/pgtable.h > > > > @@ -44,6 +44,8 @@ > > > > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) > > > > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > > > > > > > +#define arch_thp_swp_supported !system_supports_mte > > > > + > > > > /* > > > > * Outside of a few very special situations (e.g. hibernation), we always > > > > * use broadcast TLB invalidation instructions, therefore a spurious page > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > > > index 2999190adc22..064b6b03df9e 100644 > > > > --- a/include/linux/huge_mm.h > > > > +++ b/include/linux/huge_mm.h > > > > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, > > > > return split_huge_page_to_list(&folio->page, list); > > > > } > > > > > > > > +/* > > > > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to > > > > + * limitations in the implementation like arm64 MTE can override this to > > > > + * false > > > > + */ > > > > +#ifndef arch_thp_swp_supported > > > > +static inline bool arch_thp_swp_supported(void) > > > > +{ > > > > + return true; > > > > +} > > > > +#endif > > > > + > > > > #endif /* _LINUX_HUGE_MM_H */ > > > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > > > > index 2b5531840583..dde685836328 100644 > > > > --- a/mm/swap_slots.c > > > > +++ b/mm/swap_slots.c > > > > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page) > > > > entry.val = 0; > > > > > > > > if (PageTransHuge(page)) { > > > > - if (IS_ENABLED(CONFIG_THP_SWAP)) > > > > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > > > > get_swap_pages(1, &entry, HPAGE_PMD_NR); > > > > goto out; > > > > > > I think this 
should work and with your other proposal it would be > > > limited to MTE pages: > > > > > > #define arch_thp_swp_supported(page) (!test_bit(PG_mte_tagged, &page->flags)) > > > > > > Are THP pages loaded from swap as a whole or are they split? IIRC the > > > > i can confirm thp is written as a whole through: > > [ 90.622863] __swap_writepage+0xe8/0x580 > > [ 90.622881] swap_writepage+0x44/0xf8 > > [ 90.622891] pageout+0xe0/0x2a8 > > [ 90.622906] shrink_page_list+0x9dc/0xde0 > > [ 90.622917] shrink_inactive_list+0x1ec/0x3c8 > > [ 90.622928] shrink_lruvec+0x3dc/0x628 > > [ 90.622939] shrink_node+0x37c/0x6a0 > > [ 90.622950] balance_pgdat+0x354/0x668 > > [ 90.622961] kswapd+0x1e0/0x3c0 > > [ 90.622972] kthread+0x110/0x120 > > > > but i have never got a backtrace in which thp is loaded as a whole though it > > seems the code has this path: > > THP could be swapped out in a whole, but never swapped in as THP. Just > the single base page (4K on x86) is swapped in.

Yep. It seems swapin_readahead() is never reading a THP, or even split
pages of this 2MB THP.

The number of pages to be read ahead is determined either by
/proc/sys/vm/page-cluster if /sys/kernel/mm/swap/vma_ra_enabled is false,
or by the vma read-ahead algorithm if /sys/kernel/mm/swap/vma_ra_enabled
is true, and the number is usually quite small.

Am I missing any case in which 2MB can be swapped in as a whole, either
as split pages or as a THP?

Thanks
Barry
On Thu, May 26, 2022 at 2:19 AM Barry Song <21cnbao@gmail.com> wrote: > > On Thu, May 26, 2022 at 5:49 AM Yang Shi <shy828301@gmail.com> wrote: > > > > On Wed, May 25, 2022 at 4:10 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > > > > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote: > > > > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > > > > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > > > > > > index d550f5acfaf3..8e3771c56fbf 100644 > > > > > > > --- a/arch/arm64/Kconfig > > > > > > > +++ b/arch/arm64/Kconfig > > > > > > > @@ -98,6 +98,7 @@ config ARM64 > > > > > > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > > > > > > select ARCH_WANT_LD_ORPHAN_WARN > > > > > > > select ARCH_WANTS_NO_INSTR > > > > > > > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > > > > > > > > > > > I'm not opposed to this but I think it would break pages mapped with > > > > > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages > > > > > > are not swapped out (or in). With MTE, we store the tags in a slab > > > > > > > > > > I assume you mean mte_sync_tags() require that THP is not swapped as a whole, > > > > > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop > > > > > THP from swapping through a couple of splitted pages, does it? > > > > > > > > That's correct, split THP page are swapped out/in just fine. > > > > > > > > > > object (128-bytes per swapped page) and restore them when pages are > > > > > > swapped in. At some point we may teach the core swap code about such > > > > > > metadata but in the meantime that was the easiest way. 
> > > > > > > > > > If my previous assumption is true, the easiest way to enable THP_SWP > > > > > for this moment might be always letting mm fallback to the splitting > > > > > way for MTE hardware. For this moment, I care about THP_SWP more as > > > > > none of my hardware has MTE. > > > > > > > > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > > > > > index 45c358538f13..d55a2a3e41a9 100644 > > > > > --- a/arch/arm64/include/asm/pgtable.h > > > > > +++ b/arch/arm64/include/asm/pgtable.h > > > > > @@ -44,6 +44,8 @@ > > > > > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) > > > > > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > > > > > > > > > +#define arch_thp_swp_supported !system_supports_mte > > > > > + > > > > > /* > > > > > * Outside of a few very special situations (e.g. hibernation), we always > > > > > * use broadcast TLB invalidation instructions, therefore a spurious page > > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > > > > index 2999190adc22..064b6b03df9e 100644 > > > > > --- a/include/linux/huge_mm.h > > > > > +++ b/include/linux/huge_mm.h > > > > > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, > > > > > return split_huge_page_to_list(&folio->page, list); > > > > > } > > > > > > > > > > +/* > > > > > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to > > > > > + * limitations in the implementation like arm64 MTE can override this to > > > > > + * false > > > > > + */ > > > > > +#ifndef arch_thp_swp_supported > > > > > +static inline bool arch_thp_swp_supported(void) > > > > > +{ > > > > > + return true; > > > > > +} > > > > > +#endif > > > > > + > > > > > #endif /* _LINUX_HUGE_MM_H */ > > > > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > > > > > index 2b5531840583..dde685836328 100644 > > > > > --- a/mm/swap_slots.c > > > > > +++ b/mm/swap_slots.c > > > > > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page 
*page) > > > > > entry.val = 0; > > > > > > > > > > if (PageTransHuge(page)) { > > > > > - if (IS_ENABLED(CONFIG_THP_SWAP)) > > > > > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > > > > > get_swap_pages(1, &entry, HPAGE_PMD_NR); > > > > > goto out; > > > > > > > > I think this should work and with your other proposal it would be > > > > limited to MTE pages: > > > > > > > > #define arch_thp_swp_supported(page) (!test_bit(PG_mte_tagged, &page->flags)) > > > > > > > > Are THP pages loaded from swap as a whole or are they split? IIRC the > > > > > > i can confirm thp is written as a whole through: > > > [ 90.622863] __swap_writepage+0xe8/0x580 > > > [ 90.622881] swap_writepage+0x44/0xf8 > > > [ 90.622891] pageout+0xe0/0x2a8 > > > [ 90.622906] shrink_page_list+0x9dc/0xde0 > > > [ 90.622917] shrink_inactive_list+0x1ec/0x3c8 > > > [ 90.622928] shrink_lruvec+0x3dc/0x628 > > > [ 90.622939] shrink_node+0x37c/0x6a0 > > > [ 90.622950] balance_pgdat+0x354/0x668 > > > [ 90.622961] kswapd+0x1e0/0x3c0 > > > [ 90.622972] kthread+0x110/0x120 > > > > > > but i have never got a backtrace in which thp is loaded as a whole though it > > > seems the code has this path: > > > > THP could be swapped out in a whole, but never swapped in as THP. Just > > the single base page (4K on x86) is swapped in. > > yep. it seems swapin_readahead() is never reading a THP or even splitted > pages for this 2MB THP. > > the number of pages to be read-ahead is determined either by > /proc/sys/vm/page-cluster if /sys/kernel/mm/swap/vma_ra_enabled is fase > or > by vma read-ahead algorithm if /sys//kernel/mm/swap/vma_ra_enabled is true > And the number is usually quite small. > > Am I missing any case in which 2MB can be swapped in as whole either by > splitted pages or a THP? Even though readahead swaps in 2MB, they are 512 single base pages rather than THP. They may not be physically continuous at all. > > Thanks > Barry
On Fri, May 27, 2022 at 5:03 AM Yang Shi <shy828301@gmail.com> wrote: > > On Thu, May 26, 2022 at 2:19 AM Barry Song <21cnbao@gmail.com> wrote: > > > > On Thu, May 26, 2022 at 5:49 AM Yang Shi <shy828301@gmail.com> wrote: > > > > > > On Wed, May 25, 2022 at 4:10 AM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > On Wed, May 25, 2022 at 7:14 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > > > > > > On Tue, May 24, 2022 at 10:05:35PM +1200, Barry Song wrote: > > > > > > On Tue, May 24, 2022 at 8:12 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > > > > On Tue, May 24, 2022 at 07:14:03PM +1200, Barry Song wrote: > > > > > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > > > > > > > > index d550f5acfaf3..8e3771c56fbf 100644 > > > > > > > > --- a/arch/arm64/Kconfig > > > > > > > > +++ b/arch/arm64/Kconfig > > > > > > > > @@ -98,6 +98,7 @@ config ARM64 > > > > > > > > select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) > > > > > > > > select ARCH_WANT_LD_ORPHAN_WARN > > > > > > > > select ARCH_WANTS_NO_INSTR > > > > > > > > + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES > > > > > > > > > > > > > > I'm not opposed to this but I think it would break pages mapped with > > > > > > > PROT_MTE. We have an assumption in mte_sync_tags() that compound pages > > > > > > > are not swapped out (or in). With MTE, we store the tags in a slab > > > > > > > > > > > > I assume you mean mte_sync_tags() require that THP is not swapped as a whole, > > > > > > as without THP_SWP, THP is still swapping after being splitted. MTE doesn't stop > > > > > > THP from swapping through a couple of splitted pages, does it? > > > > > > > > > > That's correct, split THP page are swapped out/in just fine. > > > > > > > > > > > > object (128-bytes per swapped page) and restore them when pages are > > > > > > > swapped in. 
At some point we may teach the core swap code about such > > > > > > > metadata but in the meantime that was the easiest way. > > > > > > > > > > > > If my previous assumption is true, the easiest way to enable THP_SWP > > > > > > for this moment might be always letting mm fallback to the splitting > > > > > > way for MTE hardware. For this moment, I care about THP_SWP more as > > > > > > none of my hardware has MTE. > > > > > > > > > > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > > > > > > index 45c358538f13..d55a2a3e41a9 100644 > > > > > > --- a/arch/arm64/include/asm/pgtable.h > > > > > > +++ b/arch/arm64/include/asm/pgtable.h > > > > > > @@ -44,6 +44,8 @@ > > > > > > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) > > > > > > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > > > > > > > > > > > +#define arch_thp_swp_supported !system_supports_mte > > > > > > + > > > > > > /* > > > > > > * Outside of a few very special situations (e.g. hibernation), we always > > > > > > * use broadcast TLB invalidation instructions, therefore a spurious page > > > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > > > > > index 2999190adc22..064b6b03df9e 100644 > > > > > > --- a/include/linux/huge_mm.h > > > > > > +++ b/include/linux/huge_mm.h > > > > > > @@ -447,4 +447,16 @@ static inline int split_folio_to_list(struct folio *folio, > > > > > > return split_huge_page_to_list(&folio->page, list); > > > > > > } > > > > > > > > > > > > +/* > > > > > > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to > > > > > > + * limitations in the implementation like arm64 MTE can override this to > > > > > > + * false > > > > > > + */ > > > > > > +#ifndef arch_thp_swp_supported > > > > > > +static inline bool arch_thp_swp_supported(void) > > > > > > +{ > > > > > > + return true; > > > > > > +} > > > > > > +#endif > > > > > > + > > > > > > #endif /* _LINUX_HUGE_MM_H */ > > > > > > diff --git 
a/mm/swap_slots.c b/mm/swap_slots.c > > > > > > index 2b5531840583..dde685836328 100644 > > > > > > --- a/mm/swap_slots.c > > > > > > +++ b/mm/swap_slots.c > > > > > > @@ -309,7 +309,7 @@ swp_entry_t get_swap_page(struct page *page) > > > > > > entry.val = 0; > > > > > > > > > > > > if (PageTransHuge(page)) { > > > > > > - if (IS_ENABLED(CONFIG_THP_SWAP)) > > > > > > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > > > > > > get_swap_pages(1, &entry, HPAGE_PMD_NR); > > > > > > goto out; > > > > > > > > > > I think this should work and with your other proposal it would be > > > > > limited to MTE pages: > > > > > > > > > > #define arch_thp_swp_supported(page) (!test_bit(PG_mte_tagged, &page->flags)) > > > > > > > > > > Are THP pages loaded from swap as a whole or are they split? IIRC the > > > > > > > > i can confirm thp is written as a whole through: > > > > [ 90.622863] __swap_writepage+0xe8/0x580 > > > > [ 90.622881] swap_writepage+0x44/0xf8 > > > > [ 90.622891] pageout+0xe0/0x2a8 > > > > [ 90.622906] shrink_page_list+0x9dc/0xde0 > > > > [ 90.622917] shrink_inactive_list+0x1ec/0x3c8 > > > > [ 90.622928] shrink_lruvec+0x3dc/0x628 > > > > [ 90.622939] shrink_node+0x37c/0x6a0 > > > > [ 90.622950] balance_pgdat+0x354/0x668 > > > > [ 90.622961] kswapd+0x1e0/0x3c0 > > > > [ 90.622972] kthread+0x110/0x120 > > > > > > > > but i have never got a backtrace in which thp is loaded as a whole though it > > > > seems the code has this path: > > > > > > THP could be swapped out in a whole, but never swapped in as THP. Just > > > the single base page (4K on x86) is swapped in. > > > > yep. it seems swapin_readahead() is never reading a THP or even splitted > > pages for this 2MB THP. > > > > the number of pages to be read-ahead is determined either by > > /proc/sys/vm/page-cluster if /sys/kernel/mm/swap/vma_ra_enabled is fase > > or > > by vma read-ahead algorithm if /sys//kernel/mm/swap/vma_ra_enabled is true > > And the number is usually quite small. 
> >
> > Am I missing any case in which 2MB can be swapped in as whole either by
> > splitted pages or a THP?
>
> Even though readahead swaps in 2MB, they are 512 single base pages
> rather than THP. They may not be physically continuous at all.

I actually haven't found out that readahead can swap in 2MB through
either a THP or 512 single base pages. Per my log, swapin_vma_readahead()
usually swaps in 2, 3, 4 or 8 pages, but we do have a case in which we
can swap in up to 2MB while doing collapse:

static bool __collapse_huge_page_swapin(struct mm_struct *mm,
					struct vm_area_struct *vma,
					unsigned long haddr, pmd_t *pmd,
					int referenced)
{
	int swapped_in = 0;
	vm_fault_t ret = 0;
	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);

	for (address = haddr; address < end; address += PAGE_SIZE) {
		struct vm_fault vmf = {
			.vma = vma,
			.address = address,
			.pgoff = linear_page_index(vma, haddr),
			.flags = FAULT_FLAG_ALLOW_RETRY,
			.pmd = pmd,
		};

		vmf.pte = pte_offset_map(pmd, address);
		vmf.orig_pte = *vmf.pte;
		if (!is_swap_pte(vmf.orig_pte)) {
			pte_unmap(vmf.pte);
			continue;
		}
		swapped_in++;
		ret = do_swap_page(&vmf);
		...
	}
}

It seems Huang Ying once mentioned there was a plan to not split THP
throughout the whole process.

Thanks
Barry
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d550f5acfaf3..8e3771c56fbf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -98,6 +98,7 @@ config ARM64
 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_NO_INSTR
+	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARM_AMBA
 	select ARM_ARCH_TIMER