
[v5,8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

Message ID 20240611002145.2078921-9-jthoughton@google.com (mailing list archive)
State New, archived
Series mm: multi-gen LRU: Walk secondary MMU page tables while aging

Commit Message

James Houghton June 11, 2024, 12:21 a.m. UTC
Secondary MMUs are currently consulted for access/age information at
eviction time, but before then, we don't get accurate age information.
That is, pages that are mostly accessed through a secondary MMU (like
guest memory, used by KVM) will always just proceed down to the oldest
generation, and then at eviction time, if KVM reports the page to be
young, the page will be activated/promoted back to the youngest
generation.

The added feature bit (0x8), if disabled, will make MGLRU behave as if
there are no secondary MMUs subscribed to MMU notifiers except at
eviction time.

Implement aging with the new mmu_notifier_test_clear_young_fast_only()
notifier. For architectures that do not support this notifier, this
becomes a no-op. For architectures that do implement it, it should be
fast enough to make aging worth it.

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---

Notes:
    should_look_around() can sometimes use two notifiers now instead of one.
    
    This simply comes from restricting myself to not changing
    mmu_notifier_clear_young() to return more than just "young or not".
    
    I could change mmu_notifier_clear_young() (and
    mmu_notifier_test_young()) to return if it was fast or not. At that
    point, I could just as well combine all the notifiers into one notifier,
    like what was in v2 and v3.

 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 include/linux/mmzone.h                        |   6 +-
 mm/rmap.c                                     |   9 +-
 mm/vmscan.c                                   | 185 ++++++++++++++----
 4 files changed, 164 insertions(+), 42 deletions(-)

Comments

Sean Christopherson June 12, 2024, 4:02 p.m. UTC | #1
On Tue, Jun 11, 2024, James Houghton wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e8fc5ecb59b2..24a3ff639919 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
>  			continue;
>  		}
>  
> -		if (pvmw.pte) {
> -			if (lru_gen_enabled() &&
> -			    pte_young(ptep_get(pvmw.pte))) {
> -				lru_gen_look_around(&pvmw);
> +		if (lru_gen_enabled() && pvmw.pte) {
> +			if (lru_gen_look_around(&pvmw))
>  				referenced++;
> -			}
> -
> +		} else if (pvmw.pte) {
>  			if (ptep_clear_flush_young_notify(vma, address,
>  						pvmw.pte))
>  				referenced++;

Random question not really related to KVM/secondary MMU participation.  AFAICT,
the MGLRU approach doesn't flush TLBs after aging pages.  How does MGLRU mitigate
false negatives on pxx_young() due to the CPU not setting Accessed bits because
of stale TLB entries?
Yu Zhao June 12, 2024, 4:59 p.m. UTC | #2
On Wed, Jun 12, 2024 at 10:02 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jun 11, 2024, James Houghton wrote:
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index e8fc5ecb59b2..24a3ff639919 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
> >                       continue;
> >               }
> >
> > -             if (pvmw.pte) {
> > -                     if (lru_gen_enabled() &&
> > -                         pte_young(ptep_get(pvmw.pte))) {
> > -                             lru_gen_look_around(&pvmw);
> > +             if (lru_gen_enabled() && pvmw.pte) {
> > +                     if (lru_gen_look_around(&pvmw))
> >                               referenced++;
> > -                     }
> > -
> > +             } else if (pvmw.pte) {
> >                       if (ptep_clear_flush_young_notify(vma, address,
> >                                               pvmw.pte))
> >                               referenced++;
>
> Random question not really related to KVM/secondary MMU participation.  AFAICT,
> the MGLRU approach doesn't flush TLBs after aging pages.  How does MGLRU mitigate
> false negatives on pxx_young() due to the CPU not setting Accessed bits because
> of stale TLB entries?

I do think there can be false negatives but we have not been able to
measure their practical impacts since we disabled the flush on some
host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
ptep_clear_flush_young() is just ptep_test_and_clear_young(). The
theoretical basis is that, given the TLB coverage trend (Figure 1 in
[1]), when a system is running out of memory, it's unlikely to have
many long-lived entries in its TLB. IOW, if that system had a stable
working set (hot memory) that can fit into its TLB, it wouldn't hit
page reclaim. Again, this is based on the theory (proposition) that
for most systems, their TLB coverages are much smaller than their
memory sizes.
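
For concreteness, a sketch of what this looks like on x86,
paraphrasing arch/x86/mm/pgtable.c (comments condensed, not a
verbatim copy):

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86, clearing the accessed bit without flushing the TLB
	 * cannot cause data corruption; the worst case is a stale
	 * "young" TLB entry, i.e. a false negative for reclaim.
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}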

If/when the above proposition doesn't hold, the next step in the page
reclaim path, which is to unmap the PTE, will cause a page fault. The
fault can be minor or major (requires IO), depending on the race
between the reclaiming and accessing threads. In this case, the
tradeoff, in a steady state, is between the PF cost of pages we
shouldn't reclaim and the flush cost of pages we scan. The PF cost is
higher than the flush cost per page. But we scan many pages and only
reclaim a few of them; pages we shouldn't reclaim are a (small)
portion of the latter.

[1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
Sean Christopherson June 12, 2024, 5:23 p.m. UTC | #3
On Wed, Jun 12, 2024, Yu Zhao wrote:
> On Wed, Jun 12, 2024 at 10:02 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Jun 11, 2024, James Houghton wrote:
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index e8fc5ecb59b2..24a3ff639919 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
> > >                       continue;
> > >               }
> > >
> > > -             if (pvmw.pte) {
> > > -                     if (lru_gen_enabled() &&
> > > -                         pte_young(ptep_get(pvmw.pte))) {
> > > -                             lru_gen_look_around(&pvmw);
> > > +             if (lru_gen_enabled() && pvmw.pte) {
> > > +                     if (lru_gen_look_around(&pvmw))
> > >                               referenced++;
> > > -                     }
> > > -
> > > +             } else if (pvmw.pte) {
> > >                       if (ptep_clear_flush_young_notify(vma, address,
> > >                                               pvmw.pte))
> > >                               referenced++;
> >
> > Random question not really related to KVM/secondary MMU participation.  AFAICT,
> > the MGLRU approach doesn't flush TLBs after aging pages.  How does MGLRU mitigate
> > false negatives on pxx_young() due to the CPU not setting Accessed bits because
> > of stale TLB entries?
> 
> I do think there can be false negatives but we have not been able to
> measure their practical impacts since we disabled the flush on some
> host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
> ptep_clear_flush_young() is just ptep_test_and_clear_young().

Aha!  That's what I was missing, I somehow didn't see x86's ptep_clear_flush_young().

That begs the question, why does KVM flush TLBs on architectures that don't need
to?  And since kvm_mmu_notifier_clear_young() explicitly doesn't flush, are there
even any KVM-supported architectures for which the flush is mandatory?

Skipping the flush on KVM x86 seems like a complete no-brainer.

Will, Marc and/or Oliver, what are arm64's requirements in this area?  E.g. I see
that arm64's version of __ptep_clear_flush_young() does TLBI but not DSB.  Should
KVM be doing something similar?  Can KVM safely skip even the TLBI?

> theoretical basis is that, given the TLB coverage trend (Figure 1 in
> [1]), when a system is running out of memory, it's unlikely to have
> many long-lived entries in its TLB. IOW, if that system had a stable
> working set (hot memory) that can fit into its TLB, it wouldn't hit
> page reclaim. Again, this is based on the theory (proposition) that
> for most systems, their TLB coverages are much smaller than their
> memory sizes.
> 
> If/when the above proposition doesn't hold, the next step in the page
> reclaim path, which is to unmap the PTE, will cause a page fault. The
> fault can be minor or major (requires IO), depending on the race
> between the reclaiming and accessing threads. In this case, the
> tradeoff, in a steady state, is between the PF cost of pages we
> shouldn't reclaim and the flush cost of pages we scan. The PF cost is
> higher than the flush cost per page. But we scan many pages and only
> reclaim a few of them; pages we shouldn't reclaim are a (small)
> portion of the latter.
> 
> [1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
Oliver Upton June 13, 2024, 6:49 a.m. UTC | #4
On Wed, Jun 12, 2024 at 10:23:38AM -0700, Sean Christopherson wrote:
> On Wed, Jun 12, 2024, Yu Zhao wrote:
> > I do think there can be false negatives but we have not been able to
> > measure their practical impacts since we disabled the flush on some
> > host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
> > ptep_clear_flush_young() is just ptep_test_and_clear_young().
> 
> Aha!  That's what I was missing, I somehow didn't see x86's ptep_clear_flush_young().

Heh, well the helper name isn't exactly giving any hints...

> That begs the question, why does KVM flush TLBs on architectures that don't need
> to?  And since kvm_mmu_notifier_clear_young() explicitly doesn't flush, are there
> even any KVM-supported architectures for which the flush is mandatory?
> 
> Skipping the flush on KVM x86 seems like a complete no-brainer.
> 
> Will, Marc and/or Oliver, what are arm64's requirements in this area?  E.g. I see
> that arm64's version of __ptep_clear_flush_young() does TLBI but not DSB.  Should
> KVM be doing something similar?  Can KVM safely skip even the TLBI?

Short answer, yes, KVM can elide TLBIs when clearing AF.

Long answer: Software needs to be extremely careful to ensure that TLBI
elision doesn't lead to a failure to uphold break-before-make requirements,
if we're only concerned with architecture-specific requirements. IOW, the AF
cannot be used as a hint for the presence of TLB entries for a given PTE.

There's the obvious failure of skipping TLBIs for old pages when
unmapping, but that isn't an architecture-specific issue.

So, since KVM/arm64 doesn't play any games with the AF at stage-2, leaving
out a TLBI when aging ought to be fine.
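
For reference, a sketch of the arm64 primitive Sean mentions,
paraphrasing arch/arm64/include/asm/pgtable.h: the TLBI is issued
without the trailing DSB, so the stale-entry window is bounded by the
next context switch:

static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
					   unsigned long address, pte_t *ptep)
{
	int young = __ptep_test_and_clear_young(vma, address, ptep);

	if (young) {
		/*
		 * Eliding the trailing DSB is tolerable: at worst a CPU
		 * keeps using a stale young entry and the page is
		 * mistakenly reclaimed. The window is bounded by the
		 * next context switch, which completes the invalidation.
		 */
		flush_tlb_page_nosync(vma, address);
	}

	return young;
}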
Yu Zhao July 5, 2024, 6:35 p.m. UTC | #5
On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote:
>
> Secondary MMUs are currently consulted for access/age information at
> eviction time, but before then, we don't get accurate age information.
> That is, pages that are mostly accessed through a secondary MMU (like
> guest memory, used by KVM) will always just proceed down to the oldest
> generation, and then at eviction time, if KVM reports the page to be
> young, the page will be activated/promoted back to the youngest
> generation.
>
> The added feature bit (0x8), if disabled, will make MGLRU behave as if
> there are no secondary MMUs subscribed to MMU notifiers except at
> eviction time.
>
> Implement aging with the new mmu_notifier_test_clear_young_fast_only()
> notifier. For architectures that do not support this notifier, this
> becomes a no-op. For architectures that do implement it, it should be
> fast enough to make aging worth it.
>
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>
> Notes:
>     should_look_around() can sometimes use two notifiers now instead of one.
>
>     This simply comes from restricting myself to not changing
>     mmu_notifier_clear_young() to return more than just "young or not".
>
>     I could change mmu_notifier_clear_young() (and
>     mmu_notifier_test_young()) to return if it was fast or not. At that
>     point, I could just as well combine all the notifiers into one notifier,
>     like what was in v2 and v3.
>
>  Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
>  include/linux/mmzone.h                        |   6 +-
>  mm/rmap.c                                     |   9 +-
>  mm/vmscan.c                                   | 185 ++++++++++++++----
>  4 files changed, 164 insertions(+), 42 deletions(-)

...

>  static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                            struct mm_walk *args)
>  {
> @@ -3357,8 +3416,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>         struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
>         DEFINE_MAX_SEQ(walk->lruvec);
>         int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +       struct mm_struct *mm = args->mm;
>
> -       pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
> +       pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl);
>         if (!pte)
>                 return false;
>         if (!spin_trylock(ptl)) {
> @@ -3376,11 +3436,12 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                 total++;
>                 walk->mm_stats[MM_LEAF_TOTAL]++;
>
> -               pfn = get_pte_pfn(ptent, args->vma, addr);
> +               pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
>                 if (pfn == -1)
>                         continue;
>
> -               if (!pte_young(ptent)) {
> +               if (!pte_young(ptent) &&
> +                   !lru_gen_notifier_test_young(mm, addr)) {
>                         walk->mm_stats[MM_LEAF_OLD]++;
>                         continue;
>                 }
> @@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                 if (!folio)
>                         continue;
>
> -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> -                       VM_WARN_ON_ONCE(true);
> +               lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
> +               if (pte_young(ptent))
> +                       ptep_test_and_clear_young(args->vma, addr, pte + i);
>
>                 young++;
>                 walk->mm_stats[MM_LEAF_YOUNG]++;


There are two ways to structure the test conditions in walk_pte_range():
1. a single pass into the MMU notifier (combine test/clear) which
causes a cache miss from get_pfn_page() if the page is NOT young.
2. two passes into the MMU notifier (separate test/clear) if the page
is young, which does NOT cause a cache miss if the page is NOT young.

v2 can batch up to 64 PTEs, i.e., it only goes into the MMU notifier
twice every 64 PTEs, and therefore the second option is a clear win.

But you are doing it twice per PTE. So what's the rationale behind going
with the second option? Was the first option considered?

In addition, what about the non-lockless cases? Would this change make
them worse by grabbing the MMU lock twice per PTE?
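
To make the two options concrete, minimal loop-body sketches for
walk_pte_range() (helper names follow this patch and
include/linux/mmu_notifier.h; illustrative fragments, not drop-in
code):

/* Option 1: one combined test+clear pass per PTE. The folio lookup
 * comes before the young check, so NOT-young pages still pay the
 * struct folio cache miss. */
folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
	continue;
if (!ptep_clear_young_notify(args->vma, addr, pte + i))
	continue;	/* old */

/* Option 2: test first, clear later. NOT-young pages are skipped
 * before the folio lookup, but young pages cost two notifier calls. */
if (!pte_young(ptent) && !lru_gen_notifier_test_young(mm, addr))
	continue;	/* old: no folio load */
folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
	continue;
lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);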
James Houghton July 8, 2024, 5:30 p.m. UTC | #6
On Fri, Jul 5, 2024 at 11:36 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Secondary MMUs are currently consulted for access/age information at
> > eviction time, but before then, we don't get accurate age information.
> > That is, pages that are mostly accessed through a secondary MMU (like
> > guest memory, used by KVM) will always just proceed down to the oldest
> > generation, and then at eviction time, if KVM reports the page to be
> > young, the page will be activated/promoted back to the youngest
> > generation.
> >
> > The added feature bit (0x8), if disabled, will make MGLRU behave as if
> > there are no secondary MMUs subscribed to MMU notifiers except at
> > eviction time.
> >
> > Implement aging with the new mmu_notifier_test_clear_young_fast_only()
> > notifier. For architectures that do not support this notifier, this
> > becomes a no-op. For architectures that do implement it, it should be
> > fast enough to make aging worth it.
> >
> > Suggested-by: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >
> > Notes:
> >     should_look_around() can sometimes use two notifiers now instead of one.
> >
> >     This simply comes from restricting myself to not changing
> >     mmu_notifier_clear_young() to return more than just "young or not".
> >
> >     I could change mmu_notifier_clear_young() (and
> >     mmu_notifier_test_young()) to return if it was fast or not. At that
> >     point, I could just as well combine all the notifiers into one notifier,
> >     like what was in v2 and v3.
> >
> >  Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
> >  include/linux/mmzone.h                        |   6 +-
> >  mm/rmap.c                                     |   9 +-
> >  mm/vmscan.c                                   | 185 ++++++++++++++----
> >  4 files changed, 164 insertions(+), 42 deletions(-)
>
> ...
>
> >  static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >                            struct mm_walk *args)
> >  {
> > @@ -3357,8 +3416,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >         struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
> >         DEFINE_MAX_SEQ(walk->lruvec);
> >         int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > +       struct mm_struct *mm = args->mm;
> >
> > -       pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
> > +       pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl);
> >         if (!pte)
> >                 return false;
> >         if (!spin_trylock(ptl)) {
> > @@ -3376,11 +3436,12 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >                 total++;
> >                 walk->mm_stats[MM_LEAF_TOTAL]++;
> >
> > -               pfn = get_pte_pfn(ptent, args->vma, addr);
> > +               pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
> >                 if (pfn == -1)
> >                         continue;
> >
> > -               if (!pte_young(ptent)) {
> > +               if (!pte_young(ptent) &&
> > +                   !lru_gen_notifier_test_young(mm, addr)) {
> >                         walk->mm_stats[MM_LEAF_OLD]++;
> >                         continue;
> >                 }
> > @@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> >                 if (!folio)
> >                         continue;
> >
> > -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > -                       VM_WARN_ON_ONCE(true);
> > +               lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
> > +               if (pte_young(ptent))
> > +                       ptep_test_and_clear_young(args->vma, addr, pte + i);
> >
> >                 young++;
> >                 walk->mm_stats[MM_LEAF_YOUNG]++;
>
>
> There are two ways to structure the test conditions in walk_pte_range():
> 1. a single pass into the MMU notifier (combine test/clear) which
> causes a cache miss from get_pfn_page() if the page is NOT young.
> 2. two passes into the MMU notifier (separate test/clear) if the page
> is young, which does NOT cause a cache miss if the page is NOT young.
>
> v2 can batch up to 64 PTEs, i.e., it only goes into the MMU notifier
> twice every 64 PTEs, and therefore the second option is a clear win.
>
> But you are doing it twice per PTE. So what's the rationale behind going
> with the second option? Was the first option considered?

Hi Yu,

I didn't consider changing this from your v2[1]. Thanks for bringing it up.

The only real change I have made is that I reordered the
(!test_spte_young() && !pte_young()) to what it is now (!pte_young()
&& !lru_gen_notifier_test_young()) because pte_young() can be
evaluated much faster.

I am happy to change the initial test_young() notifier to a
clear_young() (and drop the later clear_young()). In fact, I think I
should. Making the condition (!pte_young() &&
!lru_gen_notifier_clear_young()) makes sense to me. This returns the
same result as if it were !lru_gen_notifier_test_young() instead,
there is no need for a second clear_young(), and we don't call
get_pfn_folio() on pages that are not young.

WDYT? Have I misunderstood your comment?

Also, I take it your comment was not just about walk_pte_range() but
about the similar bits in lru_gen_look_around() as well, so I'll make
whatever changes we agree on there too (or maybe factor out the common
bits).

[1]: https://lore.kernel.org/kvmarm/20230526234435.662652-11-yuzhao@google.com/

> In addition, what about the non-lockless cases? Would this change make
> them worse by grabbing the MMU lock twice per PTE?

That's a good point. Yes I think calling the notifier twice here would
indeed exacerbate problems with a non-lockless notifier.

Thanks!
Yu Zhao July 8, 2024, 11:41 p.m. UTC | #7
On Mon, Jul 8, 2024 at 11:31 AM James Houghton <jthoughton@google.com> wrote:
>
> On Fri, Jul 5, 2024 at 11:36 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote:
> > >
> > > Secondary MMUs are currently consulted for access/age information at
> > > eviction time, but before then, we don't get accurate age information.
> > > That is, pages that are mostly accessed through a secondary MMU (like
> > > guest memory, used by KVM) will always just proceed down to the oldest
> > > generation, and then at eviction time, if KVM reports the page to be
> > > young, the page will be activated/promoted back to the youngest
> > > generation.
> > >
> > > The added feature bit (0x8), if disabled, will make MGLRU behave as if
> > > there are no secondary MMUs subscribed to MMU notifiers except at
> > > eviction time.
> > >
> > > Implement aging with the new mmu_notifier_test_clear_young_fast_only()
> > > notifier. For architectures that do not support this notifier, this
> > > becomes a no-op. For architectures that do implement it, it should be
> > > fast enough to make aging worth it.
> > >
> > > Suggested-by: Yu Zhao <yuzhao@google.com>
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> > >
> > > Notes:
> > >     should_look_around() can sometimes use two notifiers now instead of one.
> > >
> > >     This simply comes from restricting myself to not changing
> > >     mmu_notifier_clear_young() to return more than just "young or not".
> > >
> > >     I could change mmu_notifier_clear_young() (and
> > >     mmu_notifier_test_young()) to return if it was fast or not. At that
> > >     point, I could just as well combine all the notifiers into one notifier,
> > >     like what was in v2 and v3.
> > >
> > >  Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
> > >  include/linux/mmzone.h                        |   6 +-
> > >  mm/rmap.c                                     |   9 +-
> > >  mm/vmscan.c                                   | 185 ++++++++++++++----
> > >  4 files changed, 164 insertions(+), 42 deletions(-)
> >
> > ...
> >
> > >  static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > >                            struct mm_walk *args)
> > >  {
> > > @@ -3357,8 +3416,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > >         struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
> > >         DEFINE_MAX_SEQ(walk->lruvec);
> > >         int old_gen, new_gen = lru_gen_from_seq(max_seq);
> > > +       struct mm_struct *mm = args->mm;
> > >
> > > -       pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
> > > +       pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl);
> > >         if (!pte)
> > >                 return false;
> > >         if (!spin_trylock(ptl)) {
> > > @@ -3376,11 +3436,12 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > >                 total++;
> > >                 walk->mm_stats[MM_LEAF_TOTAL]++;
> > >
> > > -               pfn = get_pte_pfn(ptent, args->vma, addr);
> > > +               pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
> > >                 if (pfn == -1)
> > >                         continue;
> > >
> > > -               if (!pte_young(ptent)) {
> > > +               if (!pte_young(ptent) &&
> > > +                   !lru_gen_notifier_test_young(mm, addr)) {
> > >                         walk->mm_stats[MM_LEAF_OLD]++;
> > >                         continue;
> > >                 }
> > > @@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > >                 if (!folio)
> > >                         continue;
> > >
> > > -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > > -                       VM_WARN_ON_ONCE(true);
> > > +               lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
> > > +               if (pte_young(ptent))
> > > +                       ptep_test_and_clear_young(args->vma, addr, pte + i);
> > >
> > >                 young++;
> > >                 walk->mm_stats[MM_LEAF_YOUNG]++;
> >
> >
> > There are two ways to structure the test conditions in walk_pte_range():
> > 1. a single pass into the MMU notifier (combine test/clear) which
> > causes a cache miss from get_pfn_page() if the page is NOT young.
> > 2. two passes into the MMU notifier (separate test/clear) if the page
> > is young, which does NOT cause a cache miss if the page is NOT young.
> >
> > v2 can batch up to 64 PTEs, i.e., it only goes into the MMU notifier
> > twice every 64 PTEs, and therefore the second option is a clear win.
> >
> > But you are doing it twice per PTE. So what's the rationale behind going
> > with the second option? Was the first option considered?
>
> Hi Yu,
>
> I didn't consider changing this from your v2[1]. Thanks for bringing it up.
>
> The only real change I have made is that I reordered the
> (!test_spte_young() && !pte_young()) to what it is now (!pte_young()
> && !lru_gen_notifier_test_young()) because pte_young() can be
> evaluated much faster.
>
> I am happy to change the initial test_young() notifier to a
> clear_young() (and drop the later clear_young()). In fact, I think I
> should. Making the condition (!pte_young() &&
> !lru_gen_notifier_clear_young()) makes sense to me. This returns the
> same result as if it were !lru_gen_notifier_test_young() instead,
> there is no need for a second clear_young(), and we don't call
> get_pfn_folio() on pages that are not young.

We don't want to do that because we would lose the A-bit for a folio
that's beyond the current reclaim scope, i.e., the cases where
get_pfn_folio() returns NULL (a folio from another memcg, e.g.).

> WDYT? Have I misunderstood your comment?

I hope this is clear enough:

@@ -3395,7 +3395,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
long start, unsigned long end,
                if (pfn == -1)
                        continue;

-               if (!pte_young(ptent)) {
+               if (!pte_young(ptent) && !mm_has_notifiers(args->mm)) {
                        walk->mm_stats[MM_LEAF_OLD]++;
                        continue;
                }
@@ -3404,8 +3404,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
long start, unsigned long end,
                if (!folio)
                        continue;

-               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
-                       VM_WARN_ON_ONCE(true);
+               if (!ptep_clear_young_notify(args->vma, addr, pte + i))
+                       continue;

                young++;
                walk->mm_stats[MM_LEAF_YOUNG]++;

> Also, I take it your comment was not just about walk_pte_range() but
> about the similar bits in lru_gen_look_around() as well, so I'll make
> whatever changes we agree on there too (or maybe factor out the common
> bits).
>
> [1]: https://lore.kernel.org/kvmarm/20230526234435.662652-11-yuzhao@google.com/
>
> > In addition, what about the non-lockless cases? Would this change make
> > them worse by grabbing the MMU lock twice per PTE?
>
> That's a good point. Yes I think calling the notifier twice here would
> indeed exacerbate problems with a non-lockless notifier.

I think so too, but I haven't verified it. Please do?
James Houghton July 22, 2024, 8:45 p.m. UTC | #8
On Mon, Jul 8, 2024 at 4:42 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jul 8, 2024 at 11:31 AM James Houghton <jthoughton@google.com> wrote:
> >
> > On Fri, Jul 5, 2024 at 11:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote:
> > > > @@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > > >                 if (!folio)
> > > >                         continue;
> > > >
> > > > -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > > > -                       VM_WARN_ON_ONCE(true);
> > > > +               lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
> > > > +               if (pte_young(ptent))
> > > > +                       ptep_test_and_clear_young(args->vma, addr, pte + i);
> > > >
> > > >                 young++;
> > > >                 walk->mm_stats[MM_LEAF_YOUNG]++;
> > >
> > >
> > > There are two ways to structure the test conditions in walk_pte_range():
> > > 1. a single pass into the MMU notifier (combine test/clear) which
> > > causes a cache miss from get_pfn_page() if the page is NOT young.
> > > 2. two passes into the MMU notifier (separate test/clear) if the page
> > > is young, which does NOT cause a cache miss if the page is NOT young.
> > >
> > > v2 can batch up to 64 PTEs, i.e., it only goes into the MMU notifier
> > > twice every 64 PTEs, and therefore the second option is a clear win.
> > >
> > > But you are doing it twice per PTE. So what's the rationale behind going
> > > with the second option? Was the first option considered?
> >
> > Hi Yu,
> >
> > I didn't consider changing this from your v2[1]. Thanks for bringing it up.
> >
> > The only real change I have made is that I reordered the
> > (!test_spte_young() && !pte_young()) to what it is now (!pte_young()
> > && !lru_gen_notifier_test_young()) because pte_young() can be
> > evaluated much faster.
> >
> > I am happy to change the initial test_young() notifier to a
> > clear_young() (and drop the later clear_young(). In fact, I think I
> > should. Making the condition (!pte_young() &&
> > !lru_gen_notifier_clear_young()) makes sense to me. This returns the
> > same result as if it were !lru_gen_notifier_test_young() instead,
> > there is no need for a second clear_young(), and we don't call
> > get_pfn_folio() on pages that are not young.
>
> We don't want to do that because we would lose the A-bit for a folio
> that's beyond the current reclaim scope, i.e., the cases where
> get_pfn_folio() returns NULL (a folio from another memcg, e.g.).
>
> > WDYT? Have I misunderstood your comment?
>
> I hope this is clear enough:
>
> @@ -3395,7 +3395,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
> long start, unsigned long end,
>                 if (pfn == -1)
>                         continue;
>
> -               if (!pte_young(ptent)) {
> +               if (!pte_young(ptent) && !mm_has_notifiers(args->mm)) {
>                         walk->mm_stats[MM_LEAF_OLD]++;
>                         continue;
>                 }
> @@ -3404,8 +3404,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
> long start, unsigned long end,
>                 if (!folio)
>                         continue;
>
> -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> -                       VM_WARN_ON_ONCE(true);
> +               if (!ptep_clear_young_notify(args->vma, addr, pte + i))

walk->mm_stats[MM_LEAF_OLD]++ should be here, I take it.

> +                       continue;
>
>                 young++;
>                 walk->mm_stats[MM_LEAF_YOUNG]++;
>
> > Also, I take it your comment was not just about walk_pte_range() but
> > about the similar bits in lru_gen_look_around() as well, so I'll make
> > whatever changes we agree on there too (or maybe factor out the common
> > bits).
> >
> > [1]: https://lore.kernel.org/kvmarm/20230526234435.662652-11-yuzhao@google.com/
> >
> > > In addition, what about the non-lockless cases? Would this change make
> > > them worse by grabbing the MMU lock twice per PTE?
> >
> > That's a good point. Yes I think calling the notifier twice here would
> > indeed exacerbate problems with a non-lockless notifier.
>
> I think so too, but I haven't verified it. Please do?

I have some results now, sorry for the wait.

It seems like one notifier is definitely better. It doesn't look like
the read lock actually made anything worse with what I was testing
(faulting memory in while doing aging). This is kind of surprising,
but either way, I'll change it to the single notifier in v6. Thanks
Yu!

Here are the results I'm basing this conclusion on, using the selftest
added at the end of this series.

# Use taskset to minimize NUMA concern.
# Give an extra core for the aging thread.
# THPs disabled (echo never > /sys/kernel/mm/transparent_hugepage/enabled)

x86:

# taskset -c 0-32 ./access_tracking_perf_test -l -v 32
# # One notifier
Populating memory             : 1.933017284s
Writing to populated memory   : 0.017323539s
Reading from populated memory : 0.013113260s
lru_gen: Aging                : 0.894133259s
lru_gen: Aging                : 0.738950525s
Writing to idle memory        : 0.059661329s
lru_gen: Aging                : 0.922719935s
lru_gen: Aging                : 0.829129877s
Reading from idle memory      : 0.059095098s
lru_gen: Aging                : 0.922689975s

# # Two notifiers
Populating memory             : 1.842645795s
Writing to populated memory   : 0.017277075s
Reading from populated memory : 0.013047457s
lru_gen: Aging                : 0.900751764s
lru_gen: Aging                : 0.707203167s
Writing to idle memory        : 0.060663733s
lru_gen: Aging                : 1.539957250s  <------ got longer
lru_gen: Aging                : 0.797475887s
Reading from idle memory      : 0.084415591s
lru_gen: Aging                : 1.539417121s  <------ got longer

arm64*:
(*Patched to do aging; not done in v5 or v6. Doing this to see if the read
lock is made substantially worse by using two notifiers vs. one.)

# taskset -c 0-16 ./access_tracking_perf_test -l -v 16 -m 3
# # One notifier
Populating memory             : 1.439261355s
Writing to populated memory   : 0.009755279s
Reading from populated memory : 0.007714120s
lru_gen: Aging                : 0.540183328s
lru_gen: Aging                : 0.455427973s
Writing to idle memory        : 0.010130399s
lru_gen: Aging                : 0.563424247s
lru_gen: Aging                : 0.500419850s
Reading from idle memory      : 0.008519640s
lru_gen: Aging                : 0.563178643s

# # Two notifiers
Populating memory             : 1.526805625s
Writing to populated memory   : 0.009836118s
Reading from populated memory : 0.007757280s
lru_gen: Aging                : 0.537770978s
lru_gen: Aging                : 0.421915391s
Writing to idle memory        : 0.010281959s
lru_gen: Aging                : 0.971448688s  <------ got longer
lru_gen: Aging                : 0.466956547s
Reading from idle memory      : 0.008588559s
lru_gen: Aging                : 0.971030648s  <------ got longer


arm64, faulting memory in while aging:

# perf record -g -- taskset -c 0-16 ./access_tracking_perf_test -l -v 16 -m 3 -p
# # One notifier
vcpu wall time                : 1.433908058s
lru_gen avg pass duration     : 0.172128073s, (passes:11, total:1.893408807s)

# # Two notifiers
vcpu wall time                : 1.450387765s
lru_gen avg pass duration     : 0.175652974s, (passes:10, total:1.756529744s)

# perf report
# # One notifier
-    6.25%     0.00%  access_tracking  [kernel.kallsyms]  [k] try_to_inc_max_seq
   - try_to_inc_max_seq
      - 6.06% walk_page_range
           __walk_page_range
         - walk_pgd_range
            - 6.04% walk_pud_range
               - 4.73% __mmu_notifier_clear_young
                  + 4.29% kvm_mmu_notifier_clear_young

# # Two notifiers
-    6.43%     0.00%  access_tracking  [kernel.kallsyms]  [k] try_to_inc_max_seq
   - try_to_inc_max_seq
      - 6.25% walk_page_range
           __walk_page_range
         - walk_pgd_range
            - 6.23% walk_pud_range
               - 2.75% __mmu_notifier_test_young
                  + 2.48% kvm_mmu_notifier_test_young
               - 2.39% __mmu_notifier_clear_young
                  + 2.19% kvm_mmu_notifier_clear_young
Yu Zhao July 22, 2024, 9:23 p.m. UTC | #9
On Mon, Jul 22, 2024 at 2:46 PM James Houghton <jthoughton@google.com> wrote:
>
> On Mon, Jul 8, 2024 at 4:42 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jul 8, 2024 at 11:31 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > On Fri, Jul 5, 2024 at 11:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <jthoughton@google.com> wrote:
> > > > > @@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > > > >                 if (!folio)
> > > > >                         continue;
> > > > >
> > > > > -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > > > > -                       VM_WARN_ON_ONCE(true);
> > > > > +               lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
> > > > > +               if (pte_young(ptent))
> > > > > +                       ptep_test_and_clear_young(args->vma, addr, pte + i);
> > > > >
> > > > >                 young++;
> > > > >                 walk->mm_stats[MM_LEAF_YOUNG]++;
> > > >
> > > >
> > > > There are two ways to structure the test conditions in walk_pte_range():
> > > > 1. a single pass into the MMU notifier (combine test/clear) which
> > > > causes a cache miss from get_pfn_page() if the page is NOT young.
> > > > 2. two passes into the MMU notifier (separate test/clear) if the page
> > > > is young, which does NOT cause a cache miss if the page is NOT young.
> > > >
> > > > v2 can batch up to 64 PTEs, i.e., it only goes into the MMU notifier
> > > > twice every 64 PTEs, and therefore the second option is a clear win.
> > > >
> > > > But you are doing it twice per PTE. So what's the rationale behind going
> > > > with the second option? Was the first option considered?
> > >
> > > Hi Yu,
> > >
> > > I didn't consider changing this from your v2[1]. Thanks for bringing it up.
> > >
> > > The only real change I have made is that I reordered the
> > > (!test_spte_young() && !pte_young()) to what it is now (!pte_young()
> > > && !lru_gen_notifier_test_young()) because pte_young() can be
> > > evaluated much faster.
> > >
> > > I am happy to change the initial test_young() notifier to a
> > > clear_young() (and drop the later clear_young()). In fact, I think I
> > > should. Making the condition (!pte_young() &&
> > > !lru_gen_notifier_clear_young()) makes sense to me. This returns the
> > > same result as if it were !lru_gen_notifier_test_young() instead,
> > > there is no need for a second clear_young(), and we don't call
> > > get_pfn_folio() on pages that are not young.
> >
> > We don't want to do that because we would lose the A-bit for a folio
> > that's beyond the current reclaim scope, i.e., the cases where
> > get_pfn_folio() returns NULL (a folio from another memcg, e.g.).
> >
> > > WDYT? Have I misunderstood your comment?
> >
> > I hope this is clear enough:
> >
> > @@ -3395,7 +3395,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
> > long start, unsigned long end,
> >                 if (pfn == -1)
> >                         continue;
> >
> > -               if (!pte_young(ptent)) {
> > +               if (!pte_young(ptent) && !mm_has_notifiers(args->mm)) {
> >                         walk->mm_stats[MM_LEAF_OLD]++;
> >                         continue;
> >                 }
> > @@ -3404,8 +3404,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned
> > long start, unsigned long end,
> >                 if (!folio)
> >                         continue;
> >
> > -               if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > -                       VM_WARN_ON_ONCE(true);
> > +               if (!ptep_clear_young_notify(args->vma, addr, pte + i))
>
> walk->mm_stats[MM_LEAF_OLD]++ should be here, I take it.
>
> > +                       continue;
> >
> >                 young++;
> >                 walk->mm_stats[MM_LEAF_YOUNG]++;
> >
> > > Also, I take it your comment was not just about walk_pte_range() but
> > > about the similar bits in lru_gen_look_around() as well, so I'll make
> > > whatever changes we agree on there too (or maybe factor out the common
> > > bits).
> > >
> > > [1]: https://lore.kernel.org/kvmarm/20230526234435.662652-11-yuzhao@google.com/
> > >
> > > > In addition, what about the non-lockless cases? Would this change make
> > > > them worse by grabbing the MMU lock twice per PTE?
> > >
> > > That's a good point. Yes I think calling the notifier twice here would
> > > indeed exacerbate problems with a non-lockless notifier.
> >
> > I think so too, but I haven't verified it. Please do?
>
> I have some results now, sorry for the wait.
>
> It seems like one notifier is definitely better. It doesn't look like
> the read lock actually made anything worse with what I was testing
> (faulting memory in while doing aging). This is kind of surprising,

Not at all if you were only doing the aging path, which only takes the
lock for read.

Under memory pressure, we need to do both the aging and the eviction, and the
latter has to take the lock for write (to unmap). And that's when the
real contention happens, because the search space is too big -- the
entire system memory for global reclaim -- unmapping can easily
collide with clearing the A-bit.


Patch

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..1e578e0c4c0c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@  Values Components
        verified on x86 varieties other than Intel and AMD. If it is
        disabled, the multi-gen LRU will suffer a negligible
        performance degradation.
+0x0008 Continuously clear the accessed bit in secondary MMU page
+       tables instead of waiting until eviction time. This results in
+       accurate page age information for pages that are mainly used by
+       a secondary MMU.
 [yYnN] Apply to all the components above.
 ====== ===============================================================
 
@@ -56,7 +60,7 @@  E.g.,
 
     echo y >/sys/kernel/mm/lru_gen/enabled
     cat /sys/kernel/mm/lru_gen/enabled
-    0x0007
+    0x000f
     echo 5 >/sys/kernel/mm/lru_gen/enabled
     cat /sys/kernel/mm/lru_gen/enabled
     0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..869824ef5f3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -400,6 +400,7 @@  enum {
 	LRU_GEN_CORE,
 	LRU_GEN_MM_WALK,
 	LRU_GEN_NONLEAF_YOUNG,
+	LRU_GEN_SECONDARY_MMU_WALK,
 	NR_LRU_GEN_CAPS
 };
 
@@ -557,7 +558,7 @@  struct lru_gen_memcg {
 
 void lru_gen_init_pgdat(struct pglist_data *pgdat);
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 void lru_gen_init_memcg(struct mem_cgroup *memcg);
 void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -576,8 +577,9 @@  static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
 {
 }
 
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
+	return false;
 }
 
 static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..24a3ff639919 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -870,13 +870,10 @@  static bool folio_referenced_one(struct folio *folio,
 			continue;
 		}
 
-		if (pvmw.pte) {
-			if (lru_gen_enabled() &&
-			    pte_young(ptep_get(pvmw.pte))) {
-				lru_gen_look_around(&pvmw);
+		if (lru_gen_enabled() && pvmw.pte) {
+			if (lru_gen_look_around(&pvmw))
 				referenced++;
-			}
-
+		} else if (pvmw.pte) {
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte))
 				referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..348f3ffc8d5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@ 
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2579,6 +2580,21 @@  static bool should_clear_pmd_young(void)
 	return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
 }
 
+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+#include <linux/kvm_host.h>
+static bool should_walk_secondary_mmu(void)
+{
+	return kvm_arch_young_notifier_likely_fast() &&
+	       get_cap(LRU_GEN_SECONDARY_MMU_WALK);
+}
+#else
+static bool should_walk_secondary_mmu(void)
+{
+	return false;
+}
+#endif
+
+
 /******************************************************************************
  *                          shorthand helpers
  ******************************************************************************/
@@ -3276,7 +3292,8 @@  static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk
 	return false;
 }
 
-static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr,
+				 struct pglist_data *pgdat)
 {
 	unsigned long pfn = pte_pfn(pte);
 
@@ -3291,10 +3308,15 @@  static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
 	if (WARN_ON_ONCE(!pfn_valid(pfn)))
 		return -1;
 
+	/* try to avoid unnecessary memory loads */
+	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+		return -1;
+
 	return pfn;
 }
 
-static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr,
+				 struct pglist_data *pgdat)
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
@@ -3309,6 +3331,10 @@  static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
 	if (WARN_ON_ONCE(!pfn_valid(pfn)))
 		return -1;
 
+	/* try to avoid unnecessary memory loads */
+	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+		return -1;
+
 	return pfn;
 }
 
@@ -3317,10 +3343,6 @@  static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 {
 	struct folio *folio;
 
-	/* try to avoid unnecessary memory loads */
-	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
-		return NULL;
-
 	folio = pfn_folio(pfn);
 	if (folio_nid(folio) != pgdat->node_id)
 		return NULL;
@@ -3343,6 +3365,43 @@  static bool suitable_to_scan(int total, int young)
 	return young * n >= total;
 }
 
+static bool lru_gen_notifier_test_clear_young(struct mm_struct *mm,
+					      unsigned long start,
+					      unsigned long end,
+					      bool clear)
+{
+	return should_walk_secondary_mmu() &&
+		(mmu_notifier_test_clear_young_fast_only(
+				mm, start, end, clear) &
+		 MMU_NOTIFIER_FAST_YOUNG);
+}
+
+static bool lru_gen_notifier_test_young(struct mm_struct *mm,
+					unsigned long addr)
+{
+	return lru_gen_notifier_test_clear_young(mm, addr, addr + PAGE_SIZE,
+						 false);
+}
+
+static bool lru_gen_notifier_clear_young(struct mm_struct *mm,
+					 unsigned long start,
+					 unsigned long end)
+{
+	return lru_gen_notifier_test_clear_young(mm, start, end, true);
+}
+
+static bool lru_gen_pmdp_test_and_clear_young(struct vm_area_struct *vma,
+					      unsigned long addr,
+					      pmd_t *pmd)
+{
+	bool young = pmdp_test_and_clear_young(vma, addr, pmd);
+
+	if (lru_gen_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE))
+		young = true;
+
+	return young;
+}
+
 static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 			   struct mm_walk *args)
 {
@@ -3357,8 +3416,9 @@  static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	struct mm_struct *mm = args->mm;
 
-	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+	pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl);
 	if (!pte)
 		return false;
 	if (!spin_trylock(ptl)) {
@@ -3376,11 +3436,12 @@  static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		total++;
 		walk->mm_stats[MM_LEAF_TOTAL]++;
 
-		pfn = get_pte_pfn(ptent, args->vma, addr);
+		pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
 		if (pfn == -1)
 			continue;
 
-		if (!pte_young(ptent)) {
+		if (!pte_young(ptent) &&
+		    !lru_gen_notifier_test_young(mm, addr)) {
 			walk->mm_stats[MM_LEAF_OLD]++;
 			continue;
 		}
@@ -3389,8 +3450,9 @@  static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		if (!folio)
 			continue;
 
-		if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
-			VM_WARN_ON_ONCE(true);
+		lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
+		if (pte_young(ptent))
+			ptep_test_and_clear_young(args->vma, addr, pte + i);
 
 		young++;
 		walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -3456,22 +3518,25 @@  static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 		/* don't round down the first address */
 		addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;
 
-		pfn = get_pmd_pfn(pmd[i], vma, addr);
-		if (pfn == -1)
-			goto next;
-
-		if (!pmd_trans_huge(pmd[i])) {
-			if (should_clear_pmd_young())
+		if (pmd_present(pmd[i]) && !pmd_trans_huge(pmd[i])) {
+			if (should_clear_pmd_young() &&
+			    !should_walk_secondary_mmu())
 				pmdp_test_and_clear_young(vma, addr, pmd + i);
 			goto next;
 		}
 
+		pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat);
+		if (pfn == -1)
+			goto next;
+
 		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
 		if (!folio)
 			goto next;
 
-		if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+		if (!lru_gen_pmdp_test_and_clear_young(vma, addr, pmd + i)) {
+			walk->mm_stats[MM_LEAF_OLD]++;
 			goto next;
+		}
 
 		walk->mm_stats[MM_LEAF_YOUNG]++;
 
@@ -3528,19 +3593,18 @@  static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 		}
 
 		if (pmd_trans_huge(val)) {
-			unsigned long pfn = pmd_pfn(val);
 			struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+			unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat);
 
 			walk->mm_stats[MM_LEAF_TOTAL]++;
 
-			if (!pmd_young(val)) {
-				walk->mm_stats[MM_LEAF_OLD]++;
+			if (pfn == -1)
 				continue;
-			}
 
-			/* try to avoid unnecessary memory loads */
-			if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			if (!pmd_young(val) && !mm_has_notifiers(args->mm)) {
+				walk->mm_stats[MM_LEAF_OLD]++;
 				continue;
+			}
 
 			walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
 			continue;
@@ -3548,7 +3612,7 @@  static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 
 		walk->mm_stats[MM_NONLEAF_TOTAL]++;
 
-		if (should_clear_pmd_young()) {
+		if (should_clear_pmd_young() && !should_walk_secondary_mmu()) {
 			if (!pmd_young(val))
 				continue;
 
@@ -3994,6 +4058,47 @@  static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  *                          rmap/PT walk feedback
  ******************************************************************************/
 
+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+			       pte_t *pte, int *young)
+{
+	int notifier_result = MMU_NOTIFIER_FAST_FAILED;
+	bool notifier_was_fast = false;
+	bool secondary_young = false;
+
+	if (should_walk_secondary_mmu()) {
+		notifier_result =
+			mmu_notifier_test_clear_young_fast_only(
+					vma->vm_mm, addr, addr + PAGE_SIZE,
+					/*clear=*/true);
+	}
+
+	if (notifier_result & MMU_NOTIFIER_FAST_FAILED)
+		secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr,
+							   addr + PAGE_SIZE);
+	else {
+		secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG;
+		notifier_was_fast = true;
+	}
+
+	/*
+	 * Look around if (1) the PTE is young or (2) the secondary PTE was
+	 * young and the results were gathered fast (so look-around will
+	 * probably be accurate).
+	 */
+	if (pte_young(ptep_get(pte))) {
+		ptep_test_and_clear_young(vma, addr, pte);
+		*young = true;
+		return true;
+	}
+
+	if (secondary_young) {
+		*young = true;
+		return notifier_was_fast;
+	}
+
+	return false;
+}
+
 /*
  * This function exploits spatial locality when shrink_folio_list() walks the
  * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4001,7 +4106,7 @@  static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  * the PTE table to the Bloom filter. This forms a feedback loop between the
  * eviction and the aging.
  */
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
 	int i;
 	unsigned long start;
@@ -4019,16 +4124,20 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	struct mm_struct *mm = pvmw->vma->vm_mm;
 
 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
 
+	if (!should_look_around(vma, addr, pte, &young))
+		return young;
+
 	if (spin_is_contended(pvmw->ptl))
-		return;
+		return young;
 
 	/* exclude special VMAs containing anon pages from COW */
 	if (vma->vm_flags & VM_SPECIAL)
-		return;
+		return young;
 
 	/* avoid taking the LRU lock under the PTL when possible */
 	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4036,6 +4145,9 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	start = max(addr & PMD_MASK, vma->vm_start);
 	end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1;
 
+	if (end - start == PAGE_SIZE)
+		return young;
+
 	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
 		if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
 			end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4049,7 +4161,7 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 
 	/* folio_update_gen() requires stable folio_memcg() */
 	if (!mem_cgroup_trylock_pages(memcg))
-		return;
+		return young;
 
 	arch_enter_lazy_mmu_mode();
 
@@ -4059,19 +4171,21 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		unsigned long pfn;
 		pte_t ptent = ptep_get(pte + i);
 
-		pfn = get_pte_pfn(ptent, vma, addr);
+		pfn = get_pte_pfn(ptent, vma, addr, pgdat);
 		if (pfn == -1)
 			continue;
 
-		if (!pte_young(ptent))
+		if (!pte_young(ptent) &&
+		    !lru_gen_notifier_test_young(mm, addr))
 			continue;
 
 		folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
 		if (!folio)
 			continue;
 
-		if (!ptep_test_and_clear_young(vma, addr, pte + i))
-			VM_WARN_ON_ONCE(true);
+		lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
+		if (pte_young(ptent))
+			ptep_test_and_clear_young(vma, addr, pte + i);
 
 		young++;
 
@@ -4101,6 +4215,8 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	/* feedback from rmap walkers to page table walkers */
 	if (mm_state && suitable_to_scan(i, young))
 		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+
+	return young;
 }
 
 /******************************************************************************
@@ -5137,6 +5253,9 @@  static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
 	if (should_clear_pmd_young())
 		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
 
+	if (should_walk_secondary_mmu())
+		caps |= BIT(LRU_GEN_SECONDARY_MMU_WALK);
+
 	return sysfs_emit(buf, "0x%04x\n", caps);
 }