[v2,0/5] Avoid building lrugen page table walk code

Message ID 20230706062044.816068-1-aneesh.kumar@linux.ibm.com

Message

Aneesh Kumar K.V July 6, 2023, 6:20 a.m. UTC
This patchset avoids building changes added by commit bd74fdaea146 ("mm:
multi-gen LRU: support page table walks") on platforms that don't support
hardware atomic updates of access bits.

Aneesh Kumar K.V (5):
  mm/mglru: Create a new helper iterate_mm_list_walk
  mm/mglru: Move Bloom filter code around
  mm/mglru: Move code around to make future patch easy
  mm/mglru: move iterate_mm_list_walk Helper
  mm/mglru: Don't build multi-gen LRU page table walk code on
    architecture not supported

 arch/Kconfig               |   3 +
 arch/arm64/Kconfig         |   1 +
 arch/x86/Kconfig           |   1 +
 include/linux/memcontrol.h |   2 +-
 include/linux/mm_types.h   |  10 +-
 include/linux/mmzone.h     |  12 +-
 kernel/fork.c              |   2 +-
 mm/memcontrol.c            |   2 +-
 mm/vmscan.c                | 955 +++++++++++++++++++------------------
 9 files changed, 528 insertions(+), 460 deletions(-)

Comments

Yu Zhao July 7, 2023, 7:57 a.m. UTC | #1
On Thu, Jul 6, 2023 at 12:21 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> This patchset avoids building changes added by commit bd74fdaea146 ("mm:
> multi-gen LRU: support page table walks") on platforms that don't support
> hardware atomic updates of access bits.
>
> Aneesh Kumar K.V (5):
>   mm/mglru: Create a new helper iterate_mm_list_walk
>   mm/mglru: Move Bloom filter code around
>   mm/mglru: Move code around to make future patch easy
>   mm/mglru: move iterate_mm_list_walk Helper
>   mm/mglru: Don't build multi-gen LRU page table walk code on
>     architecture not supported
>
>  arch/Kconfig               |   3 +
>  arch/arm64/Kconfig         |   1 +
>  arch/x86/Kconfig           |   1 +
>  include/linux/memcontrol.h |   2 +-
>  include/linux/mm_types.h   |  10 +-
>  include/linux/mmzone.h     |  12 +-
>  kernel/fork.c              |   2 +-
>  mm/memcontrol.c            |   2 +-
>  mm/vmscan.c                | 955 +++++++++++++++++++------------------
>  9 files changed, 528 insertions(+), 460 deletions(-)

1. There is no need for a new Kconfig -- the condition is simply
   defined(CONFIG_LRU_GEN) && !defined(arch_has_hw_pte_young)

2. The best practice to disable static functions is not by macros but:

    static int const_cond(void)
    {
        return 1;
    }

    int main(void)
    {
        int a = const_cond();

        if (a)
                return 0;

        /* the compiler doesn't generate code for static funcs below */
        static_func_1();
        ...
        static_func_N();

LTO also optimizes external functions. But not everyone uses it. So we
still need macros for them, and of course data structures.
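
For a standalone illustration of the pattern in point 2 (the names are made up
for this example, not kernel code): once the condition folds to a compile-time
constant, the call below it is dead code, and the compiler emits nothing for
the unused static function.

    #include <stdio.h>

    /* stand-in for a capability check; folds to a compile-time constant */
    static int mm_walk_supported(void)
    {
            return 0;
    }

    /* with optimization enabled, no code is emitted for this function:
       its only call site below is unreachable */
    static void walk_page_tables(void)
    {
            printf("simulating a page table walk\n");
    }

    int main(void)
    {
            if (!mm_walk_supported())
                    return 0;

            /* dead code once mm_walk_supported() is inlined as 0 */
            walk_page_tables();
            return 0;
    }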

3. In 4/5, you have:

    @@ -461,6 +461,7 @@ enum {
     struct lru_gen_mm_state {
        /* set to max_seq after each iteration */
        unsigned long seq;
    +#ifdef CONFIG_LRU_TASK_PAGE_AGING
        /* where the current iteration continues after */
        struct list_head *head;
        /* where the last iteration ended before */
    @@ -469,6 +470,11 @@ struct lru_gen_mm_state {
        unsigned long *filters[NR_BLOOM_FILTERS];
        /* the mm stats for debugging */
        unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
    +#else
    +   /* protect the seq update above */
    +   /* May be we can use lruvec->lock?  */
    +   spinlock_t lock;
    +#endif
     };

The answer is yes, and not only that, we don't need lru_gen_mm_state at all.
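
The attached patch is not reproduced in this thread, so the following is only
a sketch of what such gating could look like -- the helper name get_mm_state()
appears in the diff quoted further down, but everything else here is an
assumption. With the condition from point 1, callers simply get NULL and fall
back to aging without page table walks:

    /* sketch only: the guard and field layout are assumptions */
    #if defined(CONFIG_LRU_GEN) && !defined(arch_has_hw_pte_young)
    static struct lru_gen_mm_state *get_mm_state(struct lruvec *lruvec)
    {
            /* no hardware-updated access bits: no mm list walk state */
            return NULL;
    }
    #else
    static struct lru_gen_mm_state *get_mm_state(struct lruvec *lruvec)
    {
            return &lruvec->mm_state;
    }
    #endif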

I'm attaching a patch that fixes all above. If you want to post it,
please feel free -- fully test it please, since I didn't. Otherwise I
can ask TJ to help make this work for you.

$ git diff --stat
 include/linux/memcontrol.h |   2 +-
 include/linux/mm_types.h   |  12 +-
 include/linux/mmzone.h     |   2 +
 kernel/bounds.c            |   6 +-
 kernel/fork.c              |   2 +-
 mm/vmscan.c                | 169 +++++++++++++++++++--------
 6 files changed, 137 insertions(+), 56 deletions(-)

On x86:

$ ./scripts/bloat-o-meter mm/vmscan.o.old mm/vmscan.o
add/remove: 24/34 grow/shrink: 2/7 up/down: 966/-8716 (-7750)
Function                                     old     new   delta
...
should_skip_vma                              206       -    -206
get_pte_pfn                                  261       -    -261
lru_gen_add_mm                               323       -    -323
lru_gen_seq_show                            1710    1370    -340
lru_gen_del_mm                               432       -    -432
reset_batch_size                             572       -    -572
try_to_inc_max_seq                          2947    1635   -1312
walk_pmd_range_locked                       1508       -   -1508
walk_pud_range                              3238       -   -3238
Total: Before=99449, After=91699, chg -7.79%

$ objdump -S mm/vmscan.o | grep -A 20 "<try_to_inc_max_seq>:"
000000000000a350 <try_to_inc_max_seq>:
{
    a350: e8 00 00 00 00        call   a355 <try_to_inc_max_seq+0x5>
    a355: 55                    push   %rbp
    a356: 48 89 e5              mov    %rsp,%rbp
    a359: 41 57                push   %r15
    a35b: 41 56                push   %r14
    a35d: 41 55                push   %r13
    a35f: 41 54                push   %r12
    a361: 53                    push   %rbx
    a362: 48 83 ec 70          sub    $0x70,%rsp
    a366: 41 89 d4              mov    %edx,%r12d
    a369: 49 89 f6              mov    %rsi,%r14
    a36c: 49 89 ff              mov    %rdi,%r15
spin_lock_irq(&lruvec->lru_lock);
    a36f: 48 8d 5f 50          lea    0x50(%rdi),%rbx
    a373: 48 89 df              mov    %rbx,%rdi
    a376: e8 00 00 00 00        call   a37b <try_to_inc_max_seq+0x2b>
success = max_seq == lrugen->max_seq;
    a37b: 49 8b 87 88 00 00 00 mov    0x88(%r15),%rax
    a382: 4c 39 f0              cmp    %r14,%rax
Aneesh Kumar K.V July 7, 2023, 1:24 p.m. UTC | #2
On 7/7/23 1:27 PM, Yu Zhao wrote:
> [...]
>
> I'm attaching a patch that fixes all above. If you want to post it,
> please feel free -- fully test it please, since I didn't. Otherwise I
> can ask TJ to help make this work for you.
>
> [...]

For the below diff:

@@ -4497,14 +4547,16 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 	struct lru_gen_mm_walk *walk;
 	struct mm_struct *mm = NULL;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
 
 	VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));
 
+	if (!mm_state)
+		return inc_max_seq(lruvec, max_seq, can_swap, force_scan);
+
 	/* see the comment in iterate_mm_list() */
-	if (max_seq <= READ_ONCE(lruvec->mm_state.seq)) {
-		success = false;
-		goto done;
-	}
+	if (max_seq <= READ_ONCE(mm_state->seq))
+		return false;
 
 	/*
 	 * If the hardware doesn't automatically set the accessed bit, fallback
@@ -4534,8 +4586,10 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 			walk_mm(lruvec, mm, walk);
 	} while (mm);
 done:
-	if (success)
-		inc_max_seq(lruvec, can_swap, force_scan);
+	if (success) {
+		success = inc_max_seq(lruvec, max_seq, can_swap, force_scan);
+		WARN_ON_ONCE(!success);
+	}
 
 	return success;
 }

We did discuss a possible race that can happen if we allow multiple callers to hit inc_max_seq() at the same time:
inc_max_seq() drops the lru_lock and restarts the loop at the previous value of type. So if we want to do the above,
we might also need the change below?


modified   mm/vmscan.c
@@ -4368,6 +4368,7 @@ void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
 	int type, zone;
 	struct lru_gen_struct *lrugen = &lruvec->lrugen;
 
+retry:
 	spin_lock_irq(&lruvec->lru_lock);
 
 	VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
@@ -4381,7 +4382,7 @@ void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
 		while (!inc_min_seq(lruvec, type, can_swap)) {
 			spin_unlock_irq(&lruvec->lru_lock);
 			cond_resched();
-			spin_lock_irq(&lruvec->lru_lock);
+			goto retry;
 		}
 	}
 
I also found that allowing only one CPU to increment the max_seq value, and making other requests
with the same max_seq return false, is useful in performance runs. That is, we need an equivalent of this?


+	if (max_seq <= READ_ONCE(mm_state->seq))
+		return false;
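
As a rough userspace model of the gate quoted just above (not kernel code; the
names, thread count and sequence value are made up, and the helper is only
loosely modelled on try_to_inc_max_seq()'s early return): every thread that
observed the same sequence number races to advance it, exactly one wins, and
the rest return false because the sequence they were asked to advance is
already stale.

    /* build with: gcc -O2 -pthread gate_demo.c -o gate_demo */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long max_seq = 4;       /* the generation being advanced */

    static bool try_to_advance(unsigned long seen_seq)
    {
            bool success;

            pthread_mutex_lock(&lock);
            /* someone else already advanced it: nothing left to do */
            success = (seen_seq == max_seq);
            if (success)
                    max_seq++;
            pthread_mutex_unlock(&lock);

            return success;
    }

    static void *worker(void *arg)
    {
            (void)arg;
            printf("advanced: %s\n", try_to_advance(4) ? "yes" : "no");
            return NULL;
    }

    int main(void)
    {
            pthread_t t[4];

            for (int i = 0; i < 4; i++)
                    pthread_create(&t[i], NULL, worker, NULL);
            for (int i = 0; i < 4; i++)
                    pthread_join(t[i], NULL);

            /* exactly one "yes"; max_seq ends at 5 */
            printf("max_seq = %lu\n", max_seq);
            return 0;
    }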
Yu Zhao July 8, 2023, 2:06 a.m. UTC | #3
On Fri, Jul 7, 2023 at 7:24 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> [...]
>
> For the below diff:
>
> @@ -4497,14 +4547,16 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
>         struct lru_gen_mm_walk *walk;
>         struct mm_struct *mm = NULL;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> +       struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
>
>         VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));
>
> +       if (!mm_state)
> +               return inc_max_seq(lruvec, max_seq, can_swap, force_scan);
> +
>         /* see the comment in iterate_mm_list() */
> -       if (max_seq <= READ_ONCE(lruvec->mm_state.seq)) {
> -               success = false;
> -               goto done;
> -       }
> +       if (max_seq <= READ_ONCE(mm_state->seq))
> +               return false;
>
>         /*
>          * If the hardware doesn't automatically set the accessed bit, fallback
> @@ -4534,8 +4586,10 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
>                         walk_mm(lruvec, mm, walk);
>         } while (mm);
>  done:
> -       if (success)
> -               inc_max_seq(lruvec, can_swap, force_scan);
> +       if (success) {
> +               success = inc_max_seq(lruvec, max_seq, can_swap, force_scan);
> +               WARN_ON_ONCE(!success);
> +       }
>
>         return success;
>  }
> @
>
> We did discuss a possible race that can happen if we allow multiple callers to hit inc_max_seq() at the same time:
> inc_max_seq() drops the lru_lock and restarts the loop at the previous value of type. So if we want to do the above,
> we might also need the change below?

Yes, you are right.

In fact, there is an existing bug here: even though max_seq can't
change without the patch, min_seq still can, and if it does, the
initial condition for inc_min_seq() to keep looping, i.e., nr_gens at
max, doesn't hold anymore.

        for (type = ANON_AND_FILE - 1; type >= 0; type--) {
                if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
                        continue;

This only affects the debugfs interface (force_scan=true), which is
probably why it was never reported.

> modified   mm/vmscan.c
> @@ -4368,6 +4368,7 @@ void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
>         int type, zone;
>         struct lru_gen_struct *lrugen = &lruvec->lrugen;
>
> +retry:
>         spin_lock_irq(&lruvec->lru_lock);
>
>         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> @@ -4381,7 +4382,7 @@ void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
>                 while (!inc_min_seq(lruvec, type, can_swap)) {
>                         spin_unlock_irq(&lruvec->lru_lock);
>                         cond_resched();
> -                       spin_lock_irq(&lruvec->lru_lock);
> +                       goto retry;
>                 }
>         }
>
> I also found that allowing only one CPU to increment the max_seq value, and making other requests
> with the same max_seq return false, is useful in performance runs. That is, we need an equivalent of this?
>
>
> +       if (max_seq <= READ_ONCE(mm_state->seq))
> +               return false;

Yes, but the condition should be "<" -- ">" would be a bug, and is already
asserted against in try_to_inc_max_seq().
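
One reading of that (an assumption -- the final form isn't shown in this
thread) is that the mm_state-less gate compares against lrugen->max_seq
directly: "max_seq > lrugen->max_seq" is already ruled out by the
VM_WARN_ON_ONCE at the top of try_to_inc_max_seq(), so only "<" can mean
another CPU got there first. Roughly:

    /* sketch, not the actual patch */
    if (max_seq < READ_ONCE(lrugen->max_seq))
            return false;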