Message ID | 20230706062044.816068-1-aneesh.kumar@linux.ibm.com (mailing list archive)
---|---
Series | Avoid building lrugen page table walk code
On Thu, Jul 6, 2023 at 12:21 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> This patchset avoids building changes added by commit bd74fdaea146 ("mm:
> multi-gen LRU: support page table walks") on platforms that don't support
> hardware atomic updates of access bits.
>
> Aneesh Kumar K.V (5):
>   mm/mglru: Create a new helper iterate_mm_list_walk
>   mm/mglru: Move Bloom filter code around
>   mm/mglru: Move code around to make future patch easy
>   mm/mglru: move iterate_mm_list_walk Helper
>   mm/mglru: Don't build multi-gen LRU page table walk code on
>     architecture not supported
>
>  arch/Kconfig               |   3 +
>  arch/arm64/Kconfig         |   1 +
>  arch/x86/Kconfig           |   1 +
>  include/linux/memcontrol.h |   2 +-
>  include/linux/mm_types.h   |  10 +-
>  include/linux/mmzone.h     |  12 +-
>  kernel/fork.c              |   2 +-
>  mm/memcontrol.c            |   2 +-
>  mm/vmscan.c                | 955 +++++++++++++++++++------------------
>  9 files changed, 528 insertions(+), 460 deletions(-)

1. There is no need for a new Kconfig -- the condition is simply
   defined(CONFIG_LRU_GEN) && !defined(arch_has_hw_pte_young)

2. The best practice to disable static functions is not by macros but:

  static int const_cond(void)
  {
          return 1;
  }

  int main(void)
  {
          int a = const_cond();

          if (a)
                  return 0;

          /* the compiler doesn't generate code for static funcs below */
          static_func_1();
          ...
          static_func_N();

LTO also optimizes external functions. But not everyone uses it. So we
still need macros for them, and of course data structures.

3. In 4/5, you have:

  @@ -461,6 +461,7 @@ enum {
   struct lru_gen_mm_state {
          /* set to max_seq after each iteration */
          unsigned long seq;
  +#ifdef CONFIG_LRU_TASK_PAGE_AGING
          /* where the current iteration continues after */
          struct list_head *head;
          /* where the last iteration ended before */
  @@ -469,6 +470,11 @@ struct lru_gen_mm_state {
          unsigned long *filters[NR_BLOOM_FILTERS];
          /* the mm stats for debugging */
          unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
  +#else
  +       /* protect the seq update above */
  +       /* May be we can use lruvec->lock? */
  +       spinlock_t lock;
  +#endif
   };

The answer is yes, and not only that, we don't need lru_gen_mm_state at all.

I'm attaching a patch that fixes all above. If you want to post it,
please feel free -- fully test it please, since I didn't. Otherwise I
can ask TJ to help make this work for you.

  $ git diff --stat
   include/linux/memcontrol.h |   2 +-
   include/linux/mm_types.h   |  12 +-
   include/linux/mmzone.h     |   2 +
   kernel/bounds.c            |   6 +-
   kernel/fork.c              |   2 +-
   mm/vmscan.c                | 169 +++++++++++++++++++--------
   6 files changed, 137 insertions(+), 56 deletions(-)

On x86:

  $ ./scripts/bloat-o-meter mm/vmscan.o.old mm/vmscan.o
  add/remove: 24/34 grow/shrink: 2/7 up/down: 966/-8716 (-7750)
  Function                                  old     new   delta
  ...
  should_skip_vma                           206       -    -206
  get_pte_pfn                               261       -    -261
  lru_gen_add_mm                            323       -    -323
  lru_gen_seq_show                         1710    1370    -340
  lru_gen_del_mm                            432       -    -432
  reset_batch_size                          572       -    -572
  try_to_inc_max_seq                       2947    1635   -1312
  walk_pmd_range_locked                    1508       -   -1508
  walk_pud_range                           3238       -   -3238
  Total: Before=99449, After=91699, chg -7.79%

  $ objdump -S mm/vmscan.o | grep -A 20 "<try_to_inc_max_seq>:"
  000000000000a350 <try_to_inc_max_seq>:
  {
      a350:       e8 00 00 00 00          call   a355 <try_to_inc_max_seq+0x5>
      a355:       55                      push   %rbp
      a356:       48 89 e5                mov    %rsp,%rbp
      a359:       41 57                   push   %r15
      a35b:       41 56                   push   %r14
      a35d:       41 55                   push   %r13
      a35f:       41 54                   push   %r12
      a361:       53                      push   %rbx
      a362:       48 83 ec 70             sub    $0x70,%rsp
      a366:       41 89 d4                mov    %edx,%r12d
      a369:       49 89 f6                mov    %rsi,%r14
      a36c:       49 89 ff                mov    %rdi,%r15
          spin_lock_irq(&lruvec->lru_lock);
      a36f:       48 8d 5f 50             lea    0x50(%rdi),%rbx
      a373:       48 89 df                mov    %rbx,%rdi
      a376:       e8 00 00 00 00          call   a37b <try_to_inc_max_seq+0x2b>
          success = max_seq == lrugen->max_seq;
      a37b:       49 8b 87 88 00 00 00    mov    0x88(%r15),%rax
      a382:       4c 39 f0                cmp    %r14,%rax
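
To make point 2 above concrete, here is a minimal, self-contained userspace
sketch (not kernel code; the names walk_supported() and walk_page_tables()
are made up): when the condition helper folds to a compile-time constant
zero, the compiler drops the guarded call, and the static callee, left with
no callers, generates no code at all, so no #ifdef is needed around its body.

  #include <stdio.h>

  static int walk_supported(void)
  {
          /*
           * Stands in for a compile-time check such as whether the
           * architecture updates the accessed bit in hardware.
           */
          return 0;
  }

  static void walk_page_tables(void)
  {
          puts("walking page tables");
  }

  int main(void)
  {
          if (walk_supported())
                  walk_page_tables();     /* dead code when the helper is a constant 0 */

          return 0;
  }

Compiled with optimizations enabled (e.g. gcc -O2), the object file should
contain no code for walk_page_tables() -- the same effect the bloat-o-meter
numbers above show for the real page table walk functions.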
On 7/7/23 1:27 PM, Yu Zhao wrote:
> On Thu, Jul 6, 2023 at 12:21 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> This patchset avoids building changes added by commit bd74fdaea146 ("mm:
>> multi-gen LRU: support page table walks") on platforms that don't support
>> hardware atomic updates of access bits.
>>
>> Aneesh Kumar K.V (5):
>>   mm/mglru: Create a new helper iterate_mm_list_walk
>>   mm/mglru: Move Bloom filter code around
>>   mm/mglru: Move code around to make future patch easy
>>   mm/mglru: move iterate_mm_list_walk Helper
>>   mm/mglru: Don't build multi-gen LRU page table walk code on
>>     architecture not supported
>>
>>  arch/Kconfig               |   3 +
>>  arch/arm64/Kconfig         |   1 +
>>  arch/x86/Kconfig           |   1 +
>>  include/linux/memcontrol.h |   2 +-
>>  include/linux/mm_types.h   |  10 +-
>>  include/linux/mmzone.h     |  12 +-
>>  kernel/fork.c              |   2 +-
>>  mm/memcontrol.c            |   2 +-
>>  mm/vmscan.c                | 955 +++++++++++++++++++------------------
>>  9 files changed, 528 insertions(+), 460 deletions(-)
>
> 1. There is no need for a new Kconfig -- the condition is simply
>    defined(CONFIG_LRU_GEN) && !defined(arch_has_hw_pte_young)
>
> 2. The best practice to disable static functions is not by macros but:
>
>   static int const_cond(void)
>   {
>           return 1;
>   }
>
>   int main(void)
>   {
>           int a = const_cond();
>
>           if (a)
>                   return 0;
>
>           /* the compiler doesn't generate code for static funcs below */
>           static_func_1();
>           ...
>           static_func_N();
>
> LTO also optimizes external functions. But not everyone uses it. So we
> still need macros for them, and of course data structures.
>
> 3. In 4/5, you have:
>
>   @@ -461,6 +461,7 @@ enum {
>    struct lru_gen_mm_state {
>           /* set to max_seq after each iteration */
>           unsigned long seq;
>   +#ifdef CONFIG_LRU_TASK_PAGE_AGING
>           /* where the current iteration continues after */
>           struct list_head *head;
>           /* where the last iteration ended before */
>   @@ -469,6 +470,11 @@ struct lru_gen_mm_state {
>           unsigned long *filters[NR_BLOOM_FILTERS];
>           /* the mm stats for debugging */
>           unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
>   +#else
>   +       /* protect the seq update above */
>   +       /* May be we can use lruvec->lock? */
>   +       spinlock_t lock;
>   +#endif
>    };
>
> The answer is yes, and not only that, we don't need lru_gen_mm_state at all.
>
> I'm attaching a patch that fixes all above. If you want to post it,
> please feel free -- fully test it please, since I didn't. Otherwise I
> can ask TJ to help make this work for you.
>
>   $ git diff --stat
>    include/linux/memcontrol.h |   2 +-
>    include/linux/mm_types.h   |  12 +-
>    include/linux/mmzone.h     |   2 +
>    kernel/bounds.c            |   6 +-
>    kernel/fork.c              |   2 +-
>    mm/vmscan.c                | 169 +++++++++++++++++++--------
>    6 files changed, 137 insertions(+), 56 deletions(-)
>
> On x86:
>
>   $ ./scripts/bloat-o-meter mm/vmscan.o.old mm/vmscan.o
>   add/remove: 24/34 grow/shrink: 2/7 up/down: 966/-8716 (-7750)
>   Function                                  old     new   delta
>   ...
>   should_skip_vma                           206       -    -206
>   get_pte_pfn                               261       -    -261
>   lru_gen_add_mm                            323       -    -323
>   lru_gen_seq_show                         1710    1370    -340
>   lru_gen_del_mm                            432       -    -432
>   reset_batch_size                          572       -    -572
>   try_to_inc_max_seq                       2947    1635   -1312
>   walk_pmd_range_locked                    1508       -   -1508
>   walk_pud_range                           3238       -   -3238
>   Total: Before=99449, After=91699, chg -7.79%
>
>   $ objdump -S mm/vmscan.o | grep -A 20 "<try_to_inc_max_seq>:"
>   000000000000a350 <try_to_inc_max_seq>:
>   {
>       a350:       e8 00 00 00 00          call   a355 <try_to_inc_max_seq+0x5>
>       a355:       55                      push   %rbp
>       a356:       48 89 e5                mov    %rsp,%rbp
>       a359:       41 57                   push   %r15
>       a35b:       41 56                   push   %r14
>       a35d:       41 55                   push   %r13
>       a35f:       41 54                   push   %r12
>       a361:       53                      push   %rbx
>       a362:       48 83 ec 70             sub    $0x70,%rsp
>       a366:       41 89 d4                mov    %edx,%r12d
>       a369:       49 89 f6                mov    %rsi,%r14
>       a36c:       49 89 ff                mov    %rdi,%r15
>           spin_lock_irq(&lruvec->lru_lock);
>       a36f:       48 8d 5f 50             lea    0x50(%rdi),%rbx
>       a373:       48 89 df                mov    %rbx,%rdi
>       a376:       e8 00 00 00 00          call   a37b <try_to_inc_max_seq+0x2b>
>           success = max_seq == lrugen->max_seq;
>       a37b:       49 8b 87 88 00 00 00    mov    0x88(%r15),%rax
>       a382:       4c 39 f0                cmp    %r14,%rax

For the below diff:

@@ -4497,14 +4547,16 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
         struct lru_gen_mm_walk *walk;
         struct mm_struct *mm = NULL;
         struct lru_gen_folio *lrugen = &lruvec->lrugen;
+        struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);

         VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));

+        if (!mm_state)
+                return inc_max_seq(lruvec, max_seq, can_swap, force_scan);
+
         /* see the comment in iterate_mm_list() */
-        if (max_seq <= READ_ONCE(lruvec->mm_state.seq)) {
-                success = false;
-                goto done;
-        }
+        if (max_seq <= READ_ONCE(mm_state->seq))
+                return false;

         /*
          * If the hardware doesn't automatically set the accessed bit, fallback
@@ -4534,8 +4586,10 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
                 walk_mm(lruvec, mm, walk);
         } while (mm);
 done:
-        if (success)
-                inc_max_seq(lruvec, can_swap, force_scan);
+        if (success) {
+                success = inc_max_seq(lruvec, max_seq, can_swap, force_scan);
+                WARN_ON_ONCE(!success);
+        }

         return success;
 }

We did discuss a possible race that can happen if we allow multiple callers
to hit inc_max_seq at the same time. inc_max_seq drops the lru_lock and
restarts the loop at the previous value of type. i.e. if we want to do the
above, we might also need the below?

modified   mm/vmscan.c
@@ -4368,6 +4368,7 @@ void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
         int type, zone;
         struct lru_gen_struct *lrugen = &lruvec->lrugen;

+retry:
         spin_lock_irq(&lruvec->lru_lock);

         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));

@@ -4381,7 +4382,7 @@ void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
                 while (!inc_min_seq(lruvec, type, can_swap)) {
                         spin_unlock_irq(&lruvec->lru_lock);
                         cond_resched();
-                        spin_lock_irq(&lruvec->lru_lock);
+                        goto retry;
                 }
         }

I also found that allowing only one cpu to increment the max seq value and
making other requests with the same max_seq return false is also useful in
performance runs. i.e. we need an equivalent of this?

+        if (max_seq <= READ_ONCE(mm_state->seq))
+                return false;
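
As a rough illustration of why the proposed change jumps back to retry
instead of simply reacquiring the lock, here is a small userspace sketch
(made-up names; a pthread mutex stands in for lru_lock): once the lock is
dropped in the middle of the loop, anything validated earlier under the lock
may have changed, so the function restarts from the top and revalidates
rather than continuing in place.

  #include <pthread.h>
  #include <sched.h>
  #include <stdbool.h>

  static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
  static int pending;     /* shared state; other threads may change it */

  /* may fail transiently, like inc_min_seq() in the discussion above */
  static bool try_advance(void)
  {
          if (pending > 0) {
                  pending--;
                  return true;
          }
          return false;
  }

  static void advance_all(void)
  {
  retry:
          pthread_mutex_lock(&big_lock);

          while (!try_advance()) {
                  /*
                   * Back off: drop the lock, let others run, then restart
                   * from the top so all state is re-checked under the lock
                   * instead of assumed to be unchanged.
                   */
                  pthread_mutex_unlock(&big_lock);
                  sched_yield();
                  goto retry;
          }

          pthread_mutex_unlock(&big_lock);
  }

  int main(void)
  {
          pending = 3;
          advance_all();
          return 0;
  }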
On Fri, Jul 7, 2023 at 7:24 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 7/7/23 1:27 PM, Yu Zhao wrote:
> > On Thu, Jul 6, 2023 at 12:21 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> This patchset avoids building changes added by commit bd74fdaea146 ("mm:
> >> multi-gen LRU: support page table walks") on platforms that don't support
> >> hardware atomic updates of access bits.
> >>
> >> Aneesh Kumar K.V (5):
> >>   mm/mglru: Create a new helper iterate_mm_list_walk
> >>   mm/mglru: Move Bloom filter code around
> >>   mm/mglru: Move code around to make future patch easy
> >>   mm/mglru: move iterate_mm_list_walk Helper
> >>   mm/mglru: Don't build multi-gen LRU page table walk code on
> >>     architecture not supported
> >>
> >>  arch/Kconfig               |   3 +
> >>  arch/arm64/Kconfig         |   1 +
> >>  arch/x86/Kconfig           |   1 +
> >>  include/linux/memcontrol.h |   2 +-
> >>  include/linux/mm_types.h   |  10 +-
> >>  include/linux/mmzone.h     |  12 +-
> >>  kernel/fork.c              |   2 +-
> >>  mm/memcontrol.c            |   2 +-
> >>  mm/vmscan.c                | 955 +++++++++++++++++++------------------
> >>  9 files changed, 528 insertions(+), 460 deletions(-)
> >
> > 1. There is no need for a new Kconfig -- the condition is simply
> >    defined(CONFIG_LRU_GEN) && !defined(arch_has_hw_pte_young)
> >
> > 2. The best practice to disable static functions is not by macros but:
> >
> >   static int const_cond(void)
> >   {
> >           return 1;
> >   }
> >
> >   int main(void)
> >   {
> >           int a = const_cond();
> >
> >           if (a)
> >                   return 0;
> >
> >           /* the compiler doesn't generate code for static funcs below */
> >           static_func_1();
> >           ...
> >           static_func_N();
> >
> > LTO also optimizes external functions. But not everyone uses it. So we
> > still need macros for them, and of course data structures.
> >
> > 3. In 4/5, you have:
> >
> >   @@ -461,6 +461,7 @@ enum {
> >    struct lru_gen_mm_state {
> >           /* set to max_seq after each iteration */
> >           unsigned long seq;
> >   +#ifdef CONFIG_LRU_TASK_PAGE_AGING
> >           /* where the current iteration continues after */
> >           struct list_head *head;
> >           /* where the last iteration ended before */
> >   @@ -469,6 +470,11 @@ struct lru_gen_mm_state {
> >           unsigned long *filters[NR_BLOOM_FILTERS];
> >           /* the mm stats for debugging */
> >           unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
> >   +#else
> >   +       /* protect the seq update above */
> >   +       /* May be we can use lruvec->lock? */
> >   +       spinlock_t lock;
> >   +#endif
> >    };
> >
> > The answer is yes, and not only that, we don't need lru_gen_mm_state at all.
> >
> > I'm attaching a patch that fixes all above. If you want to post it,
> > please feel free -- fully test it please, since I didn't. Otherwise I
> > can ask TJ to help make this work for you.
> >
> >   $ git diff --stat
> >    include/linux/memcontrol.h |   2 +-
> >    include/linux/mm_types.h   |  12 +-
> >    include/linux/mmzone.h     |   2 +
> >    kernel/bounds.c            |   6 +-
> >    kernel/fork.c              |   2 +-
> >    mm/vmscan.c                | 169 +++++++++++++++++++--------
> >    6 files changed, 137 insertions(+), 56 deletions(-)
> >
> > On x86:
> >
> >   $ ./scripts/bloat-o-meter mm/vmscan.o.old mm/vmscan.o
> >   add/remove: 24/34 grow/shrink: 2/7 up/down: 966/-8716 (-7750)
> >   Function                                  old     new   delta
> >   ...
> >   should_skip_vma                           206       -    -206
> >   get_pte_pfn                               261       -    -261
> >   lru_gen_add_mm                            323       -    -323
> >   lru_gen_seq_show                         1710    1370    -340
> >   lru_gen_del_mm                            432       -    -432
> >   reset_batch_size                          572       -    -572
> >   try_to_inc_max_seq                       2947    1635   -1312
> >   walk_pmd_range_locked                    1508       -   -1508
> >   walk_pud_range                           3238       -   -3238
> >   Total: Before=99449, After=91699, chg -7.79%
> >
> >   $ objdump -S mm/vmscan.o | grep -A 20 "<try_to_inc_max_seq>:"
> >   000000000000a350 <try_to_inc_max_seq>:
> >   {
> >       a350:       e8 00 00 00 00          call   a355 <try_to_inc_max_seq+0x5>
> >       a355:       55                      push   %rbp
> >       a356:       48 89 e5                mov    %rsp,%rbp
> >       a359:       41 57                   push   %r15
> >       a35b:       41 56                   push   %r14
> >       a35d:       41 55                   push   %r13
> >       a35f:       41 54                   push   %r12
> >       a361:       53                      push   %rbx
> >       a362:       48 83 ec 70             sub    $0x70,%rsp
> >       a366:       41 89 d4                mov    %edx,%r12d
> >       a369:       49 89 f6                mov    %rsi,%r14
> >       a36c:       49 89 ff                mov    %rdi,%r15
> >           spin_lock_irq(&lruvec->lru_lock);
> >       a36f:       48 8d 5f 50             lea    0x50(%rdi),%rbx
> >       a373:       48 89 df                mov    %rbx,%rdi
> >       a376:       e8 00 00 00 00          call   a37b <try_to_inc_max_seq+0x2b>
> >           success = max_seq == lrugen->max_seq;
> >       a37b:       49 8b 87 88 00 00 00    mov    0x88(%r15),%rax
> >       a382:       4c 39 f0                cmp    %r14,%rax
>
> For the below diff:
>
> @@ -4497,14 +4547,16 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
>          struct lru_gen_mm_walk *walk;
>          struct mm_struct *mm = NULL;
>          struct lru_gen_folio *lrugen = &lruvec->lrugen;
> +        struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
>
>          VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));
>
> +        if (!mm_state)
> +                return inc_max_seq(lruvec, max_seq, can_swap, force_scan);
> +
>          /* see the comment in iterate_mm_list() */
> -        if (max_seq <= READ_ONCE(lruvec->mm_state.seq)) {
> -                success = false;
> -                goto done;
> -        }
> +        if (max_seq <= READ_ONCE(mm_state->seq))
> +                return false;
>
>          /*
>           * If the hardware doesn't automatically set the accessed bit, fallback
> @@ -4534,8 +4586,10 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
>                  walk_mm(lruvec, mm, walk);
>          } while (mm);
>  done:
> -        if (success)
> -                inc_max_seq(lruvec, can_swap, force_scan);
> +        if (success) {
> +                success = inc_max_seq(lruvec, max_seq, can_swap, force_scan);
> +                WARN_ON_ONCE(!success);
> +        }
>
>          return success;
>  }
>
> We did discuss a possible race that can happen if we allow multiple callers
> to hit inc_max_seq at the same time. inc_max_seq drops the lru_lock and
> restarts the loop at the previous value of type. i.e. if we want to do the
> above, we might also need the below?

Yes, you are right. In fact, there is an existing bug here: even though
max_seq can't change without the patch, min_seq still can, and if it does,
the initial condition for inc_min_seq() to keep looping, i.e., nr_gens at
max, doesn't hold anymore.

        for (type = ANON_AND_FILE - 1; type >= 0; type--) {
                if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
                        continue;

This only affects the debugfs interface (force_scan=true), which is
probably why it was never reported.

> modified   mm/vmscan.c
> @@ -4368,6 +4368,7 @@ void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
>          int type, zone;
>          struct lru_gen_struct *lrugen = &lruvec->lrugen;
>
> +retry:
>          spin_lock_irq(&lruvec->lru_lock);
>
>          VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
>
> @@ -4381,7 +4382,7 @@ void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
>                  while (!inc_min_seq(lruvec, type, can_swap)) {
>                          spin_unlock_irq(&lruvec->lru_lock);
>                          cond_resched();
> -                        spin_lock_irq(&lruvec->lru_lock);
> +                        goto retry;
>                  }
>          }
>
> I also found that allowing only one cpu to increment the max seq value and
> making other requests with the same max_seq return false is also useful in
> performance runs. i.e. we need an equivalent of this?
>
> +        if (max_seq <= READ_ONCE(mm_state->seq))
> +                return false;

Yes, but the condition should be "<" -- ">" is a bug, which was asserted
in try_to_inc_max_seq().
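
The "only one caller advances the sequence" behavior discussed above can be
illustrated with a toy userspace sketch (made-up names; this is not the MGLRU
code): each caller passes the sequence number it observed, only the caller
whose snapshot is still current performs the increment, and everyone else
returns false instead of repeating the work -- the same shape as the
"success = max_seq == lrugen->max_seq" check visible in the objdump output
earlier in the thread.

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  static pthread_mutex_t seq_lock = PTHREAD_MUTEX_INITIALIZER;
  static unsigned long cur_seq;

  static bool try_to_advance_seq(unsigned long seen_seq)
  {
          bool success;

          pthread_mutex_lock(&seq_lock);
          /* a stale snapshot means someone else already did the work */
          success = (seen_seq == cur_seq);
          if (success)
                  cur_seq++;
          pthread_mutex_unlock(&seq_lock);

          return success;
  }

  int main(void)
  {
          unsigned long seen = cur_seq;

          printf("%d\n", try_to_advance_seq(seen));  /* 1: this caller advanced */
          printf("%d\n", try_to_advance_seq(seen));  /* 0: snapshot is now stale */
          return 0;
  }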