Message ID | 20250411082857.2426539-1-tujinjiang@huawei.com (mailing list archive)
---|---
State | New
Series | mm/swap: set active flag after adding the folio to activate fbatch
Hi,

Andrew - could we drop this from mm-new until this gets fixed? Thanks!

This patch is currently breaking mm-new, which now does not boot with
CONFIG_DEBUG_VM.

It seems like you're underflowing the folio batch?

Stack trace below:

[    0.773399] ------------[ cut here ]------------
[    0.773400] mem_cgroup_update_lru_size(ffff88817e801440, 1, -1): lru_size -1

This is line 1321 in mm/memcontrol.c (I actually had to comment out the
VM_BUG_ON() as it otherwise mangles the output, fun! But if it's not removed
the kernel obviously just halts):

void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
				int zid, int nr_pages)
{
	struct mem_cgroup_per_node *mz;
	unsigned long *lru_size;
	long size;

	if (mem_cgroup_disabled())
		return;

	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	lru_size = &mz->lru_zone_size[zid][lru];

	if (nr_pages < 0)
		*lru_size += nr_pages;

	size = *lru_size;
	if (WARN_ONCE(size < 0,
		"%s(%p, %d, %d): lru_size %ld\n",
		__func__, lruvec, lru, nr_pages, size)) {
		VM_BUG_ON(1); <------------------------------------- here
		*lru_size = 0;
	}

	if (nr_pages > 0)
		*lru_size += nr_pages;
}

So, nr_pages = -1, size = -1.

So I guess the problem is that the mz->lru_zone_size[zid][lru] value is now
negative on delete? So we're underflowing the folio batch?

This is called from the always-inlined update_lru_size(), which simply passes
nr_pages in...

[    0.773415] WARNING: CPU: 2 PID: 92 at mm/memcontrol.c:1318 mem_cgroup_update_lru_size+0xa1/0xb0
[    0.773423] Modules linked in:
[    0.773425] CPU: 2 UID: 0 PID: 92 Comm: 9 Not tainted 6.15.0-rc1+ #8 PREEMPT(voluntary)
[    0.773426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    0.773427] RIP: 0010:mem_cgroup_update_lru_size+0xa1/0xb0
[    0.773429] Code: 00 00 00 00 eb ae c6 05 6f 39 77 01 01 90 89 f1 48 89 fa 41 89 d8 48 c7 c6 80 a7 22 82 48 c7 c7 6a 66 7c 82 e8 c0 23 dc ff 90 <0f> 0b 90 90 eb c9 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
[    0.773430] RSP: 0018:ffffc900002aba80 EFLAGS: 00010096
[    0.773431] RAX: 0000000000000040 RBX: ffffffffffffffff RCX: 0000000000000000
[    0.773432] RDX: 0000000000000002 RSI: ffffc900002ab938 RDI: 00000000ffffdfff
[    0.773432] RBP: ffff88817e801498 R08: 00000000ffffdfff R09: ffffffff82b18528
[    0.773433] R10: 0000000000000003 R11: 0000000000000002 R12: 0000000000000002
[    0.773433] R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000001
[    0.773435] FS:  00007f5e47c3e880(0000) GS:ffff8882e085a000(0000) knlGS:0000000000000000
[    0.773437] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.773438] CR2: 00007f5e475fb2f0 CR3: 000000017e802000 CR4: 0000000000750ef0
[    0.773438] PKRU: 55555554
[    0.773439] Call Trace:
[    0.773442]  <TASK>
[    0.773443]  lru_activate+0x121/0x320
[    0.773447]  ? __pfx_lru_activate+0x10/0x10

This indicates that the move_fn() below is in fact lru_activate().

With inlining I can't precisely say what ultimately calls
mem_cgroup_update_lru_size(), but given that the negative sign is the problem,
it seems almost certain to be lruvec_del_folio():

static __always_inline
void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
	...
	update_lru_size(lruvec, lru, folio_zonenum(folio),
			-folio_nr_pages(folio)); <-- negative.
}

This is called in lru_activate():

static void lru_activate(struct lruvec *lruvec, struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);

	...
	lruvec_del_folio(lruvec, folio);
	lruvec_add_folio(lruvec, folio);
	...
}
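As a concrete illustration of that arithmetic, here is a minimal userspace
model (a sketch only - lru_size and model_update_lru_size() are invented
stand-ins, not the kernel symbols): a delete that was never matched by an add
drives the counter negative, which is exactly the size < 0 case the
WARN_ONCE() fires on.

#include <stdio.h>

static long lru_size;	/* stands in for mz->lru_zone_size[zid][lru] */

static void model_update_lru_size(int nr_pages)
{
	if (nr_pages < 0)
		lru_size += nr_pages;

	if (lru_size < 0)	/* the WARN_ONCE(size < 0, ...) case */
		printf("underflow: lru_size %ld, nr_pages %d\n",
		       lru_size, nr_pages);

	if (nr_pages > 0)
		lru_size += nr_pages;
}

int main(void)
{
	model_update_lru_size(1);	/* folio added to the LRU: size 1 */
	model_update_lru_size(-1);	/* lruvec_del_folio(): size 0, fine */
	model_update_lru_size(-1);	/* unmatched delete: size -1, warns */
	return 0;
}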
[    0.773448]  folio_batch_move_lru+0x56/0xe0

This is:

static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
	...
	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
		move_fn(lruvec, folio); <----------------------------------- here
		folio_set_lru(folio);
	}
	...
}

[    0.773449]  __folio_batch_release+0x2c/0x50
[    0.773450]  shmem_undo_range+0x297/0x650
[    0.773454]  shmem_evict_inode+0xed/0x250
[    0.773455]  ? atime_needs_update+0x9b/0x110
[    0.773456]  evict+0xf2/0x260
[    0.773457]  __dentry_kill+0x6c/0x180
[    0.773458]  dput+0xe6/0x1b0
[    0.773460]  __fput+0x12d/0x2a0
[    0.773461]  __x64_sys_close+0x38/0x80
[    0.773463]  do_syscall_64+0x9e/0x1a0
[    0.773465]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    0.773466] RIP: 0033:0x7f5e4731b83b
[    0.773467] Code: ff ff c3 0f 1f 40 00 48 8b 15 d1 a4 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 a1 a4 0d 00 f7 d8
[    0.773468] RSP: 002b:00007ffdf7d82bc8 EFLAGS: 00000297 ORIG_RAX: 0000000000000003
[    0.773469] RAX: ffffffffffffffda RBX: 00005601862b32a0 RCX: 00007f5e4731b83b
[    0.773469] RDX: 00007f5e473f4ea0 RSI: 00005601862b3470 RDI: 000000000000002d
[    0.773470] RBP: 00007ffdf7d82bf0 R08: 00005601862b3010 R09: 0000000000000007
[    0.773470] R10: 00005601862b3480 R11: 0000000000000297 R12: 0000000000000000
[    0.773470] R13: 00007f5e473f4ff0 R14: 00007ffdf7d82f60 R15: 00007ffdf7d82d60
[    0.773471]  </TASK>
[    0.773471] ---[ end trace 0000000000000000 ]---
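The ordering contract in that drain loop can be modelled in a few lines of
userspace C (a sketch under assumed semantics - folio_model and the model_*
helpers are invented, not kernel APIs): move_fn() runs while the folio's LRU
flag is clear, and the flag is only restored afterwards, so any state
move_fn() depends on must be set between folio_test_clear_lru() and the drain.

#include <stdbool.h>
#include <stdio.h>

struct folio_model { bool lru, active; };

/* Mirrors what the patched lru_activate() expects to find on drain. */
static void model_lru_activate(struct folio_model *f)
{
	if (!f->active)
		printf("BUG: batched folio was never marked active\n");
}

static void model_batch_add(struct folio_model *f)
{
	if (!f->lru)
		return;		/* folio_test_clear_lru() failed */
	f->lru = false;		/* LRU flag cleared on entry to the batch */
	f->active = true;	/* the fix: set active before any drain */
}

static void model_batch_drain(struct folio_model *f)
{
	model_lru_activate(f);	/* move_fn() runs on the un-flagged folio */
	f->lru = true;		/* folio_set_lru() only after move_fn() */
}

int main(void)
{
	struct folio_model f = { .lru = true };

	model_batch_add(&f);
	model_batch_drain(&f);	/* silent: the invariant holds */
	return 0;
}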
On Fri, Apr 11, 2025 at 04:28:57PM +0800, Jinjiang Tu wrote:
> We noticed a 12.3% performance regression for the LibMicro pwrite testcase
> due to commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
> adding to LRU batch").
>
> The testcase is executed as follows, and the file is a tmpfs file:
>   pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>
> This testcase writes 1KB (only one page) to the tmpfs and repeats this
> step many times. The flame graph shows the performance regression comes
> from folio_mark_accessed() and workingset_activation().
>
> folio_mark_accessed() is called for the same page many times. Before the
> commit, each call would add the page to the activate fbatch. When the
> fbatch was full, it was drained and the page was promoted to the active
> list; folio_mark_accessed() then did nothing in later calls.
>
> But after the commit, the folio's LRU flag is cleared after it is added
> to the activate fbatch. From then on, folio_mark_accessed() will never
> call folio_activate() again because the folio lacks the LRU flag, so the
> fbatch never fills up and the folio is never marked active. Later
> folio_mark_accessed() calls therefore always call workingset_activation(),
> leading to the performance regression.
>
> Besides, repeated workingset_age_nonresident() calls before the folio is
> drained from the activate fbatch lead to an unreasonable
> lruvec->nonresident_age.
>
> To fix it, set the active flag after the LRU flag is cleared when adding
> the folio to the activate fbatch.
>
> Fixes: 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
>  mm/swap.c | 26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 77b2d5997873..f0de837988b4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>  	folios_put(fbatch);
>  }
>
> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
> +
>  static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>  		struct folio *folio, move_fn_t move_fn,
>  		bool on_lru, bool disable_irq)
> @@ -184,6 +186,14 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>  	if (on_lru && !folio_test_clear_lru(folio))
>  		return;
>
> +	if (move_fn == lru_activate) {
> +		if (folio_test_unevictable(folio)) {
> +			folio_set_lru(folio);
> +			return;
> +		}
> +		folio_set_active(folio);
> +	}
> +
>  	folio_get(folio);
>
>  	if (disable_irq)
> @@ -299,12 +309,15 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
>  {
>  	long nr_pages = folio_nr_pages(folio);
>
> -	if (folio_test_active(folio) || folio_test_unevictable(folio))
> -		return;
> -
> +	/*
> +	 * We check unevictable flag isn't set and set active flag
> +	 * after we clear lru flag. Unevictable and active flag
> +	 * couldn't be modified before we set lru flag again.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_test_active(folio));
> +	VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>  	lruvec_del_folio(lruvec, folio);
> -	folio_set_active(folio);
>  	lruvec_add_folio(lruvec, folio);
>  	trace_mm_lru_activate(folio);
>
> @@ -341,6 +354,11 @@ void folio_activate(struct folio *folio)
>  	if (!folio_test_clear_lru(folio))
>  		return;
>
> +	if (folio_test_unevictable(folio) || folio_test_active(folio)) {
> +		folio_set_lru(folio);
> +		return;
> +	}
> +	folio_set_active(folio);
>  	lruvec = folio_lruvec_lock_irq(folio);
>  	lru_activate(lruvec, folio);
>  	unlock_page_lruvec_irq(lruvec);
> --
> 2.43.0
>
On Mon, 14 Apr 2025 12:47:04 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Andrew - could we drop this from mm-new until this gets fixed? Thanks!
Done, thanks.
On 14.04.25 13:47, Lorenzo Stoakes wrote:
> Hi,
>
> Andrew - could we drop this from mm-new until this gets fixed? Thanks!
>
> This patch is currently breaking mm-new, which now does not boot with
> CONFIG_DEBUG_VM.
>
[snip]
>
> With inlining I can't precisely say what ultimately calls
> mem_cgroup_update_lru_size(), but given that the negative sign is the
> problem, it seems almost certain to be lruvec_del_folio():

Agreed. Maybe we are calling lruvec_del_folio() although we should not ...

I don't immediately see the issue, I'll try to understand what exactly is
going wrong here.

Thanks for digging into the issue!
diff --git a/mm/swap.c b/mm/swap.c
index 77b2d5997873..f0de837988b4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 	folios_put(fbatch);
 }
 
+static void lru_activate(struct lruvec *lruvec, struct folio *folio);
+
 static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
 		struct folio *folio, move_fn_t move_fn,
 		bool on_lru, bool disable_irq)
@@ -184,6 +186,14 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
 	if (on_lru && !folio_test_clear_lru(folio))
 		return;
 
+	if (move_fn == lru_activate) {
+		if (folio_test_unevictable(folio)) {
+			folio_set_lru(folio);
+			return;
+		}
+		folio_set_active(folio);
+	}
+
 	folio_get(folio);
 
 	if (disable_irq)
@@ -299,12 +309,15 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
 
-	if (folio_test_active(folio) || folio_test_unevictable(folio))
-		return;
-
+	/*
+	 * We check unevictable flag isn't set and set active flag
+	 * after we clear lru flag. Unevictable and active flag
+	 * couldn't be modified before we set lru flag again.
+	 */
+	VM_WARN_ON_ONCE(!folio_test_active(folio));
+	VM_WARN_ON_ONCE(folio_test_unevictable(folio));
 	lruvec_del_folio(lruvec, folio);
-	folio_set_active(folio);
 	lruvec_add_folio(lruvec, folio);
 	trace_mm_lru_activate(folio);
 
@@ -341,6 +354,11 @@ void folio_activate(struct folio *folio)
 	if (!folio_test_clear_lru(folio))
 		return;
 
+	if (folio_test_unevictable(folio) || folio_test_active(folio)) {
+		folio_set_lru(folio);
+		return;
+	}
+	folio_set_active(folio);
 	lruvec = folio_lruvec_lock_irq(folio);
 	lru_activate(lruvec, folio);
 	unlock_page_lruvec_irq(lruvec);
We noticed a 12.3% performance regression for the LibMicro pwrite testcase
due to commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
adding to LRU batch").

The testcase is executed as follows, and the file is a tmpfs file:

  pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE

This testcase writes 1KB (only one page) to the tmpfs and repeats this step
many times. The flame graph shows the performance regression comes from
folio_mark_accessed() and workingset_activation().

folio_mark_accessed() is called for the same page many times. Before the
commit, each call would add the page to the activate fbatch. When the fbatch
was full, it was drained and the page was promoted to the active list;
folio_mark_accessed() then did nothing in later calls.

But after the commit, the folio's LRU flag is cleared after it is added to
the activate fbatch. From then on, folio_mark_accessed() never calls
folio_activate() again because the folio lacks the LRU flag, so the fbatch
never fills up and the folio is never marked active. Later
folio_mark_accessed() calls therefore always call workingset_activation(),
leading to the performance regression.

Besides, repeated workingset_age_nonresident() calls before the folio is
drained from the activate fbatch lead to an unreasonable
lruvec->nonresident_age.

To fix it, set the active flag after the LRU flag is cleared when adding the
folio to the activate fbatch.

Fixes: 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/swap.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)
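A tiny userspace model of the mechanism described above (illustrative only -
folio_model and mark_accessed() are invented stand-ins, and the real
folio_mark_accessed() logic is more involved): without the fix, the first
call clears the LRU flag and every later call falls through to the expensive
workingset path; with the fix, the active flag short-circuits them.

#include <stdbool.h>
#include <stdio.h>

struct folio_model { bool lru, active; };

static int workingset_calls;

static void mark_accessed(struct folio_model *f, bool with_fix)
{
	if (f->active)
		return;			/* already active: cheap exit */
	if (f->lru) {
		f->lru = false;		/* joins the activate fbatch */
		if (with_fix)
			f->active = true; /* the patch: set active here */
		return;
	}
	workingset_calls++;		/* models workingset_activation() */
}

int main(void)
{
	struct folio_model f = { .lru = true };

	for (int i = 0; i < 500; i++)
		mark_accessed(&f, false);
	printf("without fix: %d workingset calls\n", workingset_calls);

	workingset_calls = 0;
	f = (struct folio_model){ .lru = true };
	for (int i = 0; i < 500; i++)
		mark_accessed(&f, true);
	printf("with fix:    %d workingset calls\n", workingset_calls);
	return 0;
}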