Message ID | 20240613000721.23093-4-21cnbao@gmail.com (mailing list archive)
---|---
State | New
Series | mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()
On Thu, Jun 13, 2024 at 12:08 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> The folio_test_anon(folio)==false case within do_swap_page() has been
> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
> consistently pass anonymous folios.
>
> stack 1:
> remove_migration_pmd
>    -> folio_add_anon_rmap_pmd
>       -> __folio_add_anon_rmap
>
> stack 2:
> __split_huge_pmd_locked
>    -> folio_add_anon_rmap_ptes
>       -> __folio_add_anon_rmap
>
> __folio_add_anon_rmap() only needs to handle the cases
> folio_test_anon(folio)==true now.

My team reported a case where swapoff() is calling
folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
with one new anon (!folio_test_anon(folio)).

I will double check all folio_add_anon_rmap_pte() cases.

>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/rmap.c | 17 +++--------------
>  1 file changed, 3 insertions(+), 14 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e612d999811a..e84c706c8241 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1299,21 +1299,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>
>  	nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped);
>
> -	if (unlikely(!folio_test_anon(folio))) {
> -		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> -		/*
> -		 * For a PTE-mapped large folio, we only know that the single
> -		 * PTE is exclusive. Further, __folio_set_anon() might not get
> -		 * folio->index right when not given the address of the head
> -		 * page.
> -		 */
> -		VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> -				 level != RMAP_LEVEL_PMD, folio);
> -		__folio_set_anon(folio, vma, address,
> -				 !!(flags & RMAP_EXCLUSIVE));
> -	} else if (likely(!folio_test_ksm(folio))) {
> +	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +
> +	if (likely(!folio_test_ksm(folio)))
>  		__page_check_anon_rmap(folio, page, vma, address);
> -	}
>
>  	__folio_mod_stat(folio, nr, nr_pmdmapped);
>
> --
> 2.34.1
>
On 13.06.24 10:46, Barry Song wrote:
> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <21cnbao@gmail.com> wrote:
[...]
>
> My team reported a case where swapoff() is calling
> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
> with one new anon (!folio_test_anon(folio)).
>
> I will double check all folio_add_anon_rmap_pte() cases.

Right, swapoff() path is a bit special (e.g., we don't do any kind of
batching on the swapoff() path).

But we should never get a new large anon folio there, or could that now
happen?
On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <david@redhat.com> wrote:
>
[...]
>
> Right, swapoff() path is a bit special (e.g., we don't do any kind of
> batching on the swapoff() path).
>
> But we should never get a new large anon folio there, or could that now
> happen?

My team encountered the following issue while testing this RFC
series on real hardware. Let me take a look to identify the
problem.

[  261.214195][T11285] page:000000004cdd779e refcount:4 mapcount:1 mapping:00000000ba142c22 index:0x1 pfn:0x1b30a6
[  261.214213][T11285] memcg:ffffff8003678000
[  261.214217][T11285] aops:swap_aops
[  261.214233][T11285] flags: 0x2000000000081009(locked|uptodate|owner_priv_1|swapbacked|zone=1|kasantag=0x0)
[  261.214241][T11285] page_type: 0x0()
[  261.214246][T11285] raw: 2000000000081009 0000000000000000 dead000000000122 0000000000000000
[  261.214251][T11285] raw: 0000000000000001 00000000000d84b3 0000000400000000 ffffff8003678000
[  261.214254][T11285] page dumped because: VM_WARN_ON_FOLIO(!folio_test_anon(folio))
[  261.214257][T11285] page_owner tracks the page as allocated
[  261.214260][T11285] page last allocated via order 0, migratetype Movable, gfp_mask 0x2140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), pid 11285, tgid 11285 (swapoff), ts 261214177545, free_ts 261151875699
[  261.214268][T11285]  post_alloc_hook+0x1b8/0x1c0
[  261.214284][T11285]  prep_new_page+0x28/0x13c
[  261.214291][T11285]  get_page_from_freelist+0x198c/0x1aa4
[  261.214298][T11285]  __alloc_pages+0x15c/0x330
[  261.214304][T11285]  __folio_alloc+0x1c/0x4c
[  261.214310][T11285]  __read_swap_cache_async+0xd8/0x48c
[  261.214320][T11285]  swap_cluster_readahead+0x158/0x324
[  261.214326][T11285]  swapin_readahead+0x64/0x448
[  261.214331][T11285]  __arm64_sys_swapoff+0x6ec/0x14b0
[  261.214337][T11285]  invoke_syscall+0x58/0x114
[  261.214353][T11285]  el0_svc_common+0xac/0xe0
[  261.214360][T11285]  do_el0_svc+0x1c/0x28
[  261.214366][T11285]  el0_svc+0x38/0x68
[  261.214372][T11285]  el0t_64_sync_handler+0x68/0xbc
[  261.214376][T11285]  el0t_64_sync+0x1a8/0x1ac
[  261.214381][T11285] page last free pid 90 tgid 90 stack trace:
[  261.214386][T11285]  free_unref_page_prepare+0x338/0x374
[  261.214395][T11285]  free_unref_page_list+0x84/0x378
[  261.214400][T11285]  shrink_folio_list+0x1234/0x13e4
[  261.214409][T11285]  evict_folios+0x1458/0x19b4
[  261.214417][T11285]  try_to_shrink_lruvec+0x1c8/0x264
[  261.214422][T11285]  shrink_one+0xa8/0x234
[  261.214427][T11285]  shrink_node+0xb38/0xde0
[  261.214432][T11285]  balance_pgdat+0x7a4/0xdb4
[  261.214437][T11285]  kswapd+0x290/0x4e4
[  261.214442][T11285]  kthread+0x114/0x1bc
[  261.214459][T11285]  ret_from_fork+0x10/0x20
[  261.214477][T11285] ------------[ cut here ]------------
[  261.214480][T11285] WARNING: CPU: 3 PID: 11285 at mm/rmap.c:1305 folio_add_anon_rmap_ptes+0x2b4/0x330
[  261.215403][T11285] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  261.215423][T11285] pc : folio_add_anon_rmap_ptes+0x2b4/0x330
[  261.215428][T11285] lr : folio_add_anon_rmap_ptes+0x2b4/0x330
[  261.215433][T11285] sp : ffffffc0a7dbbbf0
[  261.215437][T11285] x29: ffffffc0a7dbbbf0 x28: ffffff8040860a08 x27: ffffff80db480000
[  261.215445][T11285] x26: fffffffe04cc2980 x25: ffffff808f77f120 x24: 0000007b44941000
[  261.215452][T11285] x23: 0000000000000001 x22: 0000000000000001 x21: fffffffe04cc2980
[  261.215459][T11285] x20: ffffff80db480000 x19: fffffffe04cc2980 x18: ffffffed011dae80
[  261.215465][T11285] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000004
[  261.215471][T11285] x14: ffffff82fb73fac0 x13: 0000000000000003 x12: 0000000000000003
[  261.215476][T11285] x11: 00000000fffeffff x10: c0000000fffeffff x9 : 0d46c0889b468e00
[  261.215483][T11285] x8 : 0d46c0889b468e00 x7 : 545b5d3935343431 x6 : 322e31363220205b
[  261.215489][T11285] x5 : ffffffed013de407 x4 : ffffffed00698967 x3 : 0000000000000000
[  261.215495][T11285] x2 : 0000000000000000 x1 : ffffffc0a7dbb8c0 x0 : ffffff8068c15c80
[  261.215501][T11285] Call trace:
[  261.215504][T11285]  folio_add_anon_rmap_ptes+0x2b4/0x330
[  261.215509][T11285]  __arm64_sys_swapoff+0x8cc/0x14b0
[  261.215516][T11285]  invoke_syscall+0x58/0x114
[  261.215526][T11285]  el0_svc_common+0xac/0xe0
[  261.215532][T11285]  do_el0_svc+0x1c/0x28
[  261.215539][T11285]  el0_svc+0x38/0x68
[  261.215544][T11285]  el0t_64_sync_handler+0x68/0xbc
[  261.215548][T11285]  el0t_64_sync+0x1a8/0x1ac
[  261.215552][T11285] ---[ end trace 0000000000000000 ]---

Thanks
Barry
On 13.06.24 11:06, Barry Song wrote:
> My team encountered the following issue while testing this RFC
> series on real hardware. Let me take a look to identify the
> problem.
>
[...]
> [  261.215552][T11285] ---[ end trace 0000000000000000 ]---

Ah, yes. in unuse_pte(), you'll have to do the right thing if
!folio_test(anon) before doing the folio_add_anon_rmap_pte().

You might have a fresh anon folio in the swapcache that was never mapped
(hopefully order-0, otherwise we'd likely be in trouble).
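[Editorial note: a minimal sketch of the pattern David describes for unuse_pte(), assuming the folio/page/vma/addr/rmap_flags locals from the mm/swapfile.c context quoted later in the thread; Barry's actual proposed diff appears further down and is authoritative.]

	if (pte_swp_exclusive(old_pte))
		rmap_flags |= RMAP_EXCLUSIVE;

	if (!folio_test_anon(folio))
		/* fresh swapcache folio that was never mapped: treat it as new */
		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
	else
		folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);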
On Thu, Jun 13, 2024 at 9:12 PM David Hildenbrand <david@redhat.com> wrote:
>
[...]
> Ah, yes. in unuse_pte(), you'll have to do the right thing if
> !folio_test(anon) before doing the folio_add_anon_rmap_pte().
>
> You might have a fresh anon folio in the swapcache that was never mapped
> (hopefully order-0, otherwise we'd likely be in trouble).

Yes. It is order-0:

[  261.214260][T11285] page last allocated via order 0, migratetype

Otherwise, we would have encountered this VM_WARN_ON_FOLIO?

__folio_add_anon_rmap()
{
	...
	VM_WARN_ON_FOLIO(folio_test_large(folio) &&
			 level != RMAP_LEVEL_PMD, folio);
	...
}

Given that nobody has ever reported this warning, I assume all callers
using folio_add_anon_rmap_pte(s) are right now safe to move to?

	if (!folio_test_anon(folio))
		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
	else
		folio_add_anon_rmap_pte(s)

Thanks
Barry
On 14.06.24 09:56, Barry Song wrote:
>
[...]
> Given that nobody has ever reported this warning, I assume all callers
> using folio_add_anon_rmap_pte(s) are right now safe to move to?

Yes, and we should likely add a VM_WARN_ON_ONCE() here that we have a
small folio if !anon. If that ever changes, we can assess the situation.

Only swap created "new" anon folios without properly calling the right
function so far, all other code handles that correctly.
On Fri, Jun 14, 2024 at 8:51 PM David Hildenbrand <david@redhat.com> wrote:
>
[...]
>
> > Given that nobody has ever reported this warning, I assume all callers
> > using folio_add_anon_rmap_pte(s) are right now safe to move to?
>
> Yes, and we should likely add a VM_WARN_ON_ONCE() here that we have a
> small folio if !anon. If that ever changes, we can assess the situation.
>

This patch actually has a WARN_ON for all !anon, so it extends the WARN
to small folios:

-	if (unlikely(!folio_test_anon(folio))) {
-		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
-		/*
-		 * For a PTE-mapped large folio, we only know that the single
-		 * PTE is exclusive. Further, __folio_set_anon() might not get
-		 * folio->index right when not given the address of the head
-		 * page.
-		 */
-		VM_WARN_ON_FOLIO(folio_test_large(folio) &&
-				 level != RMAP_LEVEL_PMD, folio);
-		__folio_set_anon(folio, vma, address,
-				 !!(flags & RMAP_EXCLUSIVE));
-	} else if (likely(!folio_test_ksm(folio))) {
+	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+

> Only swap created "new" anon folios without properly calling the right
> function so far, all other code handles that correctly.
>
On Fri, Jun 14, 2024 at 8:56 PM Barry Song <21cnbao@gmail.com> wrote:
>
[...]
> > Only swap created "new" anon folios without properly calling the right
> > function so far, all other code handles that correctly.

as I assume patch2/3 should have moved all !anon to
folio_add_new_anon_rmap():

diff --git a/mm/ksm.c b/mm/ksm.c
index d2641bc2efc9..c913ba8c37eb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1420,7 +1420,14 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	 */
 	if (!is_zero_pfn(page_to_pfn(kpage))) {
 		folio_get(kfolio);
-		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
+		/*
+		 * We currently ensure that new folios cannot be partially
+		 * exclusive: they are either fully exclusive or fully shared.
+		 */
+		if (!folio_test_anon(kfolio))
+			folio_add_new_anon_rmap(kfolio, vma, addr, RMAP_NONE);
+		else
+			folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
 		newpte = mk_pte(kpage, vma->vm_page_prot);
 	} else {
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index 1f24ecdafe05..16797cc22058 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4339,6 +4339,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(folio != swapcache && swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
+	} else if (!folio_test_anon(folio)) {
+		/*
+		 * We currently ensure that new folios cannot be partially
+		 * exclusive: they are either fully exclusive or fully shared.
+		 */
+		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
 					 rmap_flags);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ae1d2700f6a3..ac53e46de48a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1908,8 +1908,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
 		if (pte_swp_exclusive(old_pte))
 			rmap_flags |= RMAP_EXCLUSIVE;
-
-		folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
+		/*
+		 * We currently ensure that new folios cannot be partially
+		 * exclusive: they are either fully exclusive or fully shared.
+		 */
+		if (!folio_test_anon(folio))
+			folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+		else
+			folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
 	} else { /* ksm created a completely new copy */
 		folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
On 14.06.24 10:58, Barry Song wrote:
>
[...]
>> this patch actually has a WARN_ON for all !anon, so it extends the WARN
>> to small.

Not what I mean, see below:

[...]

> as I assume patch2/3 should have moved all !anon to
> folio_add_new_anon_rmap()
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index d2641bc2efc9..c913ba8c37eb 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1420,7 +1420,14 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	 */
>  	if (!is_zero_pfn(page_to_pfn(kpage))) {
>  		folio_get(kfolio);
> -		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
> +		/*
> +		 * We currently ensure that new folios cannot be partially
> +		 * exclusive: they are either fully exclusive or fully shared.
> +		 */
> +		if (!folio_test_anon(kfolio))
> +			folio_add_new_anon_rmap(kfolio, vma, addr, RMAP_NONE);
> +		else
> +			folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);

I don't think that is required? We are only working with anon folios.
Or were you able to trigger this? (which would be weird)

[...]

> -	folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
> +	/*
> +	 * We currently ensure that new folios cannot be partially
> +	 * exclusive: they are either fully exclusive or fully shared.
> +	 */
> +	if (!folio_test_anon(folio))

Here I suggest changing the comment to (and adding the VM_WARN_ON_ONCE):

	/*
	 * We currently only expect small !anon folios, for which we know
	 * that they are either fully exclusive or fully shared. If we ever
	 * get large folios here, we have to be careful.
	 */
	if (!folio_test_anon(folio)) {
		VM_WARN_ON_ONCE(folio_test_large(folio));
		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
	} else {
		...
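[Editorial note: putting the two messages together, a sketch of how the unuse_pte() hunk could look with David's suggested comment and VM_WARN_ON_ONCE() folded into Barry's proposed change; variable names follow the mm/swapfile.c context quoted above, and this is an illustration rather than the final posted patch.]

	if (pte_swp_exclusive(old_pte))
		rmap_flags |= RMAP_EXCLUSIVE;

	if (!folio_test_anon(folio)) {
		/*
		 * We currently only expect small !anon folios here, for which
		 * we know that they are either fully exclusive or fully
		 * shared. If we ever get large folios here, we have to be
		 * careful.
		 */
		VM_WARN_ON_ONCE(folio_test_large(folio));
		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
	} else {
		folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
	}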
On Fri, Jun 14, 2024 at 9:04 PM David Hildenbrand <david@redhat.com> wrote:
>
[...]
> > as I assume patch2/3 should have moved all !anon to
> > folio_add_new_anon_rmap()
> >
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index d2641bc2efc9..c913ba8c37eb 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1420,7 +1420,14 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >          */
> >         if (!is_zero_pfn(page_to_pfn(kpage))) {
> >                 folio_get(kfolio);
> > -               folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
> > +               /*
> > +                * We currently ensure that new folios cannot be partially
> > +                * exclusive: they are either fully exclusive or fully shared.
> > +                */
> > +               if (!folio_test_anon(kfolio))
> > +                       folio_add_new_anon_rmap(kfolio, vma, addr, RMAP_NONE);
> > +               else
> > +                       folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
>
> I don't think that is required? We are only working with anon folios. Or
> were you able to trigger this? (which would be weird)

I didn't trigger this, but I am not sure kfolio is always anon based on
the code context.

For page, it is 100% anon (otherwise "goto out"), but I am not quite sure
about kpage from the code context.

static int try_to_merge_one_page(struct vm_area_struct *vma,
				 struct page *page, struct page *kpage)
{
	pte_t orig_pte = __pte(0);
	int err = -EFAULT;

	if (page == kpage)			/* ksm page forked */
		return 0;

	if (!PageAnon(page))
		goto out;
	....
}

Then I saw this:

static int replace_page(struct vm_area_struct *vma, struct page *page,
			struct page *kpage, pte_t orig_pte)
{
	...
	VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
	VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
			kfolio);
}

If kfolio is always anon, we should have used
VM_BUG_ON_FOLIO(PageAnonExclusive(kpage), kfolio)
just like
VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
without "folio_test_anon(kfolio)".

So I lost my way.

>
> [...]
>
> > -       folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
> > +       /*
> > +        * We currently ensure that new folios cannot be partially
> > +        * exclusive: they are either fully exclusive or fully shared.
> > +        */
> > +       if (!folio_test_anon(folio))
>
> Here I suggest changing the comment to (and adding the VM_WARN_ON_ONCE):
>
> /*
>  * We currently only expect small !anon folios, for which we know that
>  * they are either fully exclusive or fully shared. If we ever get large
>  * folios here, we have to be careful.
>  */
> if (!folio_test_anon(folio)) {
> 	VM_WARN_ON_ONCE(folio_test_large(folio));
> 	folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
> } else {
> 	...

looks good to me.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
>> I don't think that is required? We are only working with anon folios. Or
>> were you able to trigger this? (which would be weird)
>
> I didn't trigger this, but I am not sure kfolio is always anon based on
> the code context.
>
> For page, it is 100% anon (otherwise "goto out"), but I am not quite
> sure about kpage from the code context.
>
> static int try_to_merge_one_page(struct vm_area_struct *vma,
>				 struct page *page, struct page *kpage)
> {
>	pte_t orig_pte = __pte(0);
>	int err = -EFAULT;
>
>	if (page == kpage)			/* ksm page forked */
>		return 0;
>
>	if (!PageAnon(page))
>		goto out;
>	....
> }
>
> Then I saw this:
>
> static int replace_page(struct vm_area_struct *vma, struct page *page,
>			struct page *kpage, pte_t orig_pte)
> {
>	...
>	VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
>	VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
>			kfolio);
> }
>
> If kfolio is always anon, we should have used
> VM_BUG_ON_FOLIO(PageAnonExclusive(kpage), kfolio)
> just like
> VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
> without "folio_test_anon(kfolio)".
>
> So I lost my way.

try_to_merge_one_page() is either called with a KSM page (anon) from
try_to_merge_with_ksm_page() or with the shared zeropage (!anon and must
never become anon) from cmp_and_merge_page().

So in replace_page(), we either have a KSM/anon page or the shared
zeropage.

We never updated the documentation of replace_page() to spell out that
"kpage" can also be the shared zeropage.

Note how replace_page() does not call folio_add_anon_rmap_pte() for the
shared zeropage.

If we had to craft a new anon page there, things would be going terribly
wrong.

So no, this (!anon -> anon) transition must not happen, and if it ever did,
it would be a real bug and your check in folio_add_anon_rmap_pte() would
catch it.
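To make the two call paths described above easier to see, here is a heavily condensed sketch of the relevant mm/ksm.c flow as discussed in this thread. Locking, error handling and most arguments are elided, and the code is reproduced from memory, so treat it as an illustration of the argument rather than an exact copy of the source:

/* Path 1: merging into an existing KSM page - kpage is a KSM (anon) page. */
static int try_to_merge_with_ksm_page(struct ksm_rmap_item *rmap_item,
				      struct page *page, struct page *kpage)
{
	...
	err = try_to_merge_one_page(vma, page, kpage);
	...
}

/*
 * Path 2: in cmp_and_merge_page(), a zero-filled page is merged with the
 * shared zeropage when ksm_use_zero_pages is enabled - kpage is the
 * zeropage and must never become anon.
 */
err = try_to_merge_one_page(vma, page, ZERO_PAGE(rmap_item->address));

/* Hence the branching in replace_page(): */
if (!is_zero_pfn(page_to_pfn(kpage))) {
	/* kfolio is a KSM page here, i.e. already anon */
	folio_get(kfolio);
	folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
	newpte = mk_pte(kpage, vma->vm_page_prot);
} else {
	/* shared zeropage: mapped as a special pte, no anon rmap taken */
	newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage),
						   vma->vm_page_prot)));
}

In other words, folio_add_anon_rmap_pte() is only reached with an already-anon kfolio, which is why the extra !anon handling in replace_page() is unnecessary.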
On Fri, Jun 14, 2024 at 11:10 PM David Hildenbrand <david@redhat.com> wrote: > > >> I don't think that is required? We are only working with anon folios. Or > >> were you able to trigger this? (which would be weird) > > > > I didn't trigger this. but I am not sure if kfifo is always anon based on > > the code context. > > > > for page, it is 100% anon(otherwise "goto out"), but I am not quite > > sure about kpage > > by the code context. > > > > static int try_to_merge_one_page(struct vm_area_struct *vma, > > struct page *page, struct page *kpage) > > { > > pte_t orig_pte = __pte(0); > > int err = -EFAULT; > > > > if (page == kpage) /* ksm page forked */ > > return 0; > > > > if (!PageAnon(page)) > > goto out; > > .... > > } > > > > Then I saw this > > > > static int replace_page(struct vm_area_struct *vma, struct page *page, > > struct page *kpage, pte_t orig_pte) > > { > > ... > > VM_BUG_ON_PAGE(PageAnonExclusive(page), page); > > VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage), > > kfolio); > > } > > > > If kfolio is always anon, we should have used > > VM_BUG_ON_FOLIO(PageAnonExclusive(kpage), folio) > > just like > > VM_BUG_ON_PAGE(PageAnonExclusive(page), page); > > without "folio_test_anon(kfolio)". > > > > So I lost my way. > > try_to_merge_one_page() is either called with a KSM page (anon) from > try_to_merge_with_ksm_page() or with the shared zeropage (!anon and must > never become anon) from cmp_and_merge_page(). > > So in replace_page(), we either have an ksm/anon page or the shared > zeropage. > > We never updated the documentation of replace_page() to spell out that > "kpage" can also be the shared zeropage. > > Note how replace_page() calls folio_add_anon_rmap_pte() not for the > shared zeropage. > > If we would have to craft a new anon page things would be going terribly > wrong. > > So not, this (!anon -> anon) must not happen and if it were to happen, > it would be a real bug and your check in folio_add_anon_rmap_pte() > would catch it. Thanks very much for the explanation. I wonder if the below can help improve the doc. If yes, I may add a separate patch for it in v2. diff --git a/mm/ksm.c b/mm/ksm.c index d2641bc2efc9..56b10265e617 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1411,14 +1411,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, goto out_mn; } VM_BUG_ON_PAGE(PageAnonExclusive(page), page); - VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage), - kfolio); - /* * No need to check ksm_use_zero_pages here: we can only have a * zero_page here if ksm_use_zero_pages was enabled already. */ if (!is_zero_pfn(page_to_pfn(kpage))) { + VM_BUG_ON_FOLIO(!folio_test_anon(kfolio) || PageAnonExclusive(kpage), + kfolio); folio_get(kfolio); folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE); newpte = mk_pte(kpage, vma->vm_page_prot); > > -- > Cheers, > > David / dhildenb > Thanks Barry
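As a side-by-side reading of that ksm.c documentation tweak (an editorial sketch only, the actual change is the diff quoted above): moving the assertion into the !is_zero_pfn() branch also lets it be strengthened, because in that branch kfolio must already be a KSM (anon) page.

	/*
	 * Before: checked outside the branch, so it only guards against an
	 * anon kfolio that is unexpectedly PageAnonExclusive; a !anon kfolio
	 * (the shared zeropage) passes silently.
	 */
	VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
			kfolio);

	/*
	 * After: checked inside the !is_zero_pfn() branch, where kfolio must
	 * be anon, so both a !anon kfolio and an exclusive kpage are treated
	 * as bugs.
	 */
	if (!is_zero_pfn(page_to_pfn(kpage))) {
		VM_BUG_ON_FOLIO(!folio_test_anon(kfolio) ||
				PageAnonExclusive(kpage), kfolio);
		...
	}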
diff --git a/mm/rmap.c b/mm/rmap.c
index e612d999811a..e84c706c8241 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1299,21 +1299,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 
 	nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped);
 
-	if (unlikely(!folio_test_anon(folio))) {
-		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
-		/*
-		 * For a PTE-mapped large folio, we only know that the single
-		 * PTE is exclusive. Further, __folio_set_anon() might not get
-		 * folio->index right when not given the address of the head
-		 * page.
-		 */
-		VM_WARN_ON_FOLIO(folio_test_large(folio) &&
-				 level != RMAP_LEVEL_PMD, folio);
-		__folio_set_anon(folio, vma, address,
-				 !!(flags & RMAP_EXCLUSIVE));
-	} else if (likely(!folio_test_ksm(folio))) {
+	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+
+	if (likely(!folio_test_ksm(folio)))
 		__page_check_anon_rmap(folio, page, vma, address);
-	}
 
 	__folio_mod_stat(folio, nr, nr_pmdmapped);