| Message ID | 20231207161211.2374093-3-ryan.roberts@arm.com (mailing list archive) |
|---|---|
| State | New |
| Series | Multi-size THP for anonymous memory |
On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote: > In preparation for supporting anonymous multi-size THP, improve > folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be > passed to it. In this case, all contained pages are accounted using the > order-0 folio (or base page) scheme. > > Reviewed-by: Yu Zhao <yuzhao@google.com> > Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> > Reviewed-by: David Hildenbrand <david@redhat.com> > Reviewed-by: Barry Song <v-songbaohua@oppo.com> > Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> > Tested-by: John Hubbard <jhubbard@nvidia.com> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > mm/rmap.c | 28 ++++++++++++++++++++-------- > 1 file changed, 20 insertions(+), 8 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 2a1e45e6419f..846fc79f3ca9 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, > * This means the inc-and-test can be bypassed. > * The folio does not have to be locked. > * > - * If the folio is large, it is accounted as a THP. As the folio > + * If the folio is pmd-mappable, it is accounted as a THP. As the folio > * is new, it's assumed to be mapped exclusively by a single process. > */ > void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, > unsigned long address) > { > - int nr; > + int nr = folio_nr_pages(folio); > > - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); > + VM_BUG_ON_VMA(address < vma->vm_start || > + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); hi, I'm hitting this bug (console output below) with adding uprobe on simple program like: $ cat up.c int main(void) { return 0; } # bpftrace -e 'uprobe:/home/jolsa/up:_start {}' $ ./up it's on top of current linus tree master: 052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat before this patch it seems to work, I can send my .config if needed thanks, jirka --- [ 147.562264][ T719] vma ffff888166134e68 start 0000000000401000 end 0000000000402000 mm ffff88817cf2a840 [ 147.562264][ T719] prot 25 anon_vma ffff88817b6818e0 vm_ops ffffffff83475ec0 [ 147.562264][ T719] pgoff 1 file ffff888168d01240 private_data 0000000000000000 [ 147.562264][ T719] flags: 0x75(read|exec|mayread|maywrite|mayexec) [ 147.571660][ T719] ------------[ cut here ]------------ [ 147.572319][ T719] kernel BUG at mm/rmap.c:1412! 
[ 147.572825][ T719] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN NOPTI [ 147.573792][ T719] CPU: 3 PID: 719 Comm: up Not tainted 6.7.0+ #273 faf755a6fc44b54f4ff1c207411fbd9df5a3968d [ 147.574831][ T719] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014 [ 147.575652][ T719] RIP: 0010:folio_add_new_anon_rmap+0x2cc/0x8f0 [ 147.576164][ T719] Code: c7 c6 20 d2 38 83 48 89 df e8 c0 4a fb ff 0f 0b 48 89 ef e8 16 ab 08 00 4c 3b 65 00 0f 83 cd fd ff ff 48 89 ef e8 34 44 fb ff f7 c3 ff 0f 00 00 0f 85 de fe ff ff be 08 00 00 00 48 89 df [ 147.577609][ T719] RSP: 0018:ffff88815759f568 EFLAGS: 00010286 [ 147.578140][ T719] RAX: 00000000000000fa RBX: ffffea00053eef40 RCX: 0000000000000000 [ 147.578825][ T719] RDX: 0000000000000000 RSI: ffffffff81289b44 RDI: ffffffff872ff1a0 [ 147.579513][ T719] RBP: ffff888166134e68 R08: 0000000000000001 R09: ffffed102aeb3e5f [ 147.580198][ T719] R10: ffff88815759f2ff R11: 0000000000000000 R12: 0000000000401020 [ 147.580886][ T719] R13: 0000000000000001 R14: ffffea00053eef40 R15: ffffea00053eef40 [ 147.581566][ T719] FS: 0000000000000000(0000) GS:ffff88842ce00000(0000) knlGS:0000000000000000 [ 147.582263][ T719] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 147.582724][ T719] CR2: 00005634f0ffe880 CR3: 000000010c0f8002 CR4: 0000000000770ef0 [ 147.583304][ T719] PKRU: 55555554 [ 147.583586][ T719] Call Trace: [ 147.583869][ T719] <TASK> [ 147.584122][ T719] ? die+0x32/0x80 [ 147.584422][ T719] ? do_trap+0x12f/0x220 [ 147.584800][ T719] ? folio_add_new_anon_rmap+0x2cc/0x8f0 [ 147.585411][ T719] ? folio_add_new_anon_rmap+0x2cc/0x8f0 [ 147.585891][ T719] ? do_error_trap+0xa7/0x160 [ 147.586349][ T719] ? folio_add_new_anon_rmap+0x2cc/0x8f0 [ 147.586879][ T719] ? handle_invalid_op+0x2c/0x40 [ 147.587354][ T719] ? folio_add_new_anon_rmap+0x2cc/0x8f0 [ 147.587892][ T719] ? exc_invalid_op+0x29/0x40 [ 147.588352][ T719] ? asm_exc_invalid_op+0x16/0x20 [ 147.588847][ T719] ? preempt_count_sub+0x14/0xc0 [ 147.589335][ T719] ? folio_add_new_anon_rmap+0x2cc/0x8f0 [ 147.589899][ T719] ? folio_add_new_anon_rmap+0x2cc/0x8f0 [ 147.590437][ T719] __replace_page+0x364/0xb40 [ 147.590918][ T719] ? __pfx___replace_page+0x10/0x10 [ 147.591412][ T719] ? __pfx_lock_release+0x10/0x10 [ 147.591910][ T719] ? do_raw_spin_trylock+0xcd/0x120 [ 147.592555][ T719] ? __pfx_vma_alloc_folio+0x10/0x10 [ 147.593095][ T719] ? preempt_count_add+0x6e/0xc0 [ 147.593612][ T719] ? preempt_count_sub+0x14/0xc0 [ 147.594143][ T719] uprobe_write_opcode+0x3f6/0x820 [ 147.594616][ T719] ? __pfx_uprobe_write_opcode+0x10/0x10 [ 147.595125][ T719] ? preempt_count_sub+0x14/0xc0 [ 147.595551][ T719] ? up_write+0x125/0x2f0 [ 147.596014][ T719] install_breakpoint.isra.0+0xe5/0x470 [ 147.596635][ T719] uprobe_mmap+0x37b/0x8d0 [ 147.598111][ T719] ? __pfx_uprobe_mmap+0x10/0x10 [ 147.598561][ T719] mmap_region+0xa02/0x1220 [ 147.599013][ T719] ? rcu_is_watching+0x34/0x60 [ 147.599602][ T719] ? lock_acquired+0xbf/0x670 [ 147.600024][ T719] ? __pfx_mmap_region+0x10/0x10 [ 147.600458][ T719] ? security_mmap_addr+0x20/0x60 [ 147.600909][ T719] ? get_unmapped_area+0x169/0x1f0 [ 147.601353][ T719] do_mmap+0x425/0x660 [ 147.601739][ T719] vm_mmap_pgoff+0x15e/0x2b0 [ 147.602156][ T719] ? __pfx_vm_mmap_pgoff+0x10/0x10 [ 147.602597][ T719] ? __pfx_get_random_u64+0x10/0x10 [ 147.603059][ T719] elf_load+0xdc/0x3a0 [ 147.603433][ T719] load_elf_binary+0x6f6/0x22b0 [ 147.603889][ T719] ? __pfx_load_elf_binary+0x10/0x10 [ 147.604385][ T719] ? 
__pfx_lock_acquired+0x10/0x10 [ 147.604952][ T719] bprm_execve+0x494/0xc80 [ 147.605379][ T719] ? __pfx_bprm_execve+0x10/0x10 [ 147.605843][ T719] do_execveat_common.isra.0+0x24f/0x330 [ 147.606358][ T719] __x64_sys_execve+0x52/0x60 [ 147.606797][ T719] do_syscall_64+0x87/0x1b0 [ 147.607148][ T719] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 147.607630][ T719] RIP: 0033:0x7faa9b0bdb4b [ 147.608732][ T719] Code: Unable to access opcode bytes at 0x7faa9b0bdb21. [ 147.609318][ T719] RSP: 002b:00007ffff9921708 EFLAGS: 00000246 ORIG_RAX: 000000000000003b [ 147.610046][ T719] RAX: ffffffffffffffda RBX: 00005634f1964990 RCX: 00007faa9b0bdb4b [ 147.610727][ T719] RDX: 00005634f1966d20 RSI: 00005634f19612c0 RDI: 00005634f1964990 [ 147.611528][ T719] RBP: 00007ffff9921800 R08: 0000000000000001 R09: 0000000000000001 [ 147.612192][ T719] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000ffffffff [ 147.612829][ T719] R13: 00005634f1964990 R14: 00005634f19612c0 R15: 00005634f1966d20 [ 147.613479][ T719] </TASK> [ 147.613787][ T719] Modules linked in: intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel kvm_intel rapl iTiTCO_vendor_support i2c_i801 i2c_smbus lpc_ich drm loop drm_panel_orientation_quirks zram [ 147.615630][ T719] ---[ end trace 0000000000000000 ]--- [ 147.616253][ T719] RIP: 0010:folio_add_new_anon_rmap+0x2cc/0x8f0 [ 147.616714][ T719] Code: c7 c6 20 d2 38 83 48 89 df e8 c0 4a fb ff 0f 0b 48 89 ef e8 16 ab 08 00 4c 3b 65 00 0f 83 cd fd ff ff 48 89 ef e8 34 44 fb ff f7 c3 ff 0f 00 00 0f 85 de fe ff ff be 08 00 00 00 48 89 df [ 147.618160][ T719] RSP: 0018:ffff88815759f568 EFLAGS: 00010286 [ 147.618594][ T719] RAX: 00000000000000fa RBX: ffffea00053eef40 RCX: 0000000000000000 [ 147.619318][ T719] RDX: 0000000000000000 RSI: ffffffff81289b44 RDI: ffffffff872ff1a0 [ 147.619930][ T719] RBP: ffff888166134e68 R08: 0000000000000001 R09: ffffed102aeb3e5f [ 147.620577][ T719] R10: ffff88815759f2ff R11: 0000000000000000 R12: 0000000000401020 [ 147.621236][ T719] R13: 0000000000000001 R14: ffffea00053eef40 R15: ffffea00053eef40 [ 147.621894][ T719] FS: 0000000000000000(0000) GS:ffff88842ce00000(0000) knlGS:0000000000000000 [ 147.622596][ T719] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 147.623186][ T719] CR2: 00007faa9b0bdb21 CR3: 000000010c0f8002 CR4: 0000000000770ef0 [ 147.623960][ T719] PKRU: 55555554 [ 147.624331][ T719] note: up[719] exited with preempt_count 1 [ 147.624953][ T719] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49 [ 147.625898][ T719] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 719, name: up [ 147.626672][ T719] preempt_count: 0, expected: 0 [ 147.627945][ T719] RCU nest depth: 1, expected: 0 [ 147.628410][ T719] INFO: lockdep is turned off. [ 147.628898][ T719] CPU: 3 PID: 719 Comm: up Tainted: G D 6.7.0+ #273 faf755a6fc44b54f4ff1c207411fbd9df5a3968d [ 147.629954][ T719] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014 [ 147.630838][ T719] Call Trace: [ 147.631185][ T719] <TASK> [ 147.631514][ T719] dump_stack_lvl+0x15d/0x180 [ 147.631973][ T719] __might_resched+0x270/0x3b0 [ 147.636533][ T719] exit_signals+0x1d/0x460 [ 147.636947][ T719] do_exit+0x27f/0x13b0 [ 147.637368][ T719] ? __pfx__printk+0x10/0x10 [ 147.637827][ T719] ? 
__pfx_do_exit+0x10/0x10 [ 147.638238][ T719] make_task_dead+0xd9/0x240 [ 147.638610][ T719] rewind_stack_and_make_dead+0x17/0x20 [ 147.639064][ T719] RIP: 0033:0x7faa9b0bdb4b [ 147.639445][ T719] Code: Unable to access opcode bytes at 0x7faa9b0bdb21. [ 147.640015][ T719] RSP: 002b:00007ffff9921708 EFLAGS: 00000246 ORIG_RAX: 000000000000003b [ 147.640694][ T719] RAX: ffffffffffffffda RBX: 00005634f1964990 RCX: 00007faa9b0bdb4b [ 147.641407][ T719] RDX: 00005634f1966d20 RSI: 00005634f19612c0 RDI: 00005634f1964990 [ 147.642133][ T719] RBP: 00007ffff9921800 R08: 0000000000000001 R09: 0000000000000001 [ 147.642911][ T719] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000ffffffff [ 147.643685][ T719] R13: 00005634f1964990 R14: 00005634f19612c0 R15: 00005634f1966d20 [ 147.644454][ T719] </TASK> [ 147.644819][ T719] ------------[ cut here ]------------
On 13.01.24 23:42, Jiri Olsa wrote: > On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote: >> In preparation for supporting anonymous multi-size THP, improve >> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be >> passed to it. In this case, all contained pages are accounted using the >> order-0 folio (or base page) scheme. >> >> Reviewed-by: Yu Zhao <yuzhao@google.com> >> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> >> Reviewed-by: David Hildenbrand <david@redhat.com> >> Reviewed-by: Barry Song <v-songbaohua@oppo.com> >> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> >> Tested-by: John Hubbard <jhubbard@nvidia.com> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/rmap.c | 28 ++++++++++++++++++++-------- >> 1 file changed, 20 insertions(+), 8 deletions(-) >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 2a1e45e6419f..846fc79f3ca9 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, >> * This means the inc-and-test can be bypassed. >> * The folio does not have to be locked. >> * >> - * If the folio is large, it is accounted as a THP. As the folio >> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio >> * is new, it's assumed to be mapped exclusively by a single process. >> */ >> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, >> unsigned long address) >> { >> - int nr; >> + int nr = folio_nr_pages(folio); >> >> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); >> + VM_BUG_ON_VMA(address < vma->vm_start || >> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); > > hi, > I'm hitting this bug (console output below) with adding uprobe > on simple program like: > > $ cat up.c > int main(void) > { > return 0; > } > > # bpftrace -e 'uprobe:/home/jolsa/up:_start {}' > > $ ./up > > it's on top of current linus tree master: > 052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat > > before this patch it seems to work, I can send my .config if needed bpf only inserts a small folio, so no magic there. It was: VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); And now it is VM_BUG_ON_VMA(address < vma->vm_start || address + (nr << PAGE_SHIFT) > vma->vm_end, vma); I think this change is sane. As long as the address is aligned to full pages (which it better should be) Staring at uprobe_write_opcode, likely vaddr isn't aligned ... Likely (hopefully) that is not an issue for __folio_set_anon(), because linear_page_index() will mask these bits off. Would the following change fix it for you? 
From c640a8363e47bc96965a35115a040b5f876c4320 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Sun, 14 Jan 2024 18:32:57 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/events/uprobes.c | 2 +-
 mm/rmap.c               | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 485bb0389b488..929e98c629652 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -537,7 +537,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
                 }
         }
 
-        ret = __replace_page(vma, vaddr, old_page, new_page);
+        ret = __replace_page(vma, vaddr & PAGE_MASK, old_page, new_page);
         if (new_page)
                 put_page(new_page);
 put_old:
diff --git a/mm/rmap.c b/mm/rmap.c
index f5d43edad529a..a903db4df6b97 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1408,6 +1408,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 {
         int nr = folio_nr_pages(folio);
 
+        VM_WARN_ON_FOLIO(!IS_ALIGNED(address, PAGE_SIZE), folio);
         VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
         VM_BUG_ON_VMA(address < vma->vm_start ||
                         address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
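The arithmetic behind the stricter check is easy to verify in isolation. The sketch below is plain user-space arithmetic (not kernel code): it plugs in the VMA bounds printed in the oops above and assumes the probe address was the unaligned 0x401020 seen in R12, with nr == 1 for uprobe's small folio. The old check passes, the new range check fires, and masking with PAGE_MASK, as the patch above does, makes it pass again.

```c
/*
 * Minimal user-space sketch of the check arithmetic only.
 * Assumptions: vm_start/vm_end come from the oops above, the unaligned
 * uprobe vaddr is taken to be 0x401020 (R12 in the oops), nr == 1.
 */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

int main(void)
{
	uint64_t vm_start = 0x401000, vm_end = 0x402000;
	uint64_t address  = 0x401020;	/* assumed unaligned uprobe vaddr */
	int nr = 1;			/* small folio */

	int old_bug = address < vm_start || address >= vm_end;
	int new_bug = address < vm_start ||
		      address + ((uint64_t)nr << PAGE_SHIFT) > vm_end;

	printf("old check fires: %d\n", old_bug);	/* prints 0 */
	printf("new check fires: %d\n", new_bug);	/* prints 1 */
	printf("page aligned:    %d\n",
	       (address & (PAGE_SIZE - 1)) == 0);	/* prints 0 */
	/* With address & PAGE_MASK (0x401000), the new check passes again. */
	printf("masked address:  %#lx\n", (unsigned long)(address & PAGE_MASK));
	return 0;
}
```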
On Sun, Jan 14, 2024 at 06:33:56PM +0100, David Hildenbrand wrote: > On 13.01.24 23:42, Jiri Olsa wrote: > > On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote: > > > In preparation for supporting anonymous multi-size THP, improve > > > folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be > > > passed to it. In this case, all contained pages are accounted using the > > > order-0 folio (or base page) scheme. > > > > > > Reviewed-by: Yu Zhao <yuzhao@google.com> > > > Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> > > > Reviewed-by: David Hildenbrand <david@redhat.com> > > > Reviewed-by: Barry Song <v-songbaohua@oppo.com> > > > Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> > > > Tested-by: John Hubbard <jhubbard@nvidia.com> > > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > > > --- > > > mm/rmap.c | 28 ++++++++++++++++++++-------- > > > 1 file changed, 20 insertions(+), 8 deletions(-) > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > index 2a1e45e6419f..846fc79f3ca9 100644 > > > --- a/mm/rmap.c > > > +++ b/mm/rmap.c > > > @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, > > > * This means the inc-and-test can be bypassed. > > > * The folio does not have to be locked. > > > * > > > - * If the folio is large, it is accounted as a THP. As the folio > > > + * If the folio is pmd-mappable, it is accounted as a THP. As the folio > > > * is new, it's assumed to be mapped exclusively by a single process. > > > */ > > > void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, > > > unsigned long address) > > > { > > > - int nr; > > > + int nr = folio_nr_pages(folio); > > > - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); > > > + VM_BUG_ON_VMA(address < vma->vm_start || > > > + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); > > > > hi, > > I'm hitting this bug (console output below) with adding uprobe > > on simple program like: > > > > $ cat up.c > > int main(void) > > { > > return 0; > > } > > > > # bpftrace -e 'uprobe:/home/jolsa/up:_start {}' > > > > $ ./up > > > > it's on top of current linus tree master: > > 052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat > > > > before this patch it seems to work, I can send my .config if needed > > bpf only inserts a small folio, so no magic there. > > It was: > VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); > And now it is > VM_BUG_ON_VMA(address < vma->vm_start || address + (nr << PAGE_SHIFT) > vma->vm_end, vma); > > I think this change is sane. As long as the address is aligned to full pages > (which it better should be) > > Staring at uprobe_write_opcode, likely vaddr isn't aligned ... > > Likely (hopefully) that is not an issue for __folio_set_anon(), because linear_page_index() > will mask these bits off. > > > Would the following change fix it for you? 
great, that fixes it for me, you can add my Tested-by: Jiri Olsa <jolsa@kernel.org> thanks, jirka > > From c640a8363e47bc96965a35115a040b5f876c4320 Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Sun, 14 Jan 2024 18:32:57 +0100 > Subject: [PATCH] tmp > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > kernel/events/uprobes.c | 2 +- > mm/rmap.c | 1 + > 2 files changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 485bb0389b488..929e98c629652 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -537,7 +537,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, > } > } > - ret = __replace_page(vma, vaddr, old_page, new_page); > + ret = __replace_page(vma, vaddr & PAGE_MASK, old_page, new_page); > if (new_page) > put_page(new_page); > put_old: > diff --git a/mm/rmap.c b/mm/rmap.c > index f5d43edad529a..a903db4df6b97 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1408,6 +1408,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, > { > int nr = folio_nr_pages(folio); > + VM_WARN_ON_FOLIO(!IS_ALIGNED(address, PAGE_SIZE), folio); > VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); > VM_BUG_ON_VMA(address < vma->vm_start || > address + (nr << PAGE_SHIFT) > vma->vm_end, vma); > -- > 2.43.0 > > > > -- > Cheers, > > David / dhildenb >
On 14/01/2024 20:55, Jiri Olsa wrote: > On Sun, Jan 14, 2024 at 06:33:56PM +0100, David Hildenbrand wrote: >> On 13.01.24 23:42, Jiri Olsa wrote: >>> On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote: >>>> In preparation for supporting anonymous multi-size THP, improve >>>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be >>>> passed to it. In this case, all contained pages are accounted using the >>>> order-0 folio (or base page) scheme. >>>> >>>> Reviewed-by: Yu Zhao <yuzhao@google.com> >>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> >>>> Reviewed-by: David Hildenbrand <david@redhat.com> >>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com> >>>> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>> Tested-by: John Hubbard <jhubbard@nvidia.com> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>> --- >>>> mm/rmap.c | 28 ++++++++++++++++++++-------- >>>> 1 file changed, 20 insertions(+), 8 deletions(-) >>>> >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index 2a1e45e6419f..846fc79f3ca9 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, >>>> * This means the inc-and-test can be bypassed. >>>> * The folio does not have to be locked. >>>> * >>>> - * If the folio is large, it is accounted as a THP. As the folio >>>> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio >>>> * is new, it's assumed to be mapped exclusively by a single process. >>>> */ >>>> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, >>>> unsigned long address) >>>> { >>>> - int nr; >>>> + int nr = folio_nr_pages(folio); >>>> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); >>>> + VM_BUG_ON_VMA(address < vma->vm_start || >>>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); >>> >>> hi, >>> I'm hitting this bug (console output below) with adding uprobe >>> on simple program like: >>> >>> $ cat up.c >>> int main(void) >>> { >>> return 0; >>> } >>> >>> # bpftrace -e 'uprobe:/home/jolsa/up:_start {}' >>> >>> $ ./up >>> >>> it's on top of current linus tree master: >>> 052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat >>> >>> before this patch it seems to work, I can send my .config if needed Thanks for the bug report! >> >> bpf only inserts a small folio, so no magic there. >> >> It was: >> VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); >> And now it is >> VM_BUG_ON_VMA(address < vma->vm_start || address + (nr << PAGE_SHIFT) > vma->vm_end, vma); >> >> I think this change is sane. As long as the address is aligned to full pages >> (which it better should be) >> >> Staring at uprobe_write_opcode, likely vaddr isn't aligned ... >> >> Likely (hopefully) that is not an issue for __folio_set_anon(), because linear_page_index() >> will mask these bits off. >> >> >> Would the following change fix it for you? And thanks for fixing my mess so quickly, David. FWIW, I agree with your diagnosis. One small suggestion below. 
> > great, that fixes it for me, you can add my > > Tested-by: Jiri Olsa <jolsa@kernel.org> > > thanks, > jirka > >> >> From c640a8363e47bc96965a35115a040b5f876c4320 Mon Sep 17 00:00:00 2001 >> From: David Hildenbrand <david@redhat.com> >> Date: Sun, 14 Jan 2024 18:32:57 +0100 >> Subject: [PATCH] tmp >> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> kernel/events/uprobes.c | 2 +- >> mm/rmap.c | 1 + >> 2 files changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c >> index 485bb0389b488..929e98c629652 100644 >> --- a/kernel/events/uprobes.c >> +++ b/kernel/events/uprobes.c >> @@ -537,7 +537,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, >> } >> } >> - ret = __replace_page(vma, vaddr, old_page, new_page); >> + ret = __replace_page(vma, vaddr & PAGE_MASK, old_page, new_page); >> if (new_page) >> put_page(new_page); >> put_old: >> diff --git a/mm/rmap.c b/mm/rmap.c >> index f5d43edad529a..a903db4df6b97 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1408,6 +1408,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, >> { >> int nr = folio_nr_pages(folio); >> + VM_WARN_ON_FOLIO(!IS_ALIGNED(address, PAGE_SIZE), folio); nit: Is it worth also adding this to __folio_add_anon_rmap() so that folio_add_anon_rmap_ptes() and folio_add_anon_rmap_pmd() also benefit? Regardless: Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> >> VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); >> VM_BUG_ON_VMA(address < vma->vm_start || >> address + (nr << PAGE_SHIFT) > vma->vm_end, vma); >> -- >> 2.43.0 >> >> >> >> -- >> Cheers, >> >> David / dhildenb >>
>>> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c >>> index 485bb0389b488..929e98c629652 100644 >>> --- a/kernel/events/uprobes.c >>> +++ b/kernel/events/uprobes.c >>> @@ -537,7 +537,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, >>> } >>> } >>> - ret = __replace_page(vma, vaddr, old_page, new_page); >>> + ret = __replace_page(vma, vaddr & PAGE_MASK, old_page, new_page); >>> if (new_page) >>> put_page(new_page); >>> put_old: >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index f5d43edad529a..a903db4df6b97 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1408,6 +1408,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, >>> { >>> int nr = folio_nr_pages(folio); >>> + VM_WARN_ON_FOLIO(!IS_ALIGNED(address, PAGE_SIZE), folio); > > nit: Is it worth also adding this to __folio_add_anon_rmap() so that > folio_add_anon_rmap_ptes() and folio_add_anon_rmap_pmd() also benefit? > Yes, same thoughts. Just included it so we would catch if still something goes wrong here. I'll split that change out either way. > Regardless: > > Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Thanks!
On Wed, Jan 24, 2024 at 12:15:52PM +0100, Sven Schnelle wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > > > On 14/01/2024 20:55, Jiri Olsa wrote: > >> On Sun, Jan 14, 2024 at 06:33:56PM +0100, David Hildenbrand wrote: > >>> On 13.01.24 23:42, Jiri Olsa wrote: > >>>> On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote: > >>>>> In preparation for supporting anonymous multi-size THP, improve > >>>>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be > >>>>> passed to it. In this case, all contained pages are accounted using the > >>>>> order-0 folio (or base page) scheme. > >>>>> > >>>>> Reviewed-by: Yu Zhao <yuzhao@google.com> > >>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> > >>>>> Reviewed-by: David Hildenbrand <david@redhat.com> > >>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com> > >>>>> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> > >>>>> Tested-by: John Hubbard <jhubbard@nvidia.com> > >>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > >>>>> --- > >>>>> mm/rmap.c | 28 ++++++++++++++++++++-------- > >>>>> 1 file changed, 20 insertions(+), 8 deletions(-) > >>>>> > >>>>> diff --git a/mm/rmap.c b/mm/rmap.c > >>>>> index 2a1e45e6419f..846fc79f3ca9 100644 > >>>>> --- a/mm/rmap.c > >>>>> +++ b/mm/rmap.c > >>>>> @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, > >>>>> * This means the inc-and-test can be bypassed. > >>>>> * The folio does not have to be locked. > >>>>> * > >>>>> - * If the folio is large, it is accounted as a THP. As the folio > >>>>> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio > >>>>> * is new, it's assumed to be mapped exclusively by a single process. > >>>>> */ > >>>>> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, > >>>>> unsigned long address) > >>>>> { > >>>>> - int nr; > >>>>> + int nr = folio_nr_pages(folio); > >>>>> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); > >>>>> + VM_BUG_ON_VMA(address < vma->vm_start || > >>>>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); > >>>> > >>>> hi, > >>>> I'm hitting this bug (console output below) with adding uprobe > >>>> on simple program like: > >>>> > >>>> $ cat up.c > >>>> int main(void) > >>>> { > >>>> return 0; > >>>> } > >>>> > >>>> # bpftrace -e 'uprobe:/home/jolsa/up:_start {}' > >>>> > >>>> $ ./up > >>>> > >>>> it's on top of current linus tree master: > >>>> 052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat > >>>> > >>>> before this patch it seems to work, I can send my .config if needed > > > > Thanks for the bug report! > > I just hit the same bug in our CI, but can't find the fix in -next. Is > this in the queue somewhere? we hit it as well, but I can see the fix in linux-next/master 4c137bc28064 uprobes: use pagesize-aligned virtual address when replacing pages jirka
On 24/01/2024 11:19, Jiri Olsa wrote: > On Wed, Jan 24, 2024 at 12:15:52PM +0100, Sven Schnelle wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> On 14/01/2024 20:55, Jiri Olsa wrote: >>>> On Sun, Jan 14, 2024 at 06:33:56PM +0100, David Hildenbrand wrote: >>>>> On 13.01.24 23:42, Jiri Olsa wrote: >>>>>> On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote: >>>>>>> In preparation for supporting anonymous multi-size THP, improve >>>>>>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be >>>>>>> passed to it. In this case, all contained pages are accounted using the >>>>>>> order-0 folio (or base page) scheme. >>>>>>> >>>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com> >>>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> >>>>>>> Reviewed-by: David Hildenbrand <david@redhat.com> >>>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com> >>>>>>> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>>>> Tested-by: John Hubbard <jhubbard@nvidia.com> >>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>>>>> --- >>>>>>> mm/rmap.c | 28 ++++++++++++++++++++-------- >>>>>>> 1 file changed, 20 insertions(+), 8 deletions(-) >>>>>>> >>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>> index 2a1e45e6419f..846fc79f3ca9 100644 >>>>>>> --- a/mm/rmap.c >>>>>>> +++ b/mm/rmap.c >>>>>>> @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, >>>>>>> * This means the inc-and-test can be bypassed. >>>>>>> * The folio does not have to be locked. >>>>>>> * >>>>>>> - * If the folio is large, it is accounted as a THP. As the folio >>>>>>> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio >>>>>>> * is new, it's assumed to be mapped exclusively by a single process. >>>>>>> */ >>>>>>> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, >>>>>>> unsigned long address) >>>>>>> { >>>>>>> - int nr; >>>>>>> + int nr = folio_nr_pages(folio); >>>>>>> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); >>>>>>> + VM_BUG_ON_VMA(address < vma->vm_start || >>>>>>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); >>>>>> >>>>>> hi, >>>>>> I'm hitting this bug (console output below) with adding uprobe >>>>>> on simple program like: >>>>>> >>>>>> $ cat up.c >>>>>> int main(void) >>>>>> { >>>>>> return 0; >>>>>> } >>>>>> >>>>>> # bpftrace -e 'uprobe:/home/jolsa/up:_start {}' >>>>>> >>>>>> $ ./up >>>>>> >>>>>> it's on top of current linus tree master: >>>>>> 052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat >>>>>> >>>>>> before this patch it seems to work, I can send my .config if needed >>> >>> Thanks for the bug report! >> >> I just hit the same bug in our CI, but can't find the fix in -next. Is >> this in the queue somewhere? > > we hit it as well, but I can see the fix in linux-next/master > > 4c137bc28064 uprobes: use pagesize-aligned virtual address when replacing pages Yes that's the one. Just to confirm: you are still hitting the VM_BUG_ON despite having this change in your kernel? Could you please send over the full bug log? > > jirka
On 24/01/2024 12:06, Jiri Olsa wrote: > On Wed, Jan 24, 2024 at 12:02:53PM +0000, Ryan Roberts wrote: >> On 24/01/2024 11:19, Jiri Olsa wrote: >>> On Wed, Jan 24, 2024 at 12:15:52PM +0100, Sven Schnelle wrote: >>>> Ryan Roberts <ryan.roberts@arm.com> writes: >>>> >>>>> On 14/01/2024 20:55, Jiri Olsa wrote: >>>>>> On Sun, Jan 14, 2024 at 06:33:56PM +0100, David Hildenbrand wrote: >>>>>>> On 13.01.24 23:42, Jiri Olsa wrote: >>>>>>>> On Thu, Dec 07, 2023 at 04:12:03PM +0000, Ryan Roberts wrote: >>>>>>>>> In preparation for supporting anonymous multi-size THP, improve >>>>>>>>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be >>>>>>>>> passed to it. In this case, all contained pages are accounted using the >>>>>>>>> order-0 folio (or base page) scheme. >>>>>>>>> >>>>>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com> >>>>>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> >>>>>>>>> Reviewed-by: David Hildenbrand <david@redhat.com> >>>>>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com> >>>>>>>>> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> >>>>>>>>> Tested-by: John Hubbard <jhubbard@nvidia.com> >>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>>>>>>> --- >>>>>>>>> mm/rmap.c | 28 ++++++++++++++++++++-------- >>>>>>>>> 1 file changed, 20 insertions(+), 8 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>>>> index 2a1e45e6419f..846fc79f3ca9 100644 >>>>>>>>> --- a/mm/rmap.c >>>>>>>>> +++ b/mm/rmap.c >>>>>>>>> @@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, >>>>>>>>> * This means the inc-and-test can be bypassed. >>>>>>>>> * The folio does not have to be locked. >>>>>>>>> * >>>>>>>>> - * If the folio is large, it is accounted as a THP. As the folio >>>>>>>>> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio >>>>>>>>> * is new, it's assumed to be mapped exclusively by a single process. >>>>>>>>> */ >>>>>>>>> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, >>>>>>>>> unsigned long address) >>>>>>>>> { >>>>>>>>> - int nr; >>>>>>>>> + int nr = folio_nr_pages(folio); >>>>>>>>> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); >>>>>>>>> + VM_BUG_ON_VMA(address < vma->vm_start || >>>>>>>>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); >>>>>>>> >>>>>>>> hi, >>>>>>>> I'm hitting this bug (console output below) with adding uprobe >>>>>>>> on simple program like: >>>>>>>> >>>>>>>> $ cat up.c >>>>>>>> int main(void) >>>>>>>> { >>>>>>>> return 0; >>>>>>>> } >>>>>>>> >>>>>>>> # bpftrace -e 'uprobe:/home/jolsa/up:_start {}' >>>>>>>> >>>>>>>> $ ./up >>>>>>>> >>>>>>>> it's on top of current linus tree master: >>>>>>>> 052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat >>>>>>>> >>>>>>>> before this patch it seems to work, I can send my .config if needed >>>>> >>>>> Thanks for the bug report! >>>> >>>> I just hit the same bug in our CI, but can't find the fix in -next. Is >>>> this in the queue somewhere? >>> >>> we hit it as well, but I can see the fix in linux-next/master >>> >>> 4c137bc28064 uprobes: use pagesize-aligned virtual address when replacing pages >> >> Yes that's the one. Just to confirm: you are still hitting the VM_BUG_ON despite >> having this change in your kernel? Could you please send over the full bug log? > > ah sorry.. I meant the change fixes the problem for us, it just did not > yet propagate through the merge cycle into bpf trees.. 
but I can see it > in linux-next tree, so it's probably just matter of time OK great! How about you, Sven? Do you have this change in your kernel? Hopefully it should fix your problem. > > jirka
On 24/01/2024 12:28, Sven Schnelle wrote: > Hi Ryan, > > Ryan Roberts <ryan.roberts@arm.com> writes: > >>>>>>>>>> I'm hitting this bug (console output below) with adding uprobe >>>>>>>>>> on simple program like: >>>>>>>>>> >>>>>>>>>> $ cat up.c >>>>>>>>>> int main(void) >>>>>>>>>> { >>>>>>>>>> return 0; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> # bpftrace -e 'uprobe:/home/jolsa/up:_start {}' >>>>>>>>>> >>>>>>>>>> $ ./up >>>>>>>>>> >>>>>>>>>> it's on top of current linus tree master: >>>>>>>>>> 052d534373b7 Merge tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat >>>>>>>>>> >>>>>>>>>> before this patch it seems to work, I can send my .config if needed >>>>>>> >>>>>>> Thanks for the bug report! >>>>>> >>>>>> I just hit the same bug in our CI, but can't find the fix in -next. Is >>>>>> this in the queue somewhere? >>>>> >>>>> we hit it as well, but I can see the fix in linux-next/master >>>>> >>>>> 4c137bc28064 uprobes: use pagesize-aligned virtual address when replacing pages >>>> >>>> Yes that's the one. Just to confirm: you are still hitting the VM_BUG_ON despite >>>> having this change in your kernel? Could you please send over the full bug log? >>> >>> ah sorry.. I meant the change fixes the problem for us, it just did not >>> yet propagate through the merge cycle into bpf trees.. but I can see it >>> in linux-next tree, so it's probably just matter of time >> >> OK great! How about you, Sven? Do you have this change in your kernel? Hopefully >> it should fix your problem. > > Same here - the fix makes uprobes work again, i just didn't see it in > torvalds-master and neither in todays linux-next. But Jiri is right, > it's in linux-next/master. I just missed to find it there. So everything > should be ok. Great!
diff --git a/mm/rmap.c b/mm/rmap.c
index 2a1e45e6419f..846fc79f3ca9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1335,32 +1335,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
  * This means the inc-and-test can be bypassed.
  * The folio does not have to be locked.
  *
- * If the folio is large, it is accounted as a THP. As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP. As the folio
  * is new, it's assumed to be mapped exclusively by a single process.
  */
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
                 unsigned long address)
 {
-        int nr;
+        int nr = folio_nr_pages(folio);
 
-        VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+        VM_BUG_ON_VMA(address < vma->vm_start ||
+                        address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
         __folio_set_swapbacked(folio);
+        __folio_set_anon(folio, vma, address, true);
 
-        if (likely(!folio_test_pmd_mappable(folio))) {
+        if (likely(!folio_test_large(folio))) {
                 /* increment count (starts at -1) */
                 atomic_set(&folio->_mapcount, 0);
-                nr = 1;
+                SetPageAnonExclusive(&folio->page);
+        } else if (!folio_test_pmd_mappable(folio)) {
+                int i;
+
+                for (i = 0; i < nr; i++) {
+                        struct page *page = folio_page(folio, i);
+
+                        /* increment count (starts at -1) */
+                        atomic_set(&page->_mapcount, 0);
+                        SetPageAnonExclusive(page);
+                }
+
+                atomic_set(&folio->_nr_pages_mapped, nr);
         } else {
                 /* increment count (starts at -1) */
                 atomic_set(&folio->_entire_mapcount, 0);
                 atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
-                nr = folio_nr_pages(folio);
+                SetPageAnonExclusive(&folio->page);
                 __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
         }
 
         __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
-        __folio_set_anon(folio, vma, address, true);
-        SetPageAnonExclusive(&folio->page);
 }
 
 /**
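To summarize the accounting scheme the hunk above introduces, here is a small user-space sketch of the three branches. The mock_folio struct, the helper name, and the COMPOUND_MAPPED value (assumed to be 0x800000, as in kernels of that era) are illustration-only assumptions; only the branch structure mirrors the patch: small folios keep the order-0 path, large-but-not-pmd-mappable folios account every subpage like an order-0 page, and pmd-mappable folios keep the existing THP path.

```c
/*
 * User-space illustration of the accounting branches in the hunk above.
 * The types, helper, and COMPOUND_MAPPED value are assumptions for the
 * sketch, not kernel definitions.
 */
#include <stdio.h>
#include <stdbool.h>

#define COMPOUND_MAPPED 0x800000	/* assumed value */

struct mock_folio {
	int nr_pages;
	bool pmd_mappable;
};

static void account_new_anon(const struct mock_folio *folio)
{
	int nr = folio->nr_pages;

	if (nr == 1) {
		/* small folio: one per-page mapcount, one AnonExclusive bit */
		printf("order-0:  _mapcount=0, NR_ANON_MAPPED += 1\n");
	} else if (!folio->pmd_mappable) {
		/* large but not pmd-mappable: each subpage accounted like order-0 */
		printf("mTHP:     %d per-page mapcounts, _nr_pages_mapped=%d, "
		       "NR_ANON_MAPPED += %d\n", nr, nr, nr);
	} else {
		/* pmd-mappable: entire mapcount plus the THP counters */
		printf("PMD THP:  _entire_mapcount=0, _nr_pages_mapped=%#x, "
		       "NR_ANON_THPS += %d, NR_ANON_MAPPED += %d\n",
		       COMPOUND_MAPPED, nr, nr);
	}
}

int main(void)
{
	struct mock_folio small = { 1, false };
	struct mock_folio mthp  = { 16, false };	/* e.g. a 64K folio */
	struct mock_folio pmd   = { 512, true };	/* 2M with 4K pages */

	account_new_anon(&small);
	account_new_anon(&mthp);
	account_new_anon(&pmd);
	return 0;
}
```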