Message ID: 20240830040101.822209-16-Liam.Howlett@oracle.com (mailing list archive)
State: New
Series: Avoid MAP_FIXED gap exposure
Hi Liam R. Howlett, Greetings! There is WARNING in __split_vma in next-20240902 in syzkaller fuzzing test. Bisected and found first bad commit: " 3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure " It's same as below patch. After reverted the above commit on top of next-20240902, this issue was gone. All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the same commit, all detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect " [ 19.953726] cgroup: Unknown subsys name 'net' [ 20.045121] cgroup: Unknown subsys name 'rlimit' [ 20.138332] ------------[ cut here ]------------ [ 20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020 [ 20.139075] Modules linked in: [ 20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1 [ 20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 20.140337] RIP: 0010:__split_vma+0x4d7/0x1020 [ 20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18 [ 20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293 [ 20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117 [ 20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006 [ 20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd [ 20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40 [ 20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20 [ 20.143529] FS: 00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000 [ 20.144308] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0 [ 20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 [ 20.145665] PKRU: 55555554 [ 20.145809] Call Trace: [ 20.145940] <TASK> [ 20.146056] ? show_regs+0x6d/0x80 [ 20.146247] ? __warn+0xf3/0x380 [ 20.146431] ? report_bug+0x25e/0x4b0 [ 20.146650] ? __split_vma+0x4d7/0x1020 [ 20.146864] ? report_bug+0x2cb/0x4b0 [ 20.147070] ? __split_vma+0x4d7/0x1020 [ 20.147281] ? __split_vma+0x4d8/0x1020 [ 20.147492] ? handle_bug+0xf1/0x190 [ 20.147756] ? exc_invalid_op+0x3c/0x80 [ 20.147971] ? asm_exc_invalid_op+0x1f/0x30 [ 20.148208] ? 
__split_vma+0x467/0x1020 [ 20.148417] ? __split_vma+0x4d7/0x1020 [ 20.148628] ? __split_vma+0x4d7/0x1020 [ 20.148845] ? __pfx___split_vma+0x10/0x10 [ 20.149068] ? __kasan_check_read+0x15/0x20 [ 20.149300] ? mark_lock.part.0+0xf3/0x17b0 [ 20.149535] ? __kasan_check_read+0x15/0x20 [ 20.149769] vms_gather_munmap_vmas+0x178/0xf70 [ 20.150024] do_vmi_align_munmap+0x26e/0x640 [ 20.150257] ? __pfx___lock_acquire+0x10/0x10 [ 20.150500] ? __pfx_do_vmi_align_munmap+0x10/0x10 [ 20.150758] ? __this_cpu_preempt_check+0x21/0x30 [ 20.151018] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 20.151308] ? mtree_range_walk+0x728/0xb70 [ 20.151602] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 20.151891] ? mas_walk+0x6a7/0x8b0 [ 20.152096] do_vmi_munmap+0x202/0x400 [ 20.152307] __vm_munmap+0x182/0x380 [ 20.152509] ? __pfx___vm_munmap+0x10/0x10 [ 20.152729] ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0 [ 20.153061] ? lockdep_hardirqs_on+0x89/0x110 [ 20.153299] ? trace_hardirqs_on+0x51/0x60 [ 20.153533] ? __audit_syscall_entry+0x39c/0x500 [ 20.153790] __x64_sys_munmap+0x62/0x90 [ 20.154001] x64_sys_call+0x198f/0x2140 [ 20.154212] do_syscall_64+0x6d/0x140 [ 20.154414] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 20.154683] RIP: 0033:0x7eff9343ee5d [ 20.154885] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48 [ 20.155872] RSP: 002b:00007ffe28711628 EFLAGS: 00000206 ORIG_RAX: 000000000000000b [ 20.156267] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff9343ee5d [ 20.156635] RDX: ffffffffffffff80 RSI: 0000000000001000 RDI: 0000000020ffc000 [ 20.157002] RBP: 00007ffe28711640 R08: 0000000000000000 R09: 00007ffe28711640 [ 20.157369] R10: 0000000000000003 R11: 0000000000000206 R12: 00007ffe287117d8 [ 20.157737] R13: 00000000004027cc R14: 0000000000404e08 R15: 00007eff93844000 [ 20.158122] </TASK> [ 20.158245] irq event stamp: 2019 [ 20.158426] hardirqs last enabled at (2027): [<ffffffff814629e4>] console_unlock+0x224/0x240 [ 20.158869] hardirqs last disabled at (2034): [<ffffffff814629c9>] console_unlock+0x209/0x240 [ 20.159306] softirqs last enabled at (1954): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120 [ 20.159797] softirqs last disabled at (2051): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120 [ 20.160233] ---[ end trace 0000000000000000 ]--- " I hope it's helpful. Thanks! --- If you don't need the following environment to reproduce the problem or if you already have one reproduced environment, please ignore the following information. How to reproduce: git clone https://gitlab.com/xupengfe/repro_vm_env.git cd repro_vm_env tar -xvf repro_vm_env.tar.gz cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0 // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel // You could change the bzImage_xxx as you want // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version You could use below command to log in, there is no password for root. 
ssh -p 10023 root@localhost After login vm(virtual machine) successfully, you could transfer reproduced binary to the vm by below way, and reproduce the problem in vm: gcc -pthread -o repro repro.c scp -P 10023 repro root@localhost:/root/ Get the bzImage for target kernel: Please use target kconfig and copy it to kernel_src/.config make olddefconfig make -jx bzImage //x should equal or less than cpu num your pc has Fill the bzImage file into above start3.sh to load the target kernel in vm. Tips: If you already have qemu-system-x86_64, please ignore below info. If you want to install qemu v7.1.0 version: git clone https://github.com/qemu/qemu.git cd qemu git checkout -f v7.1.0 mkdir build cd build yum install -y ninja-build.x86_64 yum -y install libslirp-devel.x86_64 ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp make make install Best Regards, Thanks! On 2024-08-30 at 00:00:55 -0400, Liam R. Howlett wrote: > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com> > > Prior to call_mmap(), the vmas that will be replaced need to clear the > way for what may happen in the call_mmap(). This clean up work includes > clearing the ptes and calling the close() vm_ops. Some users do more > setup than can be restored by calling the vm_ops open() function. It is > safer to store the gap in the vma tree in these cases. > > That is to say that the failure scenario that existed before the > MAP_FIXED gap exposure is restored as it is safer than trying to undo a > partial mapping. > > Since abort_munmap_vmas() is only reattaching vmas with this change, the > function is renamed to reattach_vmas(). > > There is also a secondary failure that may occur if there is not enough > memory to store the gap. In this case, the vmas are reattached and > resources freed. If the system cannot complete the call_mmap() and > fails to allocate with GFP_KERNEL, then the system will print a warning > about the failure. > > Signed-off-by: Liam R. 
Howlett <Liam.Howlett@Oracle.com> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- > mm/mmap.c | 3 +-- > mm/vma.c | 4 +-- > mm/vma.h | 80 ++++++++++++++++++++++++++++++++++++++++--------------- > 3 files changed, 61 insertions(+), 26 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 405e0432c78e..e1e5c78b6c3c 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1623,8 +1623,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, > vm_unacct_memory(charged); > > abort_munmap: > - if (vms.nr_pages) > - abort_munmap_vmas(&mas_detach, vms.closed_vm_ops); > + vms_abort_munmap_vmas(&vms, &mas_detach); > gather_failed: > validate_mm(mm); > return error; > diff --git a/mm/vma.c b/mm/vma.c > index 648c58da8ad4..d2d71d659d1e 100644 > --- a/mm/vma.c > +++ b/mm/vma.c > @@ -878,7 +878,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, > munmap_gather_failed: > end_split_failed: > modify_vma_failed: > - abort_munmap_vmas(mas_detach, /* closed = */ false); > + reattach_vmas(mas_detach); > start_split_failed: > map_count_exceeded: > return error; > @@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, > return 0; > > clear_tree_failed: > - abort_munmap_vmas(&mas_detach, /* closed = */ false); > + reattach_vmas(&mas_detach); > gather_failed: > validate_mm(mm); > return error; > diff --git a/mm/vma.h b/mm/vma.h > index 64b44f5a0a11..b2306d13d456 100644 > --- a/mm/vma.h > +++ b/mm/vma.h > @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, > int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, > unsigned long start, unsigned long end, pgoff_t pgoff); > > +static inline int vma_iter_store_gfp(struct vma_iterator *vmi, > + struct vm_area_struct *vma, gfp_t gfp) > + > +{ > + if (vmi->mas.status != ma_start && > + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) > + vma_iter_invalidate(vmi); > + > + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); > + mas_store_gfp(&vmi->mas, vma, gfp); > + if (unlikely(mas_is_err(&vmi->mas))) > + return -ENOMEM; > + > + return 0; > +} > + > #ifdef CONFIG_MMU > /* > * init_vma_munmap() - Initializer wrapper for vma_munmap_struct > @@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms, > struct ma_state *mas_detach, bool mm_wr_locked); > > /* > - * abort_munmap_vmas - Undo any munmap work and free resources > + * reattach_vmas() - Undo any munmap work and free resources > + * @mas_detach: The maple state with the detached maple tree > * > * Reattach any detached vmas and free up the maple tree used to track the vmas. > */ > -static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed) > +static inline void reattach_vmas(struct ma_state *mas_detach) > { > struct vm_area_struct *vma; > > mas_set(mas_detach, 0); > - mas_for_each(mas_detach, vma, ULONG_MAX) { > + mas_for_each(mas_detach, vma, ULONG_MAX) > vma_mark_detached(vma, false); > - if (closed && vma->vm_ops && vma->vm_ops->open) > - vma->vm_ops->open(vma); > - } > > __mt_destroy(mas_detach->tree); > } > > +/* > + * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap() > + * operation. > + * @vms: The vma unmap structure > + * @mas_detach: The maple state with the detached maple tree > + * > + * Reattach any detached vmas, free up the maple tree used to track the vmas. 
> + * If that's not possible because the ptes are cleared (and vm_ops->closed() may > + * have been called), then a NULL is written over the vmas and the vmas are > + * removed (munmap() completed). > + */ > +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, > + struct ma_state *mas_detach) > +{ > + struct ma_state *mas = &vms->vmi->mas; > + if (!vms->nr_pages) > + return; > + > + if (vms->clear_ptes) > + return reattach_vmas(mas_detach); > + > + /* > + * Aborting cannot just call the vm_ops open() because they are often > + * not symmetrical and state data has been lost. Resort to the old > + * failure method of leaving a gap where the MAP_FIXED mapping failed. > + */ > + mas_set_range(mas, vms->start, vms->end); > + if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) { > + pr_warn_once("%s: (%d) Unable to abort munmap() operation\n", > + current->comm, current->pid); > + /* Leaving vmas detached and in-tree may hamper recovery */ > + reattach_vmas(mas_detach); > + } else { > + /* Clean up the insertion of the unfortunate gap */ > + vms_complete_munmap_vmas(vms, mas_detach); > + } > +} > + > int > do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, > struct mm_struct *mm, unsigned long start, > @@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi, > return mas_prev(&vmi->mas, min); > } > > -static inline int vma_iter_store_gfp(struct vma_iterator *vmi, > - struct vm_area_struct *vma, gfp_t gfp) > -{ > - if (vmi->mas.status != ma_start && > - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) > - vma_iter_invalidate(vmi); > - > - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); > - mas_store_gfp(&vmi->mas, vma, gfp); > - if (unlikely(mas_is_err(&vmi->mas))) > - return -ENOMEM; > - > - return 0; > -} > - > - > /* > * These three helpers classifies VMAs for virtual memory accounting. > */ > -- > 2.43.0 >
On Tue, Sep 03, 2024 at 11:07:38AM GMT, Pengfei Xu wrote: > Hi Liam R. Howlett, > > Greetings! > > There is WARNING in __split_vma in next-20240902 in syzkaller fuzzing test. > Bisected and found first bad commit: > " > 3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure > " > It's same as below patch. > After reverted the above commit on top of next-20240902, this issue was gone. > > All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma > Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c > Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog > Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report > Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log > bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log > > And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the > same commit, all detailed info: > https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect > > " Thanks for the report! Looking into it. > [ 19.953726] cgroup: Unknown subsys name 'net' > [ 20.045121] cgroup: Unknown subsys name 'rlimit' > [ 20.138332] ------------[ cut here ]------------ > [ 20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020 > [ 20.139075] Modules linked in: > [ 20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1 > [ 20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > [ 20.140337] RIP: 0010:__split_vma+0x4d7/0x1020 > [ 20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18 > [ 20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293 > [ 20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117 > [ 20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006 > [ 20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd > [ 20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40 > [ 20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20 > [ 20.143529] FS: 00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000 > [ 20.144308] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0 > [ 20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 > [ 20.145665] PKRU: 55555554 > [ 20.145809] Call Trace: > [ 20.145940] <TASK> > [ 20.146056] ? show_regs+0x6d/0x80 > [ 20.146247] ? __warn+0xf3/0x380 > [ 20.146431] ? report_bug+0x25e/0x4b0 > [ 20.146650] ? __split_vma+0x4d7/0x1020 Have repro'd locally. 
This is, unsurprisingly, on this line (even if trace above doesn't decode to it unfortunately): vma_iter_config(vmi, new->vm_start, new->vm_end); The VMA in question spans 0x20ff9000, 0x21000000 so is 7 pages in size. At the point of invoking vma_iter_config(), the vma iterator points at 0x20ff9001, but we try to position it to 0x20ff9000. It seems the issue is that in do_vmi_munmap(), after vma_find() is called, we find a VMA at 0x20ff9000, but the VMI is positioned to 0x20ff9001...! Perhaps maple tree corruption in a previous call somehow? I can interestingly only repro this if I clear the qemu image each time, I'm guessing this is somehow tied to the instantiation of the cgroup setup or such? Am continuing the investigation. > [ 20.146864] ? report_bug+0x2cb/0x4b0 > [ 20.147070] ? __split_vma+0x4d7/0x1020 > [ 20.147281] ? __split_vma+0x4d8/0x1020 > [ 20.147492] ? handle_bug+0xf1/0x190 > [ 20.147756] ? exc_invalid_op+0x3c/0x80 > [ 20.147971] ? asm_exc_invalid_op+0x1f/0x30 > [ 20.148208] ? __split_vma+0x467/0x1020 > [ 20.148417] ? __split_vma+0x4d7/0x1020 > [ 20.148628] ? __split_vma+0x4d7/0x1020 > [ 20.148845] ? __pfx___split_vma+0x10/0x10 > [ 20.149068] ? __kasan_check_read+0x15/0x20 > [ 20.149300] ? mark_lock.part.0+0xf3/0x17b0 > [ 20.149535] ? __kasan_check_read+0x15/0x20 > [ 20.149769] vms_gather_munmap_vmas+0x178/0xf70 > [ 20.150024] do_vmi_align_munmap+0x26e/0x640 > [ 20.150257] ? __pfx___lock_acquire+0x10/0x10 > [ 20.150500] ? __pfx_do_vmi_align_munmap+0x10/0x10 > [ 20.150758] ? __this_cpu_preempt_check+0x21/0x30 > [ 20.151018] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 > [ 20.151308] ? mtree_range_walk+0x728/0xb70 > [ 20.151602] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 > [ 20.151891] ? mas_walk+0x6a7/0x8b0 > [ 20.152096] do_vmi_munmap+0x202/0x400 > [ 20.152307] __vm_munmap+0x182/0x380 > [ 20.152509] ? __pfx___vm_munmap+0x10/0x10 > [ 20.152729] ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0 > [ 20.153061] ? lockdep_hardirqs_on+0x89/0x110 > [ 20.153299] ? trace_hardirqs_on+0x51/0x60 > [ 20.153533] ? __audit_syscall_entry+0x39c/0x500 > [ 20.153790] __x64_sys_munmap+0x62/0x90 > [ 20.154001] x64_sys_call+0x198f/0x2140 > [ 20.154212] do_syscall_64+0x6d/0x140 > [ 20.154414] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 20.154683] RIP: 0033:0x7eff9343ee5d > [ 20.154885] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48 > [ 20.155872] RSP: 002b:00007ffe28711628 EFLAGS: 00000206 ORIG_RAX: 000000000000000b > [ 20.156267] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff9343ee5d > [ 20.156635] RDX: ffffffffffffff80 RSI: 0000000000001000 RDI: 0000000020ffc000 > [ 20.157002] RBP: 00007ffe28711640 R08: 0000000000000000 R09: 00007ffe28711640 > [ 20.157369] R10: 0000000000000003 R11: 0000000000000206 R12: 00007ffe287117d8 > [ 20.157737] R13: 00000000004027cc R14: 0000000000404e08 R15: 00007eff93844000 > [ 20.158122] </TASK> > [ 20.158245] irq event stamp: 2019 > [ 20.158426] hardirqs last enabled at (2027): [<ffffffff814629e4>] console_unlock+0x224/0x240 > [ 20.158869] hardirqs last disabled at (2034): [<ffffffff814629c9>] console_unlock+0x209/0x240 > [ 20.159306] softirqs last enabled at (1954): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120 > [ 20.159797] softirqs last disabled at (2051): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120 > [ 20.160233] ---[ end trace 0000000000000000 ]--- > " > > I hope it's helpful. > > Thanks! 
> > --- > > If you don't need the following environment to reproduce the problem or if you > already have one reproduced environment, please ignore the following information. > > How to reproduce: > git clone https://gitlab.com/xupengfe/repro_vm_env.git > cd repro_vm_env > tar -xvf repro_vm_env.tar.gz > cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0 > // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel > // You could change the bzImage_xxx as you want > // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version > You could use below command to log in, there is no password for root. > ssh -p 10023 root@localhost > > After login vm(virtual machine) successfully, you could transfer reproduced > binary to the vm by below way, and reproduce the problem in vm: > gcc -pthread -o repro repro.c > scp -P 10023 repro root@localhost:/root/ > > Get the bzImage for target kernel: > Please use target kconfig and copy it to kernel_src/.config > make olddefconfig > make -jx bzImage //x should equal or less than cpu num your pc has > > Fill the bzImage file into above start3.sh to load the target kernel in vm. > > Tips: > If you already have qemu-system-x86_64, please ignore below info. > If you want to install qemu v7.1.0 version: > git clone https://github.com/qemu/qemu.git > cd qemu > git checkout -f v7.1.0 > mkdir build > cd build > yum install -y ninja-build.x86_64 > yum -y install libslirp-devel.x86_64 > ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp > make > make install > > Best Regards, > Thanks! > > > On 2024-08-30 at 00:00:55 -0400, Liam R. Howlett wrote: > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com> > > > > Prior to call_mmap(), the vmas that will be replaced need to clear the > > way for what may happen in the call_mmap(). This clean up work includes > > clearing the ptes and calling the close() vm_ops. Some users do more > > setup than can be restored by calling the vm_ops open() function. It is > > safer to store the gap in the vma tree in these cases. > > > > That is to say that the failure scenario that existed before the > > MAP_FIXED gap exposure is restored as it is safer than trying to undo a > > partial mapping. > > > > Since abort_munmap_vmas() is only reattaching vmas with this change, the > > function is renamed to reattach_vmas(). > > > > There is also a secondary failure that may occur if there is not enough > > memory to store the gap. In this case, the vmas are reattached and > > resources freed. If the system cannot complete the call_mmap() and > > fails to allocate with GFP_KERNEL, then the system will print a warning > > about the failure. > > > > Signed-off-by: Liam R. 
Howlett <Liam.Howlett@Oracle.com> > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > --- > > mm/mmap.c | 3 +-- > > mm/vma.c | 4 +-- > > mm/vma.h | 80 ++++++++++++++++++++++++++++++++++++++++--------------- > > 3 files changed, 61 insertions(+), 26 deletions(-) > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 405e0432c78e..e1e5c78b6c3c 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1623,8 +1623,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, > > vm_unacct_memory(charged); > > > > abort_munmap: > > - if (vms.nr_pages) > > - abort_munmap_vmas(&mas_detach, vms.closed_vm_ops); > > + vms_abort_munmap_vmas(&vms, &mas_detach); > > gather_failed: > > validate_mm(mm); > > return error; > > diff --git a/mm/vma.c b/mm/vma.c > > index 648c58da8ad4..d2d71d659d1e 100644 > > --- a/mm/vma.c > > +++ b/mm/vma.c > > @@ -878,7 +878,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, > > munmap_gather_failed: > > end_split_failed: > > modify_vma_failed: > > - abort_munmap_vmas(mas_detach, /* closed = */ false); > > + reattach_vmas(mas_detach); > > start_split_failed: > > map_count_exceeded: > > return error; > > @@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, > > return 0; > > > > clear_tree_failed: > > - abort_munmap_vmas(&mas_detach, /* closed = */ false); > > + reattach_vmas(&mas_detach); > > gather_failed: > > validate_mm(mm); > > return error; > > diff --git a/mm/vma.h b/mm/vma.h > > index 64b44f5a0a11..b2306d13d456 100644 > > --- a/mm/vma.h > > +++ b/mm/vma.h > > @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, > > int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, > > unsigned long start, unsigned long end, pgoff_t pgoff); > > > > +static inline int vma_iter_store_gfp(struct vma_iterator *vmi, > > + struct vm_area_struct *vma, gfp_t gfp) > > + > > +{ > > + if (vmi->mas.status != ma_start && > > + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) > > + vma_iter_invalidate(vmi); > > + > > + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); > > + mas_store_gfp(&vmi->mas, vma, gfp); > > + if (unlikely(mas_is_err(&vmi->mas))) > > + return -ENOMEM; > > + > > + return 0; > > +} > > + > > #ifdef CONFIG_MMU > > /* > > * init_vma_munmap() - Initializer wrapper for vma_munmap_struct > > @@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms, > > struct ma_state *mas_detach, bool mm_wr_locked); > > > > /* > > - * abort_munmap_vmas - Undo any munmap work and free resources > > + * reattach_vmas() - Undo any munmap work and free resources > > + * @mas_detach: The maple state with the detached maple tree > > * > > * Reattach any detached vmas and free up the maple tree used to track the vmas. > > */ > > -static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed) > > +static inline void reattach_vmas(struct ma_state *mas_detach) > > { > > struct vm_area_struct *vma; > > > > mas_set(mas_detach, 0); > > - mas_for_each(mas_detach, vma, ULONG_MAX) { > > + mas_for_each(mas_detach, vma, ULONG_MAX) > > vma_mark_detached(vma, false); > > - if (closed && vma->vm_ops && vma->vm_ops->open) > > - vma->vm_ops->open(vma); > > - } > > > > __mt_destroy(mas_detach->tree); > > } > > > > +/* > > + * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap() > > + * operation. 
> > + * @vms: The vma unmap structure > > + * @mas_detach: The maple state with the detached maple tree > > + * > > + * Reattach any detached vmas, free up the maple tree used to track the vmas. > > + * If that's not possible because the ptes are cleared (and vm_ops->closed() may > > + * have been called), then a NULL is written over the vmas and the vmas are > > + * removed (munmap() completed). > > + */ > > +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, > > + struct ma_state *mas_detach) > > +{ > > + struct ma_state *mas = &vms->vmi->mas; > > + if (!vms->nr_pages) > > + return; > > + > > + if (vms->clear_ptes) > > + return reattach_vmas(mas_detach); > > + > > + /* > > + * Aborting cannot just call the vm_ops open() because they are often > > + * not symmetrical and state data has been lost. Resort to the old > > + * failure method of leaving a gap where the MAP_FIXED mapping failed. > > + */ > > + mas_set_range(mas, vms->start, vms->end); > > + if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) { > > + pr_warn_once("%s: (%d) Unable to abort munmap() operation\n", > > + current->comm, current->pid); > > + /* Leaving vmas detached and in-tree may hamper recovery */ > > + reattach_vmas(mas_detach); > > + } else { > > + /* Clean up the insertion of the unfortunate gap */ > > + vms_complete_munmap_vmas(vms, mas_detach); > > + } > > +} > > + > > int > > do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, > > struct mm_struct *mm, unsigned long start, > > @@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi, > > return mas_prev(&vmi->mas, min); > > } > > > > -static inline int vma_iter_store_gfp(struct vma_iterator *vmi, > > - struct vm_area_struct *vma, gfp_t gfp) > > -{ > > - if (vmi->mas.status != ma_start && > > - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) > > - vma_iter_invalidate(vmi); > > - > > - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); > > - mas_store_gfp(&vmi->mas, vma, gfp); > > - if (unlikely(mas_is_err(&vmi->mas))) > > - return -ENOMEM; > > - > > - return 0; > > -} > > - > > - > > /* > > * These three helpers classifies VMAs for virtual memory accounting. > > */ > > -- > > 2.43.0 > >
Hi Andrew - TL;DR of this is - please apply the fix patch attached below to fix a problem in this series, thanks! :) On Tue, Sep 03, 2024 at 12:00:04PM GMT, Lorenzo Stoakes wrote: > On Tue, Sep 03, 2024 at 11:07:38AM GMT, Pengfei Xu wrote: > > Hi Liam R. Howlett, > > > > Greetings! > > > > There is WARNING in __split_vma in next-20240902 in syzkaller fuzzing test. > > Bisected and found first bad commit: > > " > > 3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure > > " > > It's same as below patch. > > After reverted the above commit on top of next-20240902, this issue was gone. > > > > All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma > > Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c > > Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog > > Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report > > Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin > > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log > > bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz > > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log > > > > And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the > > same commit, all detailed info: > > https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect > > > > " > > Thanks for the report! Looking into it. 
> > > [ 19.953726] cgroup: Unknown subsys name 'net' > > [ 20.045121] cgroup: Unknown subsys name 'rlimit' > > [ 20.138332] ------------[ cut here ]------------ > > [ 20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020 > > [ 20.139075] Modules linked in: > > [ 20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1 > > [ 20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > > [ 20.140337] RIP: 0010:__split_vma+0x4d7/0x1020 > > [ 20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18 > > [ 20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293 > > [ 20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117 > > [ 20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006 > > [ 20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd > > [ 20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40 > > [ 20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20 > > [ 20.143529] FS: 00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000 > > [ 20.144308] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0 > > [ 20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [ 20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 > > [ 20.145665] PKRU: 55555554 > > [ 20.145809] Call Trace: > > [ 20.145940] <TASK> > > [ 20.146056] ? show_regs+0x6d/0x80 > > [ 20.146247] ? __warn+0xf3/0x380 > > [ 20.146431] ? report_bug+0x25e/0x4b0 > > [ 20.146650] ? __split_vma+0x4d7/0x1020 > > Have repro'd locally. This is, unsurprisingly, on this line (even if trace above > doesn't decode to it unfortunately): > > vma_iter_config(vmi, new->vm_start, new->vm_end); > > The VMA in question spans 0x20ff9000, 0x21000000 so is 7 pages in size. > > At the point of invoking vma_iter_config(), the vma iterator points at > 0x20ff9001, but we try to position it to 0x20ff9000. > > It seems the issue is that in do_vmi_munmap(), after vma_find() is called, we > find a VMA at 0x20ff9000, but the VMI is positioned to 0x20ff9001...! > > Perhaps maple tree corruption in a previous call somehow? > > > I can interestingly only repro this if I clear the qemu image each time, I'm > guessing this is somehow tied to the instantiation of the cgroup setup or such? > > Am continuing the investigation. > [snip] OK I turned on CONFIG_DEBUG_VM_MAPLE_TREE and am hitting VM_WARN_ON_ONCE_MM(vma->vm_start != vmi_start, mm) after gather_failed is hit in mmap_region() as a result of call_mmap() returning an error. This is invoking kernfs_fop_mmap(), which returns -ENODEV because the KERNFS_HAS_MMAP flag has not been set for the cgroup file being mapped. This results in mmap_region() jumping to unmap_and_free_vma, which unmaps the page tables in the region and goes on to abort the unmap operation. The validate_mm() that fails is called by vms_complete_munmap_vmas() which was invoked by vms_abort_munmap_vmas(). 
The tree is then corrupted:

vma ffff888013414d20 start 0000000020ff9000 end 0000000021000000 mm ffff88800d06ae40
prot 25 anon_vma ffff8880132cc660 vm_ops 0000000000000000
pgoff 20ff9 file 0000000000000000 private_data 0000000000000000
flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty)
tree range: ffff888013414d20 start 20ff9001 end 20ffffff

Incorrectly starting at the off-by-one 0x20ff9001 rather than 0x20ff9000. This is a very telling off-by... the programmer's favourite off-by-1 :)

Which then made me think of how mas operations have an _inclusive_ end while VMA ones have an _exclusive_ one. And so I tracked down the cause to vms_abort_munmap_vmas(), which was invoking mas_set_range() using vms->end (exclusive) as if it were inclusive, and thus wrongly cleared index 0x20ff9000.

The solution is simply to subtract one from vms->end, as done in the attached fix-patch. I confirmed this fixed the issue, as I was able to set up a reliable repro locally.

Thanks for the report! Great find.

----8<----
From 3e7decc5390b0edc462afa74794a8208e25e50f2 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Tue, 3 Sep 2024 13:20:34 +0100
Subject: [PATCH] mm: fix off-by-one error in vms_abort_munmap_vmas()

Maple tree ranges have an inclusive end, VMAs do not, so we must subtract
one from the VMA-specific end value when using a mas_...() function.

We failed to do so in vms_abort_munmap_vmas() which resulted in a store
overlapping the intended range by one byte, and thus corrupting the maple
tree.

Fix this by subtracting one from vms->end() passed into mas_set_range().

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vma.h b/mm/vma.h
index 370d3246f147..819f994cf727 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -240,7 +240,7 @@ static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
 	 * not symmetrical and state data has been lost. Resort to the old
 	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
 	 */
-	mas_set_range(mas, vms->start, vms->end);
+	mas_set_range(mas, vms->start, vms->end - 1);
 	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
 		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
 			     current->comm, current->pid);
--
2.46.0
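To make the inclusive/exclusive conversion concrete, here is a minimal sketch (the helper name is made up purely for illustration; the addresses are the ones from the report above) of how a half-open, VMA-style range has to be converted before being handed to a mas_...() call, mirroring what vma_iter_store_gfp() already does with vma->vm_end - 1 and what the fix does with vms->end - 1:

#include <linux/gfp.h>
#include <linux/maple_tree.h>

/*
 * A VMA covers the half-open range [vm_start, vm_end); maple tree
 * operations such as mas_set_range() take an inclusive last index,
 * so the conversion needs "end - 1".
 *
 * In the report above the aborted MAP_FIXED range ended at 0x20ff9000,
 * so passing vms->end unmodified made the NULL store also cover index
 * 0x20ff9000 - the first byte of the next VMA - truncating its tree
 * entry to start at 0x20ff9001.
 */
static int store_fixed_gap(struct ma_state *mas, unsigned long start,
			   unsigned long end /* exclusive, like vm_end */)
{
	mas_set_range(mas, start, end - 1);
	return mas_store_gfp(mas, NULL, GFP_KERNEL);
}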
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240903 08:27]: > Hi Andrew - TL;DR of this is - please apply the fix patch attached below to > fix a problem in this series, thanks! :) Oh yes. I should have caught this, thanks Lorenzo. Cheers, Liam > ----8<---- > From 3e7decc5390b0edc462afa74794a8208e25e50f2 Mon Sep 17 00:00:00 2001 > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Date: Tue, 3 Sep 2024 13:20:34 +0100 > Subject: [PATCH] mm: fix off-by-one error in vms_abort_munmap_vmas() > > Maple tree ranges have an inclusive end, VMAs do not, so we must subtract > one from the VMA-specific end value when using a mas_...() function. > > We failed to do so in vms_abort_munmap_vmas() which resulted in a store > overlapping the intended range by one byte, and thus corrupting the maple > tree. > > Fix this by subtracting one from vms->end() passed into mas_set_range(). > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- > mm/vma.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/vma.h b/mm/vma.h > index 370d3246f147..819f994cf727 100644 > --- a/mm/vma.h > +++ b/mm/vma.h > @@ -240,7 +240,7 @@ static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, > * not symmetrical and state data has been lost. Resort to the old > * failure method of leaving a gap where the MAP_FIXED mapping failed. > */ > - mas_set_range(mas, vms->start, vms->end); > + mas_set_range(mas, vms->start, vms->end - 1); > if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) { > pr_warn_once("%s: (%d) Unable to abort munmap() operation\n", > current->comm, current->pid); > -- > 2.46.0
diff --git a/mm/mmap.c b/mm/mmap.c index 405e0432c78e..e1e5c78b6c3c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1623,8 +1623,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vm_unacct_memory(charged); abort_munmap: - if (vms.nr_pages) - abort_munmap_vmas(&mas_detach, vms.closed_vm_ops); + vms_abort_munmap_vmas(&vms, &mas_detach); gather_failed: validate_mm(mm); return error; diff --git a/mm/vma.c b/mm/vma.c index 648c58da8ad4..d2d71d659d1e 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -878,7 +878,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, munmap_gather_failed: end_split_failed: modify_vma_failed: - abort_munmap_vmas(mas_detach, /* closed = */ false); + reattach_vmas(mas_detach); start_split_failed: map_count_exceeded: return error; @@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, return 0; clear_tree_failed: - abort_munmap_vmas(&mas_detach, /* closed = */ false); + reattach_vmas(&mas_detach); gather_failed: validate_mm(mm); return error; diff --git a/mm/vma.h b/mm/vma.h index 64b44f5a0a11..b2306d13d456 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgoff_t pgoff); +static inline int vma_iter_store_gfp(struct vma_iterator *vmi, + struct vm_area_struct *vma, gfp_t gfp) + +{ + if (vmi->mas.status != ma_start && + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) + vma_iter_invalidate(vmi); + + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); + mas_store_gfp(&vmi->mas, vma, gfp); + if (unlikely(mas_is_err(&vmi->mas))) + return -ENOMEM; + + return 0; +} + #ifdef CONFIG_MMU /* * init_vma_munmap() - Initializer wrapper for vma_munmap_struct @@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms, struct ma_state *mas_detach, bool mm_wr_locked); /* - * abort_munmap_vmas - Undo any munmap work and free resources + * reattach_vmas() - Undo any munmap work and free resources + * @mas_detach: The maple state with the detached maple tree * * Reattach any detached vmas and free up the maple tree used to track the vmas. */ -static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed) +static inline void reattach_vmas(struct ma_state *mas_detach) { struct vm_area_struct *vma; mas_set(mas_detach, 0); - mas_for_each(mas_detach, vma, ULONG_MAX) { + mas_for_each(mas_detach, vma, ULONG_MAX) vma_mark_detached(vma, false); - if (closed && vma->vm_ops && vma->vm_ops->open) - vma->vm_ops->open(vma); - } __mt_destroy(mas_detach->tree); } +/* + * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap() + * operation. + * @vms: The vma unmap structure + * @mas_detach: The maple state with the detached maple tree + * + * Reattach any detached vmas, free up the maple tree used to track the vmas. + * If that's not possible because the ptes are cleared (and vm_ops->closed() may + * have been called), then a NULL is written over the vmas and the vmas are + * removed (munmap() completed). + */ +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, + struct ma_state *mas_detach) +{ + struct ma_state *mas = &vms->vmi->mas; + if (!vms->nr_pages) + return; + + if (vms->clear_ptes) + return reattach_vmas(mas_detach); + + /* + * Aborting cannot just call the vm_ops open() because they are often + * not symmetrical and state data has been lost. 
Resort to the old + * failure method of leaving a gap where the MAP_FIXED mapping failed. + */ + mas_set_range(mas, vms->start, vms->end); + if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) { + pr_warn_once("%s: (%d) Unable to abort munmap() operation\n", + current->comm, current->pid); + /* Leaving vmas detached and in-tree may hamper recovery */ + reattach_vmas(mas_detach); + } else { + /* Clean up the insertion of the unfortunate gap */ + vms_complete_munmap_vmas(vms, mas_detach); + } +} + int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, struct mm_struct *mm, unsigned long start, @@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi, return mas_prev(&vmi->mas, min); } -static inline int vma_iter_store_gfp(struct vma_iterator *vmi, - struct vm_area_struct *vma, gfp_t gfp) -{ - if (vmi->mas.status != ma_start && - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) - vma_iter_invalidate(vmi); - - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); - mas_store_gfp(&vmi->mas, vma, gfp); - if (unlikely(mas_is_err(&vmi->mas))) - return -ENOMEM; - - return 0; -} - - /* * These three helpers classifies VMAs for virtual memory accounting. */
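For reference, the failure pattern described in the thread can be sketched from userspace roughly as follows. This is not the actual syzkaller reproducer (see the repro.c link above for that); the fixed addresses are taken from the trace, and the cgroup file path is only an assumed example of a kernfs file lacking mmap support:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Anonymous mapping whose tail VMA (0x20ff9000-0x21000000, 7 pages)
	 * matches the one dumped above; the exact layout is illustrative.
	 */
	mmap((void *)0x20ff5000, 0xb000, PROT_READ | PROT_WRITE | PROT_EXEC,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	/*
	 * A cgroup (kernfs) file without KERNFS_HAS_MMAP; the path is an
	 * assumption - any kernfs file lacking mmap support would do.
	 */
	int fd = open("/sys/fs/cgroup/cgroup.procs", O_RDONLY);

	/*
	 * MAP_FIXED over the start of the anonymous mapping: the overlapped
	 * VMAs are gathered for replacement, kernfs_fop_mmap() returns
	 * -ENODEV, and mmap_region() aborts via vms_abort_munmap_vmas().
	 * With the off-by-one, the NULL store also clears index 0x20ff9000,
	 * corrupting the tree entry of the remaining VMA.
	 */
	mmap((void *)0x20ff5000, 0x4000, PROT_READ, MAP_PRIVATE | MAP_FIXED,
	     fd, 0);

	/*
	 * Unmapping a page inside the (now corrupted) VMA then trips the
	 * WARN in __split_vma(), as in the trace above.
	 */
	munmap((void *)0x20ffc000, 0x1000);
	return 0;
}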