
[v8,15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure

Message ID 20240830040101.822209-16-Liam.Howlett@oracle.com (mailing list archive)
State New
Series Avoid MAP_FIXED gap exposure

Commit Message

Liam R. Howlett Aug. 30, 2024, 4 a.m. UTC
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Prior to call_mmap(), the vmas that will be replaced need to clear the
way for what may happen in the call_mmap().  This cleanup work includes
clearing the ptes and calling the close() vm_ops.  Some users do more
setup than can be restored by calling the vm_ops open() function.  It is
safer to store the gap in the vma tree in these cases.

That is to say, the failure scenario that existed before the MAP_FIXED
gap exposure is restored, as it is safer than trying to undo a partial
mapping.

Since abort_munmap_vmas() is only reattaching vmas with this change, the
function is renamed to reattach_vmas().

There is also a secondary failure that may occur if there is not enough
memory to store the gap.  In this case, the vmas are reattached and
resources freed.  If the system cannot complete the call_mmap() and
fails to allocate with GFP_KERNEL, then the system will print a warning
about the failure.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c |  3 +--
 mm/vma.c  |  4 +--
 mm/vma.h  | 80 ++++++++++++++++++++++++++++++++++++++++---------------
 3 files changed, 61 insertions(+), 26 deletions(-)

Comments

Pengfei Xu Sept. 3, 2024, 3:07 a.m. UTC | #1
Hi Liam R. Howlett,

Greetings!

There is a WARNING in __split_vma in next-20240902 in syzkaller fuzzing tests.
Bisected and found the first bad commit:
"
3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure
"
It's the same as the patch below.
After reverting the above commit on top of next-20240902, the issue was gone.

All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma
Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c
Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog
Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report
Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin
Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log
bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz
Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log

And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the
same commit, all detailed info:
https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect

"
[   19.953726] cgroup: Unknown subsys name 'net'
[   20.045121] cgroup: Unknown subsys name 'rlimit'
[   20.138332] ------------[ cut here ]------------
[   20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020
[   20.139075] Modules linked in:
[   20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1
[   20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   20.140337] RIP: 0010:__split_vma+0x4d7/0x1020
[   20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18
[   20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293
[   20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117
[   20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006
[   20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd
[   20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40
[   20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20
[   20.143529] FS:  00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
[   20.144308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0
[   20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[   20.145665] PKRU: 55555554
[   20.145809] Call Trace:
[   20.145940]  <TASK>
[   20.146056]  ? show_regs+0x6d/0x80
[   20.146247]  ? __warn+0xf3/0x380
[   20.146431]  ? report_bug+0x25e/0x4b0
[   20.146650]  ? __split_vma+0x4d7/0x1020
[   20.146864]  ? report_bug+0x2cb/0x4b0
[   20.147070]  ? __split_vma+0x4d7/0x1020
[   20.147281]  ? __split_vma+0x4d8/0x1020
[   20.147492]  ? handle_bug+0xf1/0x190
[   20.147756]  ? exc_invalid_op+0x3c/0x80
[   20.147971]  ? asm_exc_invalid_op+0x1f/0x30
[   20.148208]  ? __split_vma+0x467/0x1020
[   20.148417]  ? __split_vma+0x4d7/0x1020
[   20.148628]  ? __split_vma+0x4d7/0x1020
[   20.148845]  ? __pfx___split_vma+0x10/0x10
[   20.149068]  ? __kasan_check_read+0x15/0x20
[   20.149300]  ? mark_lock.part.0+0xf3/0x17b0
[   20.149535]  ? __kasan_check_read+0x15/0x20
[   20.149769]  vms_gather_munmap_vmas+0x178/0xf70
[   20.150024]  do_vmi_align_munmap+0x26e/0x640
[   20.150257]  ? __pfx___lock_acquire+0x10/0x10
[   20.150500]  ? __pfx_do_vmi_align_munmap+0x10/0x10
[   20.150758]  ? __this_cpu_preempt_check+0x21/0x30
[   20.151018]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   20.151308]  ? mtree_range_walk+0x728/0xb70
[   20.151602]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   20.151891]  ? mas_walk+0x6a7/0x8b0
[   20.152096]  do_vmi_munmap+0x202/0x400
[   20.152307]  __vm_munmap+0x182/0x380
[   20.152509]  ? __pfx___vm_munmap+0x10/0x10
[   20.152729]  ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
[   20.153061]  ? lockdep_hardirqs_on+0x89/0x110
[   20.153299]  ? trace_hardirqs_on+0x51/0x60
[   20.153533]  ? __audit_syscall_entry+0x39c/0x500
[   20.153790]  __x64_sys_munmap+0x62/0x90
[   20.154001]  x64_sys_call+0x198f/0x2140
[   20.154212]  do_syscall_64+0x6d/0x140
[   20.154414]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   20.154683] RIP: 0033:0x7eff9343ee5d
[   20.154885] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[   20.155872] RSP: 002b:00007ffe28711628 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
[   20.156267] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff9343ee5d
[   20.156635] RDX: ffffffffffffff80 RSI: 0000000000001000 RDI: 0000000020ffc000
[   20.157002] RBP: 00007ffe28711640 R08: 0000000000000000 R09: 00007ffe28711640
[   20.157369] R10: 0000000000000003 R11: 0000000000000206 R12: 00007ffe287117d8
[   20.157737] R13: 00000000004027cc R14: 0000000000404e08 R15: 00007eff93844000
[   20.158122]  </TASK>
[   20.158245] irq event stamp: 2019
[   20.158426] hardirqs last  enabled at (2027): [<ffffffff814629e4>] console_unlock+0x224/0x240
[   20.158869] hardirqs last disabled at (2034): [<ffffffff814629c9>] console_unlock+0x209/0x240
[   20.159306] softirqs last  enabled at (1954): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120
[   20.159797] softirqs last disabled at (2051): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120
[   20.160233] ---[ end trace 0000000000000000 ]---
"

I hope it's helpful.

Thanks!

---

If you don't need the following environment to reproduce the problem, or if
you already have a reproduction environment, please ignore the information
below.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@localhost

After logging in to the vm (virtual machine) successfully, you can transfer
the reproducer binary to the vm as below, and reproduce the problem in the vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           // x should be equal to or less than the number of CPUs your PC has

Point start3.sh at the resulting bzImage file to load the target kernel in the vm.

Tips:
If you already have qemu-system-x86_64, please ignore the info below.
To install qemu v7.1.0:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install

Best Regards,
Thanks!


On 2024-08-30 at 00:00:55 -0400, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> 
> Prior to call_mmap(), the vmas that will be replaced need to clear the
> way for what may happen in the call_mmap().  This clean up work includes
> clearing the ptes and calling the close() vm_ops.  Some users do more
> setup than can be restored by calling the vm_ops open() function.  It is
> safer to store the gap in the vma tree in these cases.
> 
> That is to say that the failure scenario that existed before the
> MAP_FIXED gap exposure is restored as it is safer than trying to undo a
> partial mapping.
> 
> Since abort_munmap_vmas() is only reattaching vmas with this change, the
> function is renamed to reattach_vmas().
> 
> There is also a secondary failure that may occur if there is not enough
> memory to store the gap.  In this case, the vmas are reattached and
> resources freed.  If the system cannot complete the call_mmap() and
> fails to allocate with GFP_KERNEL, then the system will print a warning
> about the failure.
> 
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  mm/mmap.c |  3 +--
>  mm/vma.c  |  4 +--
>  mm/vma.h  | 80 ++++++++++++++++++++++++++++++++++++++++---------------
>  3 files changed, 61 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 405e0432c78e..e1e5c78b6c3c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1623,8 +1623,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  		vm_unacct_memory(charged);
>  
>  abort_munmap:
> -	if (vms.nr_pages)
> -		abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
> +	vms_abort_munmap_vmas(&vms, &mas_detach);
>  gather_failed:
>  	validate_mm(mm);
>  	return error;
> diff --git a/mm/vma.c b/mm/vma.c
> index 648c58da8ad4..d2d71d659d1e 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -878,7 +878,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>  munmap_gather_failed:
>  end_split_failed:
>  modify_vma_failed:
> -	abort_munmap_vmas(mas_detach, /* closed = */ false);
> +	reattach_vmas(mas_detach);
>  start_split_failed:
>  map_count_exceeded:
>  	return error;
> @@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	return 0;
>  
>  clear_tree_failed:
> -	abort_munmap_vmas(&mas_detach, /* closed = */ false);
> +	reattach_vmas(&mas_detach);
>  gather_failed:
>  	validate_mm(mm);
>  	return error;
> diff --git a/mm/vma.h b/mm/vma.h
> index 64b44f5a0a11..b2306d13d456 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	       unsigned long start, unsigned long end, pgoff_t pgoff);
>  
> +static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> +			struct vm_area_struct *vma, gfp_t gfp)
> +
> +{
> +	if (vmi->mas.status != ma_start &&
> +	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> +		vma_iter_invalidate(vmi);
> +
> +	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> +	mas_store_gfp(&vmi->mas, vma, gfp);
> +	if (unlikely(mas_is_err(&vmi->mas)))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
>  #ifdef CONFIG_MMU
>  /*
>   * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> @@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
>  		struct ma_state *mas_detach, bool mm_wr_locked);
>  
>  /*
> - * abort_munmap_vmas - Undo any munmap work and free resources
> + * reattach_vmas() - Undo any munmap work and free resources
> + * @mas_detach: The maple state with the detached maple tree
>   *
>   * Reattach any detached vmas and free up the maple tree used to track the vmas.
>   */
> -static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> +static inline void reattach_vmas(struct ma_state *mas_detach)
>  {
>  	struct vm_area_struct *vma;
>  
>  	mas_set(mas_detach, 0);
> -	mas_for_each(mas_detach, vma, ULONG_MAX) {
> +	mas_for_each(mas_detach, vma, ULONG_MAX)
>  		vma_mark_detached(vma, false);
> -		if (closed && vma->vm_ops && vma->vm_ops->open)
> -			vma->vm_ops->open(vma);
> -	}
>  
>  	__mt_destroy(mas_detach->tree);
>  }
>  
> +/*
> + * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
> + * operation.
> + * @vms: The vma unmap structure
> + * @mas_detach: The maple state with the detached maple tree
> + *
> + * Reattach any detached vmas, free up the maple tree used to track the vmas.
> + * If that's not possible because the ptes are cleared (and vm_ops->closed() may
> + * have been called), then a NULL is written over the vmas and the vmas are
> + * removed (munmap() completed).
> + */
> +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> +		struct ma_state *mas_detach)
> +{
> +	struct ma_state *mas = &vms->vmi->mas;
> +	if (!vms->nr_pages)
> +		return;
> +
> +	if (vms->clear_ptes)
> +		return reattach_vmas(mas_detach);
> +
> +	/*
> +	 * Aborting cannot just call the vm_ops open() because they are often
> +	 * not symmetrical and state data has been lost.  Resort to the old
> +	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
> +	 */
> +	mas_set_range(mas, vms->start, vms->end);
> +	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
> +		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
> +			     current->comm, current->pid);
> +		/* Leaving vmas detached and in-tree may hamper recovery */
> +		reattach_vmas(mas_detach);
> +	} else {
> +		/* Clean up the insertion of the unfortunate gap */
> +		vms_complete_munmap_vmas(vms, mas_detach);
> +	}
> +}
> +
>  int
>  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		    struct mm_struct *mm, unsigned long start,
> @@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
>  	return mas_prev(&vmi->mas, min);
>  }
>  
> -static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> -			struct vm_area_struct *vma, gfp_t gfp)
> -{
> -	if (vmi->mas.status != ma_start &&
> -	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> -		vma_iter_invalidate(vmi);
> -
> -	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> -	mas_store_gfp(&vmi->mas, vma, gfp);
> -	if (unlikely(mas_is_err(&vmi->mas)))
> -		return -ENOMEM;
> -
> -	return 0;
> -}
> -
> -
>  /*
>   * These three helpers classifies VMAs for virtual memory accounting.
>   */
> -- 
> 2.43.0
>
Lorenzo Stoakes Sept. 3, 2024, 11 a.m. UTC | #2
On Tue, Sep 03, 2024 at 11:07:38AM GMT, Pengfei Xu wrote:
> Hi Liam R. Howlett,
>
> Greetings!
>
> There is WARNING in __split_vma in next-20240902 in syzkaller fuzzing test.
> Bisected and found first bad commit:
> "
> 3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure
> "
> It's same as below patch.
> After reverted the above commit on top of next-20240902, this issue was gone.
>
> [...]
>
> "

Thanks for the report! Looking into it.

> [   19.953726] cgroup: Unknown subsys name 'net'
> [   20.045121] cgroup: Unknown subsys name 'rlimit'
> [   20.138332] ------------[ cut here ]------------
> [   20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020
> [   20.139075] Modules linked in:
> [   20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1
> [   20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [   20.140337] RIP: 0010:__split_vma+0x4d7/0x1020
> [   20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18
> [   20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293
> [   20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117
> [   20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006
> [   20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd
> [   20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40
> [   20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20
> [   20.143529] FS:  00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> [   20.144308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0
> [   20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [   20.145665] PKRU: 55555554
> [   20.145809] Call Trace:
> [   20.145940]  <TASK>
> [   20.146056]  ? show_regs+0x6d/0x80
> [   20.146247]  ? __warn+0xf3/0x380
> [   20.146431]  ? report_bug+0x25e/0x4b0
> [   20.146650]  ? __split_vma+0x4d7/0x1020

Have repro'd locally. This is, unsurprisingly, on this line (even if trace above
doesn't decode to it unfortunately):

	vma_iter_config(vmi, new->vm_start, new->vm_end);

The VMA in question spans 0x20ff9000, 0x21000000 so is 7 pages in size.

At the point of invoking vma_iter_config(), the vma iterator points at
0x20ff9001, but we try to position it to 0x20ff9000.
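For reference, a quick sketch of the address arithmetic quoted above (assuming 4 KiB pages; `vm_start`, `vm_end`, and `mas_index` are simply the values from the report):

```python
PAGE_SIZE = 4096  # assuming 4 KiB pages, the x86-64 default

vm_start = 0x20ff9000
vm_end = 0x21000000  # exclusive end, per the kernel's [vm_start, vm_end) convention

# 0x21000000 - 0x20ff9000 = 0x7000 bytes = 7 pages, matching the report.
print((vm_end - vm_start) // PAGE_SIZE)  # 7

# The vma iterator index quoted above is one byte past vm_start, which is
# why repositioning it back to vm_start trips the consistency warning.
mas_index = 0x20ff9001
print(mas_index - vm_start)  # 1
```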

It seems the issue is that in do_vmi_munmap(), after vma_find() is called, we
find a VMA at 0x20ff9000, but the VMI is positioned to 0x20ff9001...!

Perhaps maple tree corruption in a previous call somehow?


Interestingly, I can only repro this if I clear the qemu image each time; I'm
guessing this is somehow tied to the instantiation of the cgroup setup or such?

Am continuing the investigation.


> [...]
Lorenzo Stoakes Sept. 3, 2024, 12:27 p.m. UTC | #3
Hi Andrew - TL;DR of this is - please apply the fix patch attached below to
fix a problem in this series, thanks! :)

On Tue, Sep 03, 2024 at 12:00:04PM GMT, Lorenzo Stoakes wrote:
> On Tue, Sep 03, 2024 at 11:07:38AM GMT, Pengfei Xu wrote:
> > Hi Liam R. Howlett,
> >
> > Greetings!
> >
> > There is WARNING in __split_vma in next-20240902 in syzkaller fuzzing test.
> > Bisected and found first bad commit:
> > "
> > 3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure
> > "
> > It's same as below patch.
> > After reverted the above commit on top of next-20240902, this issue was gone.
> >
> > All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma
> > Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c
> > Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog
> > Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report
> > Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin
> > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log
> > bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz
> > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log
> >
> > And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the
> > same commit, all detailed info:
> > https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect
> >
> > "
>
> Thanks for the report! Looking into it.
>
> > [   19.953726] cgroup: Unknown subsys name 'net'
> > [   20.045121] cgroup: Unknown subsys name 'rlimit'
> > [   20.138332] ------------[ cut here ]------------
> > [   20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020
> > [   20.139075] Modules linked in:
> > [   20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1
> > [   20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > [   20.140337] RIP: 0010:__split_vma+0x4d7/0x1020
> > [   20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18
> > [   20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293
> > [   20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117
> > [   20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006
> > [   20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd
> > [   20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40
> > [   20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20
> > [   20.143529] FS:  00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> > [   20.144308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0
> > [   20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [   20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> > [   20.145665] PKRU: 55555554
> > [   20.145809] Call Trace:
> > [   20.145940]  <TASK>
> > [   20.146056]  ? show_regs+0x6d/0x80
> > [   20.146247]  ? __warn+0xf3/0x380
> > [   20.146431]  ? report_bug+0x25e/0x4b0
> > [   20.146650]  ? __split_vma+0x4d7/0x1020
>
> Have repro'd locally. This is, unsurprisingly, on this line (even if trace above
> doesn't decode to it unfortunately):
>
> 	vma_iter_config(vmi, new->vm_start, new->vm_end);
>
> The VMA in question spans 0x20ff9000, 0x21000000 so is 7 pages in size.
>
> At the point of invoking vma_iter_config(), the vma iterator points at
> 0x20ff9001, but we try to position it to 0x20ff9000.
>
> It seems the issue is that in do_vmi_munmap(), after vma_find() is called, we
> find a VMA at 0x20ff9000, but the VMI is positioned to 0x20ff9001...!
>
> Perhaps maple tree corruption in a previous call somehow?
>
>
> I can interestingly only repro this if I clear the qemu image each time, I'm
> guessing this is somehow tied to the instantiation of the cgroup setup or such?
>
> Am continuing the investigation.
>

[snip]

OK I turned on CONFIG_DEBUG_VM_MAPLE_TREE and am hitting
VM_WARN_ON_ONCE_MM(vma->vm_start != vmi_start, mm) after gather_failed is hit in
mmap_region() as a result of call_mmap() returning an error.

This is invoking kernfs_fop_mmap(), which returns -ENODEV because the
KERNFS_HAS_MMAP flag has not been set for the cgroup file being mapped.

This results in mmap_region() jumping to unmap_and_free_vma, which unmaps the
page tables in the region and goes on to abort the unmap operation.

The validate_mm() that fails is called by vms_complete_munmap_vmas() which was
invoked by vms_abort_munmap_vmas().

The tree is then corrupted:

vma ffff888013414d20 start 0000000020ff9000 end 0000000021000000 mm ffff88800d06ae40
prot 25 anon_vma ffff8880132cc660 vm_ops 0000000000000000
pgoff 20ff9 file 0000000000000000 private_data 0000000000000000
flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty)
tree range: ffff888013414d20 start 20ff9001 end 20ffffff

Incorrectly starting at off-by-one 0x20ff9001 rather than 0x20ff9000. This
is a very telling off-by... the programmer's favourite off-by-1 :) Which
then made me think of how mas operations have an _inclusive_ end and VMA
ones have an _exclusive_ one.

And so I tracked down the cause of this to vms_abort_munmap_vmas() which
was invoking mas_set_range() using vms->end (exclusive) as if it were
inclusive, which thus resulted in 0x20ff9000 being wrongly cleared.

The solution is simply to subtract one from it, as done in the attached
fix-patch.

I confirmed this fixed the issue as I was able to set up a reliable repro
locally.

Thanks for the report! Great find.

----8<----
From 3e7decc5390b0edc462afa74794a8208e25e50f2 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Tue, 3 Sep 2024 13:20:34 +0100
Subject: [PATCH] mm: fix off-by-one error in vms_abort_munmap_vmas()

Maple tree ranges have an inclusive end, VMAs do not, so we must subtract
one from the VMA-specific end value when using a mas_...() function.

We failed to do so in vms_abort_munmap_vmas() which resulted in a store
overlapping the intended range by one byte, and thus corrupting the maple
tree.

Fix this by subtracting one from the vms->end value passed into mas_set_range().

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vma.h b/mm/vma.h
index 370d3246f147..819f994cf727 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -240,7 +240,7 @@ static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
 	 * not symmetrical and state data has been lost.  Resort to the old
 	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
 	 */
-	mas_set_range(mas, vms->start, vms->end);
+	mas_set_range(mas, vms->start, vms->end - 1);
 	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
 		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
 			     current->comm, current->pid);
--
2.46.0
Liam R. Howlett Sept. 3, 2024, 4:03 p.m. UTC | #4
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240903 08:27]:
> Hi Andrew - TL;DR of this is - please apply the fix patch attached below to
> fix a problem in this series, thanks! :)

Oh yes.  I should have caught this, thanks Lorenzo.

Cheers,
Liam


> ----8<----
> From 3e7decc5390b0edc462afa74794a8208e25e50f2 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Tue, 3 Sep 2024 13:20:34 +0100
> Subject: [PATCH] mm: fix off-by-one error in vms_abort_munmap_vmas()
> 
> Maple tree ranges have an inclusive end, VMAs do not, so we must subtract
> one from the VMA-specific end value when using a mas_...() function.
> 
> We failed to do so in vms_abort_munmap_vmas() which resulted in a store
> overlapping the intended range by one byte, and thus corrupting the maple
> tree.
> 
> Fix this by subtracting one from the vms->end value passed into mas_set_range().
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  mm/vma.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vma.h b/mm/vma.h
> index 370d3246f147..819f994cf727 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -240,7 +240,7 @@ static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
>  	 * not symmetrical and state data has been lost.  Resort to the old
>  	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
>  	 */
> -	mas_set_range(mas, vms->start, vms->end);
> +	mas_set_range(mas, vms->start, vms->end - 1);
>  	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
>  		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
>  			     current->comm, current->pid);
> --
> 2.46.0
Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index 405e0432c78e..e1e5c78b6c3c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1623,8 +1623,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_unacct_memory(charged);
 
 abort_munmap:
-	if (vms.nr_pages)
-		abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
+	vms_abort_munmap_vmas(&vms, &mas_detach);
 gather_failed:
 	validate_mm(mm);
 	return error;
diff --git a/mm/vma.c b/mm/vma.c
index 648c58da8ad4..d2d71d659d1e 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -878,7 +878,7 @@  int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 munmap_gather_failed:
 end_split_failed:
 modify_vma_failed:
-	abort_munmap_vmas(mas_detach, /* closed = */ false);
+	reattach_vmas(mas_detach);
 start_split_failed:
 map_count_exceeded:
 	return error;
@@ -923,7 +923,7 @@  int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return 0;
 
 clear_tree_failed:
-	abort_munmap_vmas(&mas_detach, /* closed = */ false);
+	reattach_vmas(&mas_detach);
 gather_failed:
 	validate_mm(mm);
 	return error;
diff --git a/mm/vma.h b/mm/vma.h
index 64b44f5a0a11..b2306d13d456 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -82,6 +82,22 @@  int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	       unsigned long start, unsigned long end, pgoff_t pgoff);
 
+static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
+			struct vm_area_struct *vma, gfp_t gfp)
+
+{
+	if (vmi->mas.status != ma_start &&
+	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
+		vma_iter_invalidate(vmi);
+
+	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
+	mas_store_gfp(&vmi->mas, vma, gfp);
+	if (unlikely(mas_is_err(&vmi->mas)))
+		return -ENOMEM;
+
+	return 0;
+}
+
 #ifdef CONFIG_MMU
 /*
  * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
@@ -129,24 +145,60 @@  void vms_clean_up_area(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach, bool mm_wr_locked);
 
 /*
- * abort_munmap_vmas - Undo any munmap work and free resources
+ * reattach_vmas() - Undo any munmap work and free resources
+ * @mas_detach: The maple state with the detached maple tree
  *
  * Reattach any detached vmas and free up the maple tree used to track the vmas.
  */
-static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
+static inline void reattach_vmas(struct ma_state *mas_detach)
 {
 	struct vm_area_struct *vma;
 
 	mas_set(mas_detach, 0);
-	mas_for_each(mas_detach, vma, ULONG_MAX) {
+	mas_for_each(mas_detach, vma, ULONG_MAX)
 		vma_mark_detached(vma, false);
-		if (closed && vma->vm_ops && vma->vm_ops->open)
-			vma->vm_ops->open(vma);
-	}
 
 	__mt_destroy(mas_detach->tree);
 }
 
+/*
+ * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
+ * operation.
+ * @vms: The vma unmap structure
+ * @mas_detach: The maple state with the detached maple tree
+ *
+ * Reattach any detached vmas, free up the maple tree used to track the vmas.
+ * If that's not possible because the ptes are cleared (and vm_ops->closed() may
+ * have been called), then a NULL is written over the vmas and the vmas are
+ * removed (munmap() completed).
+ */
+static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
+{
+	struct ma_state *mas = &vms->vmi->mas;
+	if (!vms->nr_pages)
+		return;
+
+	if (vms->clear_ptes)
+		return reattach_vmas(mas_detach);
+
+	/*
+	 * Aborting cannot just call the vm_ops open() because they are often
+	 * not symmetrical and state data has been lost.  Resort to the old
+	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
+	 */
+	mas_set_range(mas, vms->start, vms->end);
+	if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
+		pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
+			     current->comm, current->pid);
+		/* Leaving vmas detached and in-tree may hamper recovery */
+		reattach_vmas(mas_detach);
+	} else {
+		/* Clean up the insertion of the unfortunate gap */
+		vms_complete_munmap_vmas(vms, mas_detach);
+	}
+}
+
 int
 do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
@@ -299,22 +351,6 @@  static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
 	return mas_prev(&vmi->mas, min);
 }
 
-static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
-			struct vm_area_struct *vma, gfp_t gfp)
-{
-	if (vmi->mas.status != ma_start &&
-	    ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
-		vma_iter_invalidate(vmi);
-
-	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
-	mas_store_gfp(&vmi->mas, vma, gfp);
-	if (unlikely(mas_is_err(&vmi->mas)))
-		return -ENOMEM;
-
-	return 0;
-}
-
-
 /*
  * These three helpers classifies VMAs for virtual memory accounting.
  */