Message ID | 20230418210230.3495922-1-longman@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/mmap: Map MAP_STACK to VM_STACK | expand |
On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote: > One of the flags of mmap(2) is MAP_STACK to request a memory segment > suitable for a process or thread stack. The kernel currently ignores > this flags. Glibc uses MAP_STACK when mmapping a thread stack. However, > selinux has an execstack check in selinux_file_mprotect() which disallows > a stack VMA to be made executable. > > Since MAP_STACK is a noop, it is possible for a stack VMA to be merged > with an adjacent anonymous VMA. With that merging, using mprotect(2) > to change a part of the merged anonymous VMA to make it executable may > fail. This can lead to sporadic failure of applications that need to > make those changes. "Sporadic failure of applications" sounds quite serious. Can you provide more details? Did you consider a -stable backport? I'm unable to judge, because the description of the userspace effects is so thin, > One possible fix is to make sure that a stack VMA will not be merged > with a non-stack anonymous VMA. That requires a vm flag that can be > used to distinguish a stack VMA from a regular anonymous VMA. One > can add a new dummy vm flag for that purpose. However, there is only > 1 bit left in the lower 32 bits of vm_flags. Another alternative is > to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches) > can certainly be used for this purpose. The downside is that it is a > slight change in existing behavior. > > Making a stack VMA growable by default certainly fits the need of a > process or thread stack. This patch now maps MAP_STACK to VM_STACK to > prevent unwanted merging with adjacent non-stack VMAs and make the VMA > more suitable for being used as a stack. > > ... > > --- a/include/linux/mman.h > +++ b/include/linux/mman.h > @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags) > return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | > _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | > _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | > + _calc_vm_trans(flags, MAP_STACK, VM_STACK ) | > arch_calc_vm_flag_bits(flags); > } The mmap(2) manpage says This flag is currently a no-op on Linux. However, by employing this flag, applications can ensure that they transparently ob- tain support if the flag is implemented in the future. Thus, it is used in the glibc threading implementation to allow for the fact that some architectures may (later) require special treat- ment for stack allocations. A further reason to employ this flag is portability: MAP_STACK exists (and has an effect) on some other systems (e.g., some of the BSDs). so please propose an update for this?
On 4/18/23 17:18, Andrew Morton wrote: > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote: > >> One of the flags of mmap(2) is MAP_STACK to request a memory segment >> suitable for a process or thread stack. The kernel currently ignores >> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However, >> selinux has an execstack check in selinux_file_mprotect() which disallows >> a stack VMA to be made executable. >> >> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged >> with an adjacent anonymous VMA. With that merging, using mprotect(2) >> to change a part of the merged anonymous VMA to make it executable may >> fail. This can lead to sporadic failure of applications that need to >> make those changes. > "Sporadic failure of applications" sounds quite serious. Can you > provide more details? The problem boils down to the fact that it is possible for user code to mmap a region of memory and then for the kernel to merge the VMA for that memory with the VMA for one of the application's thread stacks. This is causing random SEGVs with one of our large customer application. At a high level, this is what's happening: 1) App runs creating lots of threads. 2) It mmap's 256K pages of anonymous memory. 3) It writes executable code to that memory. 4) It calls mprotect() with PROT_EXEC on that memory so it can subsequently execute the code. The above mprotect() will fail if the mmap'd region's VMA gets merged with the VMA for one of the thread stacks. That's because the default RHEL SELinux policy is to not allow executable stacks. > > Did you consider a -stable backport? I'm unable to judge, because the > description of the userspace effects is so thin, Yes, stable backport can be considered. > >> One possible fix is to make sure that a stack VMA will not be merged >> with a non-stack anonymous VMA. That requires a vm flag that can be >> used to distinguish a stack VMA from a regular anonymous VMA. One >> can add a new dummy vm flag for that purpose. However, there is only >> 1 bit left in the lower 32 bits of vm_flags. Another alternative is >> to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches) >> can certainly be used for this purpose. The downside is that it is a >> slight change in existing behavior. >> >> Making a stack VMA growable by default certainly fits the need of a >> process or thread stack. This patch now maps MAP_STACK to VM_STACK to >> prevent unwanted merging with adjacent non-stack VMAs and make the VMA >> more suitable for being used as a stack. >> >> ... >> >> --- a/include/linux/mman.h >> +++ b/include/linux/mman.h >> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags) >> return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | >> _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | >> _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | >> + _calc_vm_trans(flags, MAP_STACK, VM_STACK ) | >> arch_calc_vm_flag_bits(flags); >> } > The mmap(2) manpage says > > This flag is currently a no-op on Linux. However, by employing > this flag, applications can ensure that they transparently ob- tain > support if the flag is implemented in the future. Thus, it is used > in the glibc threading implementation to allow for the fact that some > architectures may (later) require special treat- ment for stack > allocations. A further reason to employ this flag is portability: > MAP_STACK exists (and has an effect) on some other systems (e.g., > some of the BSDs). > > so please propose an update for this? OK, will do. Thanks, Longman
On Tue, 18 Apr 2023, Waiman Long wrote: > On 4/18/23 17:18, Andrew Morton wrote: > > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote: > > > >> One of the flags of mmap(2) is MAP_STACK to request a memory segment > >> suitable for a process or thread stack. The kernel currently ignores > >> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However, > >> selinux has an execstack check in selinux_file_mprotect() which disallows > >> a stack VMA to be made executable. > >> > >> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged > >> with an adjacent anonymous VMA. With that merging, using mprotect(2) > >> to change a part of the merged anonymous VMA to make it executable may > >> fail. This can lead to sporadic failure of applications that need to > >> make those changes. > > "Sporadic failure of applications" sounds quite serious. Can you > > provide more details? > > The problem boils down to the fact that it is possible for user code to mmap a > region of memory and then for the kernel to merge the VMA for that memory with > the VMA for one of the application's thread stacks. This is causing random > SEGVs with one of our large customer application. > > At a high level, this is what's happening: > > 1) App runs creating lots of threads. > 2) It mmap's 256K pages of anonymous memory. > 3) It writes executable code to that memory. > 4) It calls mprotect() with PROT_EXEC on that memory so > it can subsequently execute the code. > > The above mprotect() will fail if the mmap'd region's VMA gets merged with the > VMA for one of the thread stacks. That's because the default RHEL SELinux > policy is to not allow executable stacks. Then wouldn't the bug be at the SELinux end? VMAs may have been merged already, but the mprotect() with PROT_EXEC of the good non-stack range will then split that area off from the stack again - maybe the SELinux check does not understand that must happen? Hugh
On 4/18/23 21:36, Hugh Dickins wrote: > On Tue, 18 Apr 2023, Waiman Long wrote: >> On 4/18/23 17:18, Andrew Morton wrote: >>> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote: >>> >>>> One of the flags of mmap(2) is MAP_STACK to request a memory segment >>>> suitable for a process or thread stack. The kernel currently ignores >>>> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However, >>>> selinux has an execstack check in selinux_file_mprotect() which disallows >>>> a stack VMA to be made executable. >>>> >>>> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged >>>> with an adjacent anonymous VMA. With that merging, using mprotect(2) >>>> to change a part of the merged anonymous VMA to make it executable may >>>> fail. This can lead to sporadic failure of applications that need to >>>> make those changes. >>> "Sporadic failure of applications" sounds quite serious. Can you >>> provide more details? >> The problem boils down to the fact that it is possible for user code to mmap a >> region of memory and then for the kernel to merge the VMA for that memory with >> the VMA for one of the application's thread stacks. This is causing random >> SEGVs with one of our large customer application. >> >> At a high level, this is what's happening: >> >> 1) App runs creating lots of threads. >> 2) It mmap's 256K pages of anonymous memory. >> 3) It writes executable code to that memory. >> 4) It calls mprotect() with PROT_EXEC on that memory so >> it can subsequently execute the code. >> >> The above mprotect() will fail if the mmap'd region's VMA gets merged with the >> VMA for one of the thread stacks. That's because the default RHEL SELinux >> policy is to not allow executable stacks. > Then wouldn't the bug be at the SELinux end? VMAs may have been merged > already, but the mprotect() with PROT_EXEC of the good non-stack range > will then split that area off from the stack again - maybe the SELinux > check does not understand that must happen? The SELinux check is done per VMA, not a region within a VMA. After VMA merging, SELinux is probably not able to determine which part of a VMA is a stack unless we keep that information somewhere and provide an API for SELinux to query. That can be quite a lot of work. So the easiest way to prevent this problem is to avoid merging a stack VMA with a regular anonymous VMA. Cheers, Longman
On Tue, Apr 18, 2023 at 09:45:34PM -0400, Waiman Long wrote: > > On 4/18/23 21:36, Hugh Dickins wrote: > > On Tue, 18 Apr 2023, Waiman Long wrote: > > > On 4/18/23 17:18, Andrew Morton wrote: > > > > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote: > > > > > > > > > One of the flags of mmap(2) is MAP_STACK to request a memory segment > > > > > suitable for a process or thread stack. The kernel currently ignores > > > > > this flags. Glibc uses MAP_STACK when mmapping a thread stack. However, > > > > > selinux has an execstack check in selinux_file_mprotect() which disallows > > > > > a stack VMA to be made executable. > > > > > > > > > > Since MAP_STACK is a noop, it is possible for a stack VMA to be merged > > > > > with an adjacent anonymous VMA. With that merging, using mprotect(2) > > > > > to change a part of the merged anonymous VMA to make it executable may > > > > > fail. This can lead to sporadic failure of applications that need to > > > > > make those changes. > > > > "Sporadic failure of applications" sounds quite serious. Can you > > > > provide more details? > > > The problem boils down to the fact that it is possible for user code to mmap a > > > region of memory and then for the kernel to merge the VMA for that memory with > > > the VMA for one of the application's thread stacks. This is causing random > > > SEGVs with one of our large customer application. > > > > > > At a high level, this is what's happening: > > > > > > 1) App runs creating lots of threads. > > > 2) It mmap's 256K pages of anonymous memory. > > > 3) It writes executable code to that memory. > > > 4) It calls mprotect() with PROT_EXEC on that memory so > > > it can subsequently execute the code. > > > > > > The above mprotect() will fail if the mmap'd region's VMA gets merged with the > > > VMA for one of the thread stacks. That's because the default RHEL SELinux > > > policy is to not allow executable stacks. > > Then wouldn't the bug be at the SELinux end? VMAs may have been merged > > already, but the mprotect() with PROT_EXEC of the good non-stack range > > will then split that area off from the stack again - maybe the SELinux > > check does not understand that must happen? > > The SELinux check is done per VMA, not a region within a VMA. After VMA > merging, SELinux is probably not able to determine which part of a VMA is a > stack unless we keep that information somewhere and provide an API for > SELinux to query. That can be quite a lot of work. So the easiest way to > prevent this problem is to avoid merging a stack VMA with a regular > anonymous VMA. To paraphrase you, "Yes, SELinux is buggy, but we don't want to fix it". Cc'ing the SELinux people so it can be fixed properly.
On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote: > 1) App runs creating lots of threads. > 2) It mmap's 256K pages of anonymous memory. > 3) It writes executable code to that memory. > 4) It calls mprotect() with PROT_EXEC on that memory so > it can subsequently execute the code. > > The above mprotect() will fail if the mmap'd region's VMA gets merged with > the VMA for one of the thread stacks. That's because the default RHEL > SELinux policy is to not allow executable stacks. By the way, this is a daft policy. The policy you really want is EXEC|WRITE is not allowed. A non-writable stack is useless, so it's actually a superset of your current policy. Forbidding _simultaneous_ write and executable is just good programming. This way, you don't need to care about the underlying VMA's current permissions, you just need to do: if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE)) return -EACCESS;
On Tue, Apr 18, 2023 at 11:24 PM Matthew Wilcox <willy@infradead.org> wrote: > On Tue, Apr 18, 2023 at 09:45:34PM -0400, Waiman Long wrote: > > On 4/18/23 21:36, Hugh Dickins wrote: > > > On Tue, 18 Apr 2023, Waiman Long wrote: > > > > On 4/18/23 17:18, Andrew Morton wrote: > > > > > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote: > > > > > > > > > > > One of the flags of mmap(2) is MAP_STACK to request a memory segment > > > > > > suitable for a process or thread stack. The kernel currently ignores > > > > > > this flags. Glibc uses MAP_STACK when mmapping a thread stack. However, > > > > > > selinux has an execstack check in selinux_file_mprotect() which disallows > > > > > > a stack VMA to be made executable. > > > > > > > > > > > > Since MAP_STACK is a noop, it is possible for a stack VMA to be merged > > > > > > with an adjacent anonymous VMA. With that merging, using mprotect(2) > > > > > > to change a part of the merged anonymous VMA to make it executable may > > > > > > fail. This can lead to sporadic failure of applications that need to > > > > > > make those changes. > > > > > "Sporadic failure of applications" sounds quite serious. Can you > > > > > provide more details? > > > > The problem boils down to the fact that it is possible for user code to mmap a > > > > region of memory and then for the kernel to merge the VMA for that memory with > > > > the VMA for one of the application's thread stacks. This is causing random > > > > SEGVs with one of our large customer application. > > > > > > > > At a high level, this is what's happening: > > > > > > > > 1) App runs creating lots of threads. > > > > 2) It mmap's 256K pages of anonymous memory. > > > > 3) It writes executable code to that memory. > > > > 4) It calls mprotect() with PROT_EXEC on that memory so > > > > it can subsequently execute the code. > > > > > > > > The above mprotect() will fail if the mmap'd region's VMA gets merged with the > > > > VMA for one of the thread stacks. That's because the default RHEL SELinux > > > > policy is to not allow executable stacks. > > > Then wouldn't the bug be at the SELinux end? VMAs may have been merged > > > already, but the mprotect() with PROT_EXEC of the good non-stack range > > > will then split that area off from the stack again - maybe the SELinux > > > check does not understand that must happen? > > > > The SELinux check is done per VMA, not a region within a VMA. After VMA > > merging, SELinux is probably not able to determine which part of a VMA is a > > stack unless we keep that information somewhere and provide an API for > > SELinux to query. That can be quite a lot of work. So the easiest way to > > prevent this problem is to avoid merging a stack VMA with a regular > > anonymous VMA. > > To paraphrase you, "Yes, SELinux is buggy, but we don't want to fix it". > > Cc'ing the SELinux people so it can be fixed properly. SELinux needs some way to determine what memory region is currently being used by an application's stacks. The current logic can be found in selinux_file_mprotect(), the relevant snippet is below: int selinux_file_mprotect(struct vm_area_struct *vma, ...) { ... } else if (!vma->vm_file && ((vma->vm_start <= vma->vm_mm->start_stack && vma->vm_end >= vma->vm_mm->start_stack) || vma_is_stack_for_current(vma))) { rc = avc_has_perm(&selinux_state, sid, sid, SECCLASS_PROCESS, PROCESS__EXECSTACK, NULL); } ... } If someone has a better, more foolproof way to determine an application's stack please let us know, or better yet submit a patch :)
On 4/18/23 23:46, Matthew Wilcox wrote: > On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote: >> 1) App runs creating lots of threads. >> 2) It mmap's 256K pages of anonymous memory. >> 3) It writes executable code to that memory. >> 4) It calls mprotect() with PROT_EXEC on that memory so >> it can subsequently execute the code. >> >> The above mprotect() will fail if the mmap'd region's VMA gets merged with >> the VMA for one of the thread stacks. That's because the default RHEL >> SELinux policy is to not allow executable stacks. > By the way, this is a daft policy. The policy you really want is > EXEC|WRITE is not allowed. A non-writable stack is useless, so it's > actually a superset of your current policy. Forbidding _simultaneous_ > write and executable is just good programming. This way, you don't need > to care about the underlying VMA's current permissions, you just need > to do: > > if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE)) > return -EACCESS; I am not totally sure if the application changes the VMA to read-only first. Even if it does that, it highlights another possible issue when an anonymous VMA is merged with a stack VMA. Either the mprotect() to write-protect the VMA will fail or the application will segfault if it writes stuff to the stack. This particular issue is not related to SELinux. It provides another good idea why we should avoid merging stack VMA to anonymous VMA. Cheers, Longman
On Wed, Apr 19, 2023 at 11:07:04AM -0400, Waiman Long wrote: > On 4/18/23 23:46, Matthew Wilcox wrote: > > On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote: > > > 1) App runs creating lots of threads. > > > 2) It mmap's 256K pages of anonymous memory. > > > 3) It writes executable code to that memory. > > > 4) It calls mprotect() with PROT_EXEC on that memory so > > > it can subsequently execute the code. > > > > > > The above mprotect() will fail if the mmap'd region's VMA gets merged with > > > the VMA for one of the thread stacks. That's because the default RHEL > > > SELinux policy is to not allow executable stacks. > > By the way, this is a daft policy. The policy you really want is > > EXEC|WRITE is not allowed. A non-writable stack is useless, so it's > > actually a superset of your current policy. Forbidding _simultaneous_ > > write and executable is just good programming. This way, you don't need > > to care about the underlying VMA's current permissions, you just need > > to do: > > > > if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE)) > > return -EACCESS; > > I am not totally sure if the application changes the VMA to read-only first. > Even if it does that, it highlights another possible issue when an anonymous > VMA is merged with a stack VMA. Either the mprotect() to write-protect the > VMA will fail or the application will segfault if it writes stuff to the > stack. This particular issue is not related to SELinux. It provides another > good idea why we should avoid merging stack VMA to anonymous VMA. mprotect will split the VMA into two VMAs, one that is PROT_READ|PROT_WRITE and one the is PROT_READ|PROT_EXEC.
On 4/19/23 11:09 AM, Matthew Wilcox wrote: > On Wed, Apr 19, 2023 at 11:07:04AM -0400, Waiman Long wrote: >> On 4/18/23 23:46, Matthew Wilcox wrote: >>> On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote: >>>> 1) App runs creating lots of threads. >>>> 2) It mmap's 256K pages of anonymous memory. >>>> 3) It writes executable code to that memory. >>>> 4) It calls mprotect() with PROT_EXEC on that memory so >>>> it can subsequently execute the code. >>>> >>>> The above mprotect() will fail if the mmap'd region's VMA gets merged with >>>> the VMA for one of the thread stacks. That's because the default RHEL >>>> SELinux policy is to not allow executable stacks. >>> By the way, this is a daft policy. The policy you really want is >>> EXEC|WRITE is not allowed. A non-writable stack is useless, so it's >>> actually a superset of your current policy. Forbidding _simultaneous_ >>> write and executable is just good programming. This way, you don't need >>> to care about the underlying VMA's current permissions, you just need >>> to do: >>> >>> if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE)) >>> return -EACCESS; >> >> I am not totally sure if the application changes the VMA to read-only first. >> Even if it does that, it highlights another possible issue when an anonymous >> VMA is merged with a stack VMA. Either the mprotect() to write-protect the >> VMA will fail or the application will segfault if it writes stuff to the >> stack. This particular issue is not related to SELinux. It provides another >> good idea why we should avoid merging stack VMA to anonymous VMA. > > mprotect will split the VMA into two VMAs, one that is > PROT_READ|PROT_WRITE and one the is PROT_READ|PROT_EXEC. > But in this case, the latter still has PROT_WRITE. This was reported by a large data analytics customer. They started getting infrequent random crashes in code they haven't touched in 10 years. One of the threads in their program mmaps a large region using PROT_READ|PROT_WRITE, and that region just happens to be merged with the thread's stack. Then they copy a small snipit of code to a location somewhere within that mapped region. For the one page that contains that code, they mprotect it to PROT_READ|PROT_WRITE|PROT_EXEC. I recall they're still reading and writing data elsewhere on that page. Joe
On 4/18/2023 2:18 PM, Andrew Morton wrote: > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote: [..] >> ... >> >> --- a/include/linux/mman.h >> +++ b/include/linux/mman.h >> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags) >> return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | >> _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | >> _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | >> + _calc_vm_trans(flags, MAP_STACK, VM_STACK ) | >> arch_calc_vm_flag_bits(flags); >> } > > The mmap(2) manpage says > > This flag is currently a no-op on Linux. However, by employing > this flag, applications can ensure that they transparently ob- tain > support if the flag is implemented in the future. Thus, it is used > in the glibc threading implementation to allow for the fact that some > architectures may (later) require special treat- ment for stack > allocations. A further reason to employ this flag is portability: > MAP_STACK exists (and has an effect) on some other systems (e.g., > some of the BSDs). > > so please propose an update for this? > Just curious, why isn't MAP_STACK implemented in Linux kernel? what does it take to implement it? Also, could there be other potential issue with the vma merge, such as, the user process start to truncate half of the anonymous memory vma range oblivious to the fact that the vma has 'grown' into its stack and it might be attempting to unmap some of its stack range? If the vma merge is otherwise harmless, does it bring benefit other than being one vma less? thanks! -jane
On 4/19/2023 4:21 PM, Jane Chu wrote: > On 4/18/2023 2:18 PM, Andrew Morton wrote: >> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> >> wrote: > [..] >>> ... >>> >>> --- a/include/linux/mman.h >>> +++ b/include/linux/mman.h >>> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags) >>> return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | >>> _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | >>> _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | >>> + _calc_vm_trans(flags, MAP_STACK, VM_STACK ) | >>> arch_calc_vm_flag_bits(flags); >>> } >> >> The mmap(2) manpage says >> >> This flag is currently a no-op on Linux. However, by employing >> this flag, applications can ensure that they transparently ob- tain >> support if the flag is implemented in the future. Thus, it is used >> in the glibc threading implementation to allow for the fact that some >> architectures may (later) require special treat- ment for stack >> allocations. A further reason to employ this flag is portability: >> MAP_STACK exists (and has an effect) on some other systems (e.g., >> some of the BSDs). >> >> so please propose an update for this? >> > > Just curious, why isn't MAP_STACK implemented in Linux kernel? what does > it take to implement it? > > Also, could there be other potential issue with the vma merge, such as, > the user process start to truncate half of the anonymous memory vma > range oblivious to the fact that the vma has 'grown' into its stack and > it might be attempting to unmap some of its stack range? Sorry, not 'oblivious'. how about a malicious user process get an fd via memfd_create() and attempt to truncate more than it mmap'ed? > > If the vma merge is otherwise harmless, does it bring benefit other than > being one vma less? > > thanks! > -jane > >
diff --git a/include/linux/mman.h b/include/linux/mman.h index cee1e4b566d8..4b621d30ace9 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags) return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | + _calc_vm_trans(flags, MAP_STACK, VM_STACK ) | arch_calc_vm_flag_bits(flags); }
One of the flags of mmap(2) is MAP_STACK to request a memory segment suitable for a process or thread stack. The kernel currently ignores this flags. Glibc uses MAP_STACK when mmapping a thread stack. However, selinux has an execstack check in selinux_file_mprotect() which disallows a stack VMA to be made executable. Since MAP_STACK is a noop, it is possible for a stack VMA to be merged with an adjacent anonymous VMA. With that merging, using mprotect(2) to change a part of the merged anonymous VMA to make it executable may fail. This can lead to sporadic failure of applications that need to make those changes. One possible fix is to make sure that a stack VMA will not be merged with a non-stack anonymous VMA. That requires a vm flag that can be used to distinguish a stack VMA from a regular anonymous VMA. One can add a new dummy vm flag for that purpose. However, there is only 1 bit left in the lower 32 bits of vm_flags. Another alternative is to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches) can certainly be used for this purpose. The downside is that it is a slight change in existing behavior. Making a stack VMA growable by default certainly fits the need of a process or thread stack. This patch now maps MAP_STACK to VM_STACK to prevent unwanted merging with adjacent non-stack VMAs and make the VMA more suitable for being used as a stack. Reported-by: Joe Mario <jmario@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> --- include/linux/mman.h | 1 + 1 file changed, 1 insertion(+)