diff mbox series

[RFC] mm: mmap: Allow mmap(MAP_STACK) to map growable stack

Message ID ZuHtb43Ar21ZptNz@p100 (mailing list archive)
State New
Headers show
Series [RFC] mm: mmap: Allow mmap(MAP_STACK) to map growable stack | expand

Commit Message

Helge Deller Sept. 11, 2024, 7:20 p.m. UTC
This is a RFC to change the behaviour of mmap(MAP_STACK) to be
sufficient to map memory for usage as stack on all architectures.
Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
has to be used.
To clarify, here is the relevant info from the mmap() man page:

MAP_GROWSDOWN
   This flag is used for stacks. It indicates to the kernel virtual
   memory system that the mapping should extend downward in memory.  The
   return address is one page lower than the memory area that is
   actually created in the process's virtual address space.  Touching an
   address in the "guard" page below the mapping will cause the mapping
   to grow by a page. This growth can be repeated until the mapping
   grows to within a page of the high end of the next lower mapping,
   at which point touching the "guard" page will result in a SIGSEGV
   signal.

MAP_STACK (since Linux 2.6.27)
   Allocate the mapping at an address suitable for a process or thread
   stack.

   This flag is currently a no-op on Linux. However, by employing this
   flag, applications can ensure that they transparently obtain support
   if the flag is implemented in the future. Thus, it is used in the
   glibc threading implementation to allow for the fact that
   some architectures may (later) require special treatment for
   stack allocations. A further reason to employ this flag is
   portability: MAP_STACK exists (and has an effect) on some
   other systems (e.g., some of the BSDs).

The reason to suggest this change is, that on the parisc architecture the
stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
work. Note that there exists no MAP_GROWSUP flag.
By changing the behaviour of MAP_STACK to mark the memory area with the
VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
architecture) the MAP_STACK flag does exactly what people would expect on
all platforms.

This change should have no negative side-effect, as all code which
used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.

Signed-off-by: Helge Deller <deller@gmx.de>

Comments

Liam R. Howlett Sept. 11, 2024, 7:48 p.m. UTC | #1
* Helge Deller <deller@kernel.org> [240911 15:20]:
> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> sufficient to map memory for usage as stack on all architectures.
> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> has to be used.
> To clarify, here is the relevant info from the mmap() man page:
> 
> MAP_GROWSDOWN
>    This flag is used for stacks. It indicates to the kernel virtual
>    memory system that the mapping should extend downward in memory.  The
>    return address is one page lower than the memory area that is
>    actually created in the process's virtual address space.  Touching an
>    address in the "guard" page below the mapping will cause the mapping
>    to grow by a page. This growth can be repeated until the mapping
>    grows to within a page of the high end of the next lower mapping,
>    at which point touching the "guard" page will result in a SIGSEGV
>    signal.
> 
> MAP_STACK (since Linux 2.6.27)
>    Allocate the mapping at an address suitable for a process or thread
>    stack.
> 
>    This flag is currently a no-op on Linux. However, by employing this
>    flag, applications can ensure that they transparently obtain support
>    if the flag is implemented in the future. Thus, it is used in the
>    glibc threading implementation to allow for the fact that
>    some architectures may (later) require special treatment for
>    stack allocations. A further reason to employ this flag is
>    portability: MAP_STACK exists (and has an effect) on some
>    other systems (e.g., some of the BSDs).
> 
> The reason to suggest this change is, that on the parisc architecture the
> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> work. Note that there exists no MAP_GROWSUP flag.
> By changing the behaviour of MAP_STACK to mark the memory area with the
> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> architecture) the MAP_STACK flag does exactly what people would expect on
> all platforms.
> 
> This change should have no negative side-effect, as all code which
> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> 
> Signed-off-by: Helge Deller <deller@gmx.de>
> 
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index bcb201ab7a41..66bc72a0cb19 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
>  	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>  	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>  	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
> +	       _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |

Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
change the user interface to create a vma that will grow.  I'm not
entirely sure this is okay?


That is mmap(MAP_STACK) would set VM_NOHUGEPAGE right now, with this
change you'd get VM_NOHUGEPAGE | VM_GROWS<something>

>  	       _calc_vm_trans(flags, MAP_STACK,	     VM_NOHUGEPAGE) |
>  	       arch_calc_vm_flag_bits(flags);
>  }
>
Yang Shi Sept. 11, 2024, 10:16 p.m. UTC | #2
On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Helge Deller <deller@kernel.org> [240911 15:20]:
> > This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> > sufficient to map memory for usage as stack on all architectures.
> > Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> > has to be used.
> > To clarify, here is the relevant info from the mmap() man page:
> >
> > MAP_GROWSDOWN
> >    This flag is used for stacks. It indicates to the kernel virtual
> >    memory system that the mapping should extend downward in memory.  The
> >    return address is one page lower than the memory area that is
> >    actually created in the process's virtual address space.  Touching an
> >    address in the "guard" page below the mapping will cause the mapping
> >    to grow by a page. This growth can be repeated until the mapping
> >    grows to within a page of the high end of the next lower mapping,
> >    at which point touching the "guard" page will result in a SIGSEGV
> >    signal.
> >
> > MAP_STACK (since Linux 2.6.27)
> >    Allocate the mapping at an address suitable for a process or thread
> >    stack.
> >
> >    This flag is currently a no-op on Linux. However, by employing this
> >    flag, applications can ensure that they transparently obtain support
> >    if the flag is implemented in the future. Thus, it is used in the
> >    glibc threading implementation to allow for the fact that
> >    some architectures may (later) require special treatment for
> >    stack allocations. A further reason to employ this flag is
> >    portability: MAP_STACK exists (and has an effect) on some
> >    other systems (e.g., some of the BSDs).
> >
> > The reason to suggest this change is, that on the parisc architecture the
> > stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> > work. Note that there exists no MAP_GROWSUP flag.
> > By changing the behaviour of MAP_STACK to mark the memory area with the
> > VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> > architecture) the MAP_STACK flag does exactly what people would expect on
> > all platforms.
> >
> > This change should have no negative side-effect, as all code which
> > used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> >
> > Signed-off-by: Helge Deller <deller@gmx.de>
> >
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index bcb201ab7a41..66bc72a0cb19 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> >       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> >              _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> >              _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
>
> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
> change the user interface to create a vma that will grow.  I'm not
> entirely sure this is okay?

AFAICT, I don't see this is a problem. Currently huge page also skips
the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
__thp_vma_allowable_orders() returns 0 if the vma is a temporary
stack.

>
>
> That is mmap(MAP_STACK) would set VM_NOHUGEPAGE right now, with this
> change you'd get VM_NOHUGEPAGE | VM_GROWS<something>
>
> >              _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> >              arch_calc_vm_flag_bits(flags);
> >  }
> >
>
Liam R. Howlett Sept. 11, 2024, 11:05 p.m. UTC | #3
* Yang Shi <shy828301@gmail.com> [240911 18:16]:
> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > * Helge Deller <deller@kernel.org> [240911 15:20]:
> > > This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> > > sufficient to map memory for usage as stack on all architectures.
> > > Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> > > has to be used.
> > > To clarify, here is the relevant info from the mmap() man page:
> > >
> > > MAP_GROWSDOWN
> > >    This flag is used for stacks. It indicates to the kernel virtual
> > >    memory system that the mapping should extend downward in memory.  The
> > >    return address is one page lower than the memory area that is
> > >    actually created in the process's virtual address space.  Touching an
> > >    address in the "guard" page below the mapping will cause the mapping
> > >    to grow by a page. This growth can be repeated until the mapping
> > >    grows to within a page of the high end of the next lower mapping,
> > >    at which point touching the "guard" page will result in a SIGSEGV
> > >    signal.
> > >
> > > MAP_STACK (since Linux 2.6.27)
> > >    Allocate the mapping at an address suitable for a process or thread
> > >    stack.
> > >
> > >    This flag is currently a no-op on Linux. However, by employing this
> > >    flag, applications can ensure that they transparently obtain support
> > >    if the flag is implemented in the future. Thus, it is used in the
> > >    glibc threading implementation to allow for the fact that
> > >    some architectures may (later) require special treatment for
> > >    stack allocations. A further reason to employ this flag is
> > >    portability: MAP_STACK exists (and has an effect) on some
> > >    other systems (e.g., some of the BSDs).
> > >
> > > The reason to suggest this change is, that on the parisc architecture the
> > > stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> > > work. Note that there exists no MAP_GROWSUP flag.
> > > By changing the behaviour of MAP_STACK to mark the memory area with the
> > > VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> > > architecture) the MAP_STACK flag does exactly what people would expect on
> > > all platforms.
> > >
> > > This change should have no negative side-effect, as all code which
> > > used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> > >
> > > Signed-off-by: Helge Deller <deller@gmx.de>
> > >
> > > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > > index bcb201ab7a41..66bc72a0cb19 100644
> > > --- a/include/linux/mman.h
> > > +++ b/include/linux/mman.h
> > > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> > >       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> > >              _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> > >              _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > > +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
> >
> > Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
> > change the user interface to create a vma that will grow.  I'm not
> > entirely sure this is okay?
> 
> AFAICT, I don't see this is a problem. Currently huge page also skips
> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
> stack.

If someone is using MAP_STACK to avoid having a huge page, they will
also get a mapping that grows - which is different than what happens
today.

I'm not saying that's right, but someone could be abusing the existing
flag and this will change the behaviour.

> 
> >
> >
> > That is mmap(MAP_STACK) would set VM_NOHUGEPAGE right now, with this
> > change you'd get VM_NOHUGEPAGE | VM_GROWS<something>
> >
> > >              _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> > >              arch_calc_vm_flag_bits(flags);
> > >  }
> > >
> >
Helge Deller Sept. 12, 2024, 12:50 a.m. UTC | #4
On 9/12/24 01:05, Liam R. Howlett wrote:
> * Yang Shi <shy828301@gmail.com> [240911 18:16]:
>> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
>> <Liam.Howlett@oracle.com> wrote:
>>>
>>> * Helge Deller <deller@kernel.org> [240911 15:20]:
>>>> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
>>>> sufficient to map memory for usage as stack on all architectures.
>>>> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
>>>> has to be used.
>>>> To clarify, here is the relevant info from the mmap() man page:
>>>>
>>>> MAP_GROWSDOWN
>>>>     This flag is used for stacks. It indicates to the kernel virtual
>>>>     memory system that the mapping should extend downward in memory.  The
>>>>     return address is one page lower than the memory area that is
>>>>     actually created in the process's virtual address space.  Touching an
>>>>     address in the "guard" page below the mapping will cause the mapping
>>>>     to grow by a page. This growth can be repeated until the mapping
>>>>     grows to within a page of the high end of the next lower mapping,
>>>>     at which point touching the "guard" page will result in a SIGSEGV
>>>>     signal.
>>>>
>>>> MAP_STACK (since Linux 2.6.27)
>>>>     Allocate the mapping at an address suitable for a process or thread
>>>>     stack.
>>>>
>>>>     This flag is currently a no-op on Linux. However, by employing this
>>>>     flag, applications can ensure that they transparently obtain support
>>>>     if the flag is implemented in the future. Thus, it is used in the
>>>>     glibc threading implementation to allow for the fact that
>>>>     some architectures may (later) require special treatment for
>>>>     stack allocations. A further reason to employ this flag is
>>>>     portability: MAP_STACK exists (and has an effect) on some
>>>>     other systems (e.g., some of the BSDs).
>>>>
>>>> The reason to suggest this change is, that on the parisc architecture the
>>>> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
>>>> work. Note that there exists no MAP_GROWSUP flag.
>>>> By changing the behaviour of MAP_STACK to mark the memory area with the
>>>> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
>>>> architecture) the MAP_STACK flag does exactly what people would expect on
>>>> all platforms.
>>>>
>>>> This change should have no negative side-effect, as all code which
>>>> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
>>>>
>>>> Signed-off-by: Helge Deller <deller@gmx.de>
>>>>
>>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
>>>> index bcb201ab7a41..66bc72a0cb19 100644
>>>> --- a/include/linux/mman.h
>>>> +++ b/include/linux/mman.h
>>>> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
>>>>        return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>>>               _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>>>               _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
>>>> +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
>>>
>>> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
>>> change the user interface to create a vma that will grow.  I'm not
>>> entirely sure this is okay?
>>
>> AFAICT, I don't see this is a problem. Currently huge page also skips
>> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
>> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
>> stack.
>
> If someone is using MAP_STACK to avoid having a huge page, they will
> also get a mapping that grows - which is different than what happens
> today.
>
> I'm not saying that's right, but someone could be abusing the existing
> flag and this will change the behaviour.

Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
Why abusing MAP_STACK for that?

Helge

>>> That is mmap(MAP_STACK) would set VM_NOHUGEPAGE right now, with this
>>> change you'd get VM_NOHUGEPAGE | VM_GROWS<something>
>>>
>>>>               _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
>>>>               arch_calc_vm_flag_bits(flags);
>>>>   }
Yang Shi Sept. 12, 2024, 1:05 a.m. UTC | #5
On Wed, Sep 11, 2024 at 4:05 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Yang Shi <shy828301@gmail.com> [240911 18:16]:
> > On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Helge Deller <deller@kernel.org> [240911 15:20]:
> > > > This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> > > > sufficient to map memory for usage as stack on all architectures.
> > > > Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> > > > has to be used.
> > > > To clarify, here is the relevant info from the mmap() man page:
> > > >
> > > > MAP_GROWSDOWN
> > > >    This flag is used for stacks. It indicates to the kernel virtual
> > > >    memory system that the mapping should extend downward in memory.  The
> > > >    return address is one page lower than the memory area that is
> > > >    actually created in the process's virtual address space.  Touching an
> > > >    address in the "guard" page below the mapping will cause the mapping
> > > >    to grow by a page. This growth can be repeated until the mapping
> > > >    grows to within a page of the high end of the next lower mapping,
> > > >    at which point touching the "guard" page will result in a SIGSEGV
> > > >    signal.
> > > >
> > > > MAP_STACK (since Linux 2.6.27)
> > > >    Allocate the mapping at an address suitable for a process or thread
> > > >    stack.
> > > >
> > > >    This flag is currently a no-op on Linux. However, by employing this
> > > >    flag, applications can ensure that they transparently obtain support
> > > >    if the flag is implemented in the future. Thus, it is used in the
> > > >    glibc threading implementation to allow for the fact that
> > > >    some architectures may (later) require special treatment for
> > > >    stack allocations. A further reason to employ this flag is
> > > >    portability: MAP_STACK exists (and has an effect) on some
> > > >    other systems (e.g., some of the BSDs).
> > > >
> > > > The reason to suggest this change is, that on the parisc architecture the
> > > > stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> > > > work. Note that there exists no MAP_GROWSUP flag.
> > > > By changing the behaviour of MAP_STACK to mark the memory area with the
> > > > VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> > > > architecture) the MAP_STACK flag does exactly what people would expect on
> > > > all platforms.
> > > >
> > > > This change should have no negative side-effect, as all code which
> > > > used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> > > >
> > > > Signed-off-by: Helge Deller <deller@gmx.de>
> > > >
> > > > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > > > index bcb201ab7a41..66bc72a0cb19 100644
> > > > --- a/include/linux/mman.h
> > > > +++ b/include/linux/mman.h
> > > > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> > > >       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> > > >              _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> > > >              _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > > > +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
> > >
> > > Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
> > > change the user interface to create a vma that will grow.  I'm not
> > > entirely sure this is okay?
> >
> > AFAICT, I don't see this is a problem. Currently huge page also skips
> > the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
> > __thp_vma_allowable_orders() returns 0 if the vma is a temporary
> > stack.
>
> If someone is using MAP_STACK to avoid having a huge page, they will
> also get a mapping that grows - which is different than what happens
> today.

Yes, I agree. My point is no huge page + grow is fine.

>
> I'm not saying that's right, but someone could be abusing the existing
> flag and this will change the behaviour.

If you mean we will have more grow mapping but they are actually
unnecessary, then I agree someone could abuse the flag.

>
> >
> > >
> > >
> > > That is mmap(MAP_STACK) would set VM_NOHUGEPAGE right now, with this
> > > change you'd get VM_NOHUGEPAGE | VM_GROWS<something>
> > >
> > > >              _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> > > >              arch_calc_vm_flag_bits(flags);
> > > >  }
> > > >
> > >
Yang Shi Sept. 12, 2024, 1:08 a.m. UTC | #6
On Wed, Sep 11, 2024 at 5:50 PM Helge Deller <deller@gmx.de> wrote:
>
> On 9/12/24 01:05, Liam R. Howlett wrote:
> > * Yang Shi <shy828301@gmail.com> [240911 18:16]:
> >> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
> >> <Liam.Howlett@oracle.com> wrote:
> >>>
> >>> * Helge Deller <deller@kernel.org> [240911 15:20]:
> >>>> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> >>>> sufficient to map memory for usage as stack on all architectures.
> >>>> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> >>>> has to be used.
> >>>> To clarify, here is the relevant info from the mmap() man page:
> >>>>
> >>>> MAP_GROWSDOWN
> >>>>     This flag is used for stacks. It indicates to the kernel virtual
> >>>>     memory system that the mapping should extend downward in memory.  The
> >>>>     return address is one page lower than the memory area that is
> >>>>     actually created in the process's virtual address space.  Touching an
> >>>>     address in the "guard" page below the mapping will cause the mapping
> >>>>     to grow by a page. This growth can be repeated until the mapping
> >>>>     grows to within a page of the high end of the next lower mapping,
> >>>>     at which point touching the "guard" page will result in a SIGSEGV
> >>>>     signal.
> >>>>
> >>>> MAP_STACK (since Linux 2.6.27)
> >>>>     Allocate the mapping at an address suitable for a process or thread
> >>>>     stack.
> >>>>
> >>>>     This flag is currently a no-op on Linux. However, by employing this
> >>>>     flag, applications can ensure that they transparently obtain support
> >>>>     if the flag is implemented in the future. Thus, it is used in the
> >>>>     glibc threading implementation to allow for the fact that
> >>>>     some architectures may (later) require special treatment for
> >>>>     stack allocations. A further reason to employ this flag is
> >>>>     portability: MAP_STACK exists (and has an effect) on some
> >>>>     other systems (e.g., some of the BSDs).
> >>>>
> >>>> The reason to suggest this change is, that on the parisc architecture the
> >>>> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> >>>> work. Note that there exists no MAP_GROWSUP flag.
> >>>> By changing the behaviour of MAP_STACK to mark the memory area with the
> >>>> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> >>>> architecture) the MAP_STACK flag does exactly what people would expect on
> >>>> all platforms.
> >>>>
> >>>> This change should have no negative side-effect, as all code which
> >>>> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> >>>>
> >>>> Signed-off-by: Helge Deller <deller@gmx.de>
> >>>>
> >>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
> >>>> index bcb201ab7a41..66bc72a0cb19 100644
> >>>> --- a/include/linux/mman.h
> >>>> +++ b/include/linux/mman.h
> >>>> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> >>>>        return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> >>>>               _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> >>>>               _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> >>>> +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
> >>>
> >>> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
> >>> change the user interface to create a vma that will grow.  I'm not
> >>> entirely sure this is okay?
> >>
> >> AFAICT, I don't see this is a problem. Currently huge page also skips
> >> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
> >> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
> >> stack.
> >
> > If someone is using MAP_STACK to avoid having a huge page, they will
> > also get a mapping that grows - which is different than what happens
> > today.
> >
> > I'm not saying that's right, but someone could be abusing the existing
> > flag and this will change the behaviour.
>
> Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
> Why abusing MAP_STACK for that?

Different sources and reports showed having huge pages for stack
mapping hurts performance. A lot of applications, for example, pthread
lib, allocate stack with MAP_STACK and they don't call MADV_NOHUGEPAGE
on stack mapping.

>
> Helge
>
> >>> That is mmap(MAP_STACK) would set VM_NOHUGEPAGE right now, with this
> >>> change you'd get VM_NOHUGEPAGE | VM_GROWS<something>
> >>>
> >>>>               _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> >>>>               arch_calc_vm_flag_bits(flags);
> >>>>   }
>
Liam R. Howlett Sept. 12, 2024, 1:32 a.m. UTC | #7
* Helge Deller <deller@gmx.de> [240911 20:51]:
> On 9/12/24 01:05, Liam R. Howlett wrote:
> > * Yang Shi <shy828301@gmail.com> [240911 18:16]:
> > > On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
> > > <Liam.Howlett@oracle.com> wrote:
> > > > 
> > > > * Helge Deller <deller@kernel.org> [240911 15:20]:
> > > > > This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> > > > > sufficient to map memory for usage as stack on all architectures.
> > > > > Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> > > > > has to be used.
> > > > > To clarify, here is the relevant info from the mmap() man page:
> > > > > 
> > > > > MAP_GROWSDOWN
> > > > >     This flag is used for stacks. It indicates to the kernel virtual
> > > > >     memory system that the mapping should extend downward in memory.  The
> > > > >     return address is one page lower than the memory area that is
> > > > >     actually created in the process's virtual address space.  Touching an
> > > > >     address in the "guard" page below the mapping will cause the mapping
> > > > >     to grow by a page. This growth can be repeated until the mapping
> > > > >     grows to within a page of the high end of the next lower mapping,
> > > > >     at which point touching the "guard" page will result in a SIGSEGV
> > > > >     signal.
> > > > > 
> > > > > MAP_STACK (since Linux 2.6.27)
> > > > >     Allocate the mapping at an address suitable for a process or thread
> > > > >     stack.
> > > > > 
> > > > >     This flag is currently a no-op on Linux. However, by employing this
> > > > >     flag, applications can ensure that they transparently obtain support
> > > > >     if the flag is implemented in the future. Thus, it is used in the
> > > > >     glibc threading implementation to allow for the fact that
> > > > >     some architectures may (later) require special treatment for
> > > > >     stack allocations. A further reason to employ this flag is
> > > > >     portability: MAP_STACK exists (and has an effect) on some
> > > > >     other systems (e.g., some of the BSDs).
> > > > > 
> > > > > The reason to suggest this change is, that on the parisc architecture the
> > > > > stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> > > > > work. Note that there exists no MAP_GROWSUP flag.
> > > > > By changing the behaviour of MAP_STACK to mark the memory area with the
> > > > > VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> > > > > architecture) the MAP_STACK flag does exactly what people would expect on
> > > > > all platforms.
> > > > > 
> > > > > This change should have no negative side-effect, as all code which
> > > > > used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> > > > > 
> > > > > Signed-off-by: Helge Deller <deller@gmx.de>
> > > > > 
> > > > > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > > > > index bcb201ab7a41..66bc72a0cb19 100644
> > > > > --- a/include/linux/mman.h
> > > > > +++ b/include/linux/mman.h
> > > > > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> > > > >        return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> > > > >               _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> > > > >               _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > > > > +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
> > > > 
> > > > Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
> > > > change the user interface to create a vma that will grow.  I'm not
> > > > entirely sure this is okay?
> > > 
> > > AFAICT, I don't see this is a problem. Currently huge page also skips
> > > the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
> > > __thp_vma_allowable_orders() returns 0 if the vma is a temporary
> > > stack.
> > 
> > If someone is using MAP_STACK to avoid having a huge page, they will
> > also get a mapping that grows - which is different than what happens
> > today.
> > 
> > I'm not saying that's right, but someone could be abusing the existing
> > flag and this will change the behaviour.
> 
> Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
> Why abusing MAP_STACK for that?

I can think of two answers:
1. An error that has worked without issues so far
2. One less system call

I'm not saying this really is a blocker, but the change is not without
risk as it does change behaviour the user could see.

Interestingly enough, the man page is incorrect as it is written because
the flag is not strictly a no-op; it ensures no huge pages.  So the
feature of applying VM_NOHUGEPAGE with the use of MAP_STACK is not
documented today.

What happens to call that use the mmap(MAP_GROWSDOWN | MAP_STACK) on
parisc today?  How does this change with your VM_STACK change?  Wouldn't
this result in failed mappings?  VM_GROWSDOWN | VM_GROWSUP would fail in
do_mmap(), and these would be set if you map MAP_STACK to VM_STACK which
is defined as VM_GROWSUP?

Thanks,
Liam
Liam R. Howlett Sept. 12, 2024, 1:42 a.m. UTC | #8
* Yang Shi <shy828301@gmail.com> [240911 21:08]:
> On Wed, Sep 11, 2024 at 5:50 PM Helge Deller <deller@gmx.de> wrote:
> >
> > On 9/12/24 01:05, Liam R. Howlett wrote:
> > > * Yang Shi <shy828301@gmail.com> [240911 18:16]:
> > >> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
> > >> <Liam.Howlett@oracle.com> wrote:
> > >>>
> > >>> * Helge Deller <deller@kernel.org> [240911 15:20]:
> > >>>> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> > >>>> sufficient to map memory for usage as stack on all architectures.
> > >>>> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> > >>>> has to be used.
> > >>>> To clarify, here is the relevant info from the mmap() man page:
> > >>>>
> > >>>> MAP_GROWSDOWN
> > >>>>     This flag is used for stacks. It indicates to the kernel virtual
> > >>>>     memory system that the mapping should extend downward in memory.  The
> > >>>>     return address is one page lower than the memory area that is
> > >>>>     actually created in the process's virtual address space.  Touching an
> > >>>>     address in the "guard" page below the mapping will cause the mapping
> > >>>>     to grow by a page. This growth can be repeated until the mapping
> > >>>>     grows to within a page of the high end of the next lower mapping,
> > >>>>     at which point touching the "guard" page will result in a SIGSEGV
> > >>>>     signal.
> > >>>>
> > >>>> MAP_STACK (since Linux 2.6.27)
> > >>>>     Allocate the mapping at an address suitable for a process or thread
> > >>>>     stack.
> > >>>>
> > >>>>     This flag is currently a no-op on Linux. However, by employing this
> > >>>>     flag, applications can ensure that they transparently obtain support
> > >>>>     if the flag is implemented in the future. Thus, it is used in the
> > >>>>     glibc threading implementation to allow for the fact that
> > >>>>     some architectures may (later) require special treatment for
> > >>>>     stack allocations. A further reason to employ this flag is
> > >>>>     portability: MAP_STACK exists (and has an effect) on some
> > >>>>     other systems (e.g., some of the BSDs).
> > >>>>
> > >>>> The reason to suggest this change is, that on the parisc architecture the
> > >>>> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> > >>>> work. Note that there exists no MAP_GROWSUP flag.
> > >>>> By changing the behaviour of MAP_STACK to mark the memory area with the
> > >>>> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> > >>>> architecture) the MAP_STACK flag does exactly what people would expect on
> > >>>> all platforms.
> > >>>>
> > >>>> This change should have no negative side-effect, as all code which
> > >>>> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> > >>>>
> > >>>> Signed-off-by: Helge Deller <deller@gmx.de>
> > >>>>
> > >>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
> > >>>> index bcb201ab7a41..66bc72a0cb19 100644
> > >>>> --- a/include/linux/mman.h
> > >>>> +++ b/include/linux/mman.h
> > >>>> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> > >>>>        return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> > >>>>               _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> > >>>>               _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > >>>> +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
> > >>>
> > >>> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
> > >>> change the user interface to create a vma that will grow.  I'm not
> > >>> entirely sure this is okay?
> > >>
> > >> AFAICT, I don't see this is a problem. Currently huge page also skips
> > >> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
> > >> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
> > >> stack.
> > >
> > > If someone is using MAP_STACK to avoid having a huge page, they will
> > > also get a mapping that grows - which is different than what happens
> > > today.
> > >
> > > I'm not saying that's right, but someone could be abusing the existing
> > > flag and this will change the behaviour.
> >
> > Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
> > Why abusing MAP_STACK for that?
> 
> Different sources and reports showed having huge pages for stack
> mapping hurts performance. A lot of applications, for example, pthread
> lib, allocate stack with MAP_STACK and they don't call MADV_NOHUGEPAGE
> on stack mapping.
> 

It makes sense to have a stack with NOHUGEPAGE, but does anyone use
MAP_STACK to avoid the extra syscall to madv to set it on mappings that
are NOT stacks which would now become stack-like with this change?

...
> > >>>>               _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> > >>>>               arch_calc_vm_flag_bits(flags);
> > >>>>   }
> >
Helge Deller Sept. 12, 2024, 1:45 a.m. UTC | #9
On 9/12/24 03:08, Yang Shi wrote:
> On Wed, Sep 11, 2024 at 5:50 PM Helge Deller <deller@gmx.de> wrote:
>>
>> On 9/12/24 01:05, Liam R. Howlett wrote:
>>> * Yang Shi <shy828301@gmail.com> [240911 18:16]:
>>>> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
>>>> <Liam.Howlett@oracle.com> wrote:
>>>>>
>>>>> * Helge Deller <deller@kernel.org> [240911 15:20]:
>>>>>> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
>>>>>> sufficient to map memory for usage as stack on all architectures.
>>>>>> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
>>>>>> has to be used.
>>>>>> To clarify, here is the relevant info from the mmap() man page:
>>>>>>
>>>>>> MAP_GROWSDOWN
>>>>>>      This flag is used for stacks. It indicates to the kernel virtual
>>>>>>      memory system that the mapping should extend downward in memory.  The
>>>>>>      return address is one page lower than the memory area that is
>>>>>>      actually created in the process's virtual address space.  Touching an
>>>>>>      address in the "guard" page below the mapping will cause the mapping
>>>>>>      to grow by a page. This growth can be repeated until the mapping
>>>>>>      grows to within a page of the high end of the next lower mapping,
>>>>>>      at which point touching the "guard" page will result in a SIGSEGV
>>>>>>      signal.
>>>>>>
>>>>>> MAP_STACK (since Linux 2.6.27)
>>>>>>      Allocate the mapping at an address suitable for a process or thread
>>>>>>      stack.
>>>>>>
>>>>>>      This flag is currently a no-op on Linux. However, by employing this
>>>>>>      flag, applications can ensure that they transparently obtain support
>>>>>>      if the flag is implemented in the future. Thus, it is used in the
>>>>>>      glibc threading implementation to allow for the fact that
>>>>>>      some architectures may (later) require special treatment for
>>>>>>      stack allocations. A further reason to employ this flag is
>>>>>>      portability: MAP_STACK exists (and has an effect) on some
>>>>>>      other systems (e.g., some of the BSDs).
>>>>>>
>>>>>> The reason to suggest this change is, that on the parisc architecture the
>>>>>> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
>>>>>> work. Note that there exists no MAP_GROWSUP flag.
>>>>>> By changing the behaviour of MAP_STACK to mark the memory area with the
>>>>>> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
>>>>>> architecture) the MAP_STACK flag does exactly what people would expect on
>>>>>> all platforms.
>>>>>>
>>>>>> This change should have no negative side-effect, as all code which
>>>>>> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
>>>>>>
>>>>>> Signed-off-by: Helge Deller <deller@gmx.de>
>>>>>>
>>>>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
>>>>>> index bcb201ab7a41..66bc72a0cb19 100644
>>>>>> --- a/include/linux/mman.h
>>>>>> +++ b/include/linux/mman.h
>>>>>> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
>>>>>>         return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>>>>>                _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>>>>>                _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
>>>>>> +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
>>>>>
>>>>> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
>>>>> change the user interface to create a vma that will grow.  I'm not
>>>>> entirely sure this is okay?
>>>>
>>>> AFAICT, I don't see this is a problem. Currently huge page also skips
>>>> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
>>>> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
>>>> stack.
>>>
>>> If someone is using MAP_STACK to avoid having a huge page, they will
>>> also get a mapping that grows - which is different than what happens
>>> today.
>>>
>>> I'm not saying that's right, but someone could be abusing the existing
>>> flag and this will change the behaviour.
>>
>> Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
>> Why abusing MAP_STACK for that?
>
> Different sources and reports showed having huge pages for stack
> mapping hurts performance. A lot of applications, for example, pthread
> lib, allocate stack with MAP_STACK and they don't call MADV_NOHUGEPAGE
> on stack mapping.

That's true, and my patch does not change the behaviour wrt huge pages.
Using MAP_STACK still tags the area VM_NOHUGEPAGE. See below...


>>>>> That is mmap(MAP_STACK) would set VM_NOHUGEPAGE right now, with this
>>>>> change you'd get VM_NOHUGEPAGE | VM_GROWS<something>
>>>>>
>>>>>>                _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
>>>>>>                arch_calc_vm_flag_bits(flags);
>>>>>>    }
>>
Helge Deller Sept. 12, 2024, 2:09 a.m. UTC | #10
On 9/12/24 03:32, Liam R. Howlett wrote:
> * Helge Deller <deller@gmx.de> [240911 20:51]:
>> On 9/12/24 01:05, Liam R. Howlett wrote:
>>> * Yang Shi <shy828301@gmail.com> [240911 18:16]:
>>>> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
>>>> <Liam.Howlett@oracle.com> wrote:
>>>>>
>>>>> * Helge Deller <deller@kernel.org> [240911 15:20]:
>>>>>> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
>>>>>> sufficient to map memory for usage as stack on all architectures.
>>>>>> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
>>>>>> has to be used.
>>>>>> To clarify, here is the relevant info from the mmap() man page:
>>>>>>
>>>>>> MAP_GROWSDOWN
>>>>>>      This flag is used for stacks. It indicates to the kernel virtual
>>>>>>      memory system that the mapping should extend downward in memory.  The
>>>>>>      return address is one page lower than the memory area that is
>>>>>>      actually created in the process's virtual address space.  Touching an
>>>>>>      address in the "guard" page below the mapping will cause the mapping
>>>>>>      to grow by a page. This growth can be repeated until the mapping
>>>>>>      grows to within a page of the high end of the next lower mapping,
>>>>>>      at which point touching the "guard" page will result in a SIGSEGV
>>>>>>      signal.
>>>>>>
>>>>>> MAP_STACK (since Linux 2.6.27)
>>>>>>      Allocate the mapping at an address suitable for a process or thread
>>>>>>      stack.
>>>>>>
>>>>>>      This flag is currently a no-op on Linux. However, by employing this
>>>>>>      flag, applications can ensure that they transparently obtain support
>>>>>>      if the flag is implemented in the future. Thus, it is used in the
>>>>>>      glibc threading implementation to allow for the fact that
>>>>>>      some architectures may (later) require special treatment for
>>>>>>      stack allocations. A further reason to employ this flag is
>>>>>>      portability: MAP_STACK exists (and has an effect) on some
>>>>>>      other systems (e.g., some of the BSDs).
>>>>>>
>>>>>> The reason to suggest this change is, that on the parisc architecture the
>>>>>> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
>>>>>> work. Note that there exists no MAP_GROWSUP flag.
>>>>>> By changing the behaviour of MAP_STACK to mark the memory area with the
>>>>>> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
>>>>>> architecture) the MAP_STACK flag does exactly what people would expect on
>>>>>> all platforms.
>>>>>>
>>>>>> This change should have no negative side-effect, as all code which
>>>>>> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
>>>>>>
>>>>>> Signed-off-by: Helge Deller <deller@gmx.de>
>>>>>>
>>>>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
>>>>>> index bcb201ab7a41..66bc72a0cb19 100644
>>>>>> --- a/include/linux/mman.h
>>>>>> +++ b/include/linux/mman.h
>>>>>> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
>>>>>>         return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>>>>>                _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>>>>>                _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
>>>>>> +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
>>>>>
>>>>> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
>>>>> change the user interface to create a vma that will grow.  I'm not
>>>>> entirely sure this is okay?
>>>>
>>>> AFAICT, I don't see this is a problem. Currently huge page also skips
>>>> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
>>>> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
>>>> stack.
>>>
>>> If someone is using MAP_STACK to avoid having a huge page, they will
>>> also get a mapping that grows - which is different than what happens
>>> today.
>>>
>>> I'm not saying that's right, but someone could be abusing the existing
>>> flag and this will change the behaviour.
>>
>> Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
>> Why abusing MAP_STACK for that?
>
> I can think of two answers:
> 1. An error that has worked without issues so far
> 2. One less system call
>
> I'm not saying this really is a blocker, but the change is not without
> risk as it does change behaviour the user could see.
>
> Interestingly enough, the man page is incorrect as it is written because
> the flag is not strictly a no-op; it ensures no huge pages.  So the
> feature of applying VM_NOHUGEPAGE with the use of MAP_STACK is not
> documented today.

Yes.

> What happens to call that use the mmap(MAP_GROWSDOWN | MAP_STACK) on
> parisc today?

Today, without my patch, on parisc the area is then flagged VM_GROWSDOWN only.
In that case, stack expansion will not work, as it checks for VM_GROWSUP.

> How does this change with your VM_STACK change?  Wouldn't this result
> in failed mappings?
> VM_GROWSDOWN | VM_GROWSUP would fail in do_mmap(), and these would be> set if you map MAP_STACK to VM_STACK which is defined as VM_GROWSUP?

Right, with my change, the area will be tagged VM_GROWSUP and VM_GROWSDOWN.
Due to VM_GROWSUP stack expansion works.
The mapping doesn't fail in do_mmap(), because stacks are not file-mapped
or shared or droppable. They should be mapped with MAP_PRIVATE, right?


Another option is to introduce an alias, e.g.:
#define MAP_STACK_EXPANDABLE	MAP_GROWSDOWN
and then
diff --git a/include/linux/mman.h b/include/linux/mman.h
index bcb201ab7a41..6a7ec3e0078a 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -153,7 +153,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
  static inline unsigned long
  calc_vm_flag_bits(unsigned long flags)
  {
-       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
+       return _calc_vm_trans(flags, MAP_STACK_EXPANDABLE,  VM_STACK ) |
                _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
                _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
                _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |

This simply uses the existing MAP_GROWSDOWN flag to mark a stack,
is independend on the stack growth direction and is fully backwards
compatible to existing behaviour.

I prefer my initial proposal to change MAP_STACK though, as it's
simpler and clearer.

Helge
Yang Shi Sept. 12, 2024, 2:39 a.m. UTC | #11
On Wed, Sep 11, 2024 at 6:42 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Yang Shi <shy828301@gmail.com> [240911 21:08]:
> > On Wed, Sep 11, 2024 at 5:50 PM Helge Deller <deller@gmx.de> wrote:
> > >
> > > On 9/12/24 01:05, Liam R. Howlett wrote:
> > > > * Yang Shi <shy828301@gmail.com> [240911 18:16]:
> > > >> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
> > > >> <Liam.Howlett@oracle.com> wrote:
> > > >>>
> > > >>> * Helge Deller <deller@kernel.org> [240911 15:20]:
> > > >>>> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> > > >>>> sufficient to map memory for usage as stack on all architectures.
> > > >>>> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> > > >>>> has to be used.
> > > >>>> To clarify, here is the relevant info from the mmap() man page:
> > > >>>>
> > > >>>> MAP_GROWSDOWN
> > > >>>>     This flag is used for stacks. It indicates to the kernel virtual
> > > >>>>     memory system that the mapping should extend downward in memory.  The
> > > >>>>     return address is one page lower than the memory area that is
> > > >>>>     actually created in the process's virtual address space.  Touching an
> > > >>>>     address in the "guard" page below the mapping will cause the mapping
> > > >>>>     to grow by a page. This growth can be repeated until the mapping
> > > >>>>     grows to within a page of the high end of the next lower mapping,
> > > >>>>     at which point touching the "guard" page will result in a SIGSEGV
> > > >>>>     signal.
> > > >>>>
> > > >>>> MAP_STACK (since Linux 2.6.27)
> > > >>>>     Allocate the mapping at an address suitable for a process or thread
> > > >>>>     stack.
> > > >>>>
> > > >>>>     This flag is currently a no-op on Linux. However, by employing this
> > > >>>>     flag, applications can ensure that they transparently obtain support
> > > >>>>     if the flag is implemented in the future. Thus, it is used in the
> > > >>>>     glibc threading implementation to allow for the fact that
> > > >>>>     some architectures may (later) require special treatment for
> > > >>>>     stack allocations. A further reason to employ this flag is
> > > >>>>     portability: MAP_STACK exists (and has an effect) on some
> > > >>>>     other systems (e.g., some of the BSDs).
> > > >>>>
> > > >>>> The reason to suggest this change is, that on the parisc architecture the
> > > >>>> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> > > >>>> work. Note that there exists no MAP_GROWSUP flag.
> > > >>>> By changing the behaviour of MAP_STACK to mark the memory area with the
> > > >>>> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> > > >>>> architecture) the MAP_STACK flag does exactly what people would expect on
> > > >>>> all platforms.
> > > >>>>
> > > >>>> This change should have no negative side-effect, as all code which
> > > >>>> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> > > >>>>
> > > >>>> Signed-off-by: Helge Deller <deller@gmx.de>
> > > >>>>
> > > >>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
> > > >>>> index bcb201ab7a41..66bc72a0cb19 100644
> > > >>>> --- a/include/linux/mman.h
> > > >>>> +++ b/include/linux/mman.h
> > > >>>> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> > > >>>>        return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> > > >>>>               _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> > > >>>>               _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > > >>>> +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
> > > >>>
> > > >>> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
> > > >>> change the user interface to create a vma that will grow.  I'm not
> > > >>> entirely sure this is okay?
> > > >>
> > > >> AFAICT, I don't see this is a problem. Currently huge page also skips
> > > >> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
> > > >> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
> > > >> stack.
> > > >
> > > > If someone is using MAP_STACK to avoid having a huge page, they will
> > > > also get a mapping that grows - which is different than what happens
> > > > today.
> > > >
> > > > I'm not saying that's right, but someone could be abusing the existing
> > > > flag and this will change the behaviour.
> > >
> > > Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
> > > Why abusing MAP_STACK for that?
> >
> > Different sources and reports showed having huge pages for stack
> > mapping hurts performance. A lot of applications, for example, pthread
> > lib, allocate stack with MAP_STACK and they don't call MADV_NOHUGEPAGE
> > on stack mapping.
> >
>
> It makes sense to have a stack with NOHUGEPAGE, but does anyone use
> MAP_STACK to avoid the extra syscall to madv to set it on mappings that
> are NOT stacks which would now become stack-like with this change?

AFAICT, I'm not aware of such usecase. It is definitely not
recommended and misuse of MAP_STACK. I don't see how we can prevent
this in kernel other than document it properly.

>
> ...
> > > >>>>               _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> > > >>>>               arch_calc_vm_flag_bits(flags);
> > > >>>>   }
> > >
Liam R. Howlett Sept. 12, 2024, 3:43 p.m. UTC | #12
* Helge Deller <deller@gmx.de> [240911 22:10]:
> On 9/12/24 03:32, Liam R. Howlett wrote:
> > * Helge Deller <deller@gmx.de> [240911 20:51]:
> > > On 9/12/24 01:05, Liam R. Howlett wrote:
> > > > * Yang Shi <shy828301@gmail.com> [240911 18:16]:
> > > > > On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
> > > > > <Liam.Howlett@oracle.com> wrote:
> > > > > > 
> > > > > > * Helge Deller <deller@kernel.org> [240911 15:20]:
> > > > > > > This is a RFC to change the behaviour of mmap(MAP_STACK) to be
> > > > > > > sufficient to map memory for usage as stack on all architectures.
> > > > > > > Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
> > > > > > > has to be used.
> > > > > > > To clarify, here is the relevant info from the mmap() man page:
> > > > > > > 
> > > > > > > MAP_GROWSDOWN
> > > > > > >      This flag is used for stacks. It indicates to the kernel virtual
> > > > > > >      memory system that the mapping should extend downward in memory.  The
> > > > > > >      return address is one page lower than the memory area that is
> > > > > > >      actually created in the process's virtual address space.  Touching an
> > > > > > >      address in the "guard" page below the mapping will cause the mapping
> > > > > > >      to grow by a page. This growth can be repeated until the mapping
> > > > > > >      grows to within a page of the high end of the next lower mapping,
> > > > > > >      at which point touching the "guard" page will result in a SIGSEGV
> > > > > > >      signal.
> > > > > > > 
> > > > > > > MAP_STACK (since Linux 2.6.27)
> > > > > > >      Allocate the mapping at an address suitable for a process or thread
> > > > > > >      stack.
> > > > > > > 
> > > > > > >      This flag is currently a no-op on Linux. However, by employing this
> > > > > > >      flag, applications can ensure that they transparently obtain support
> > > > > > >      if the flag is implemented in the future. Thus, it is used in the
> > > > > > >      glibc threading implementation to allow for the fact that
> > > > > > >      some architectures may (later) require special treatment for
> > > > > > >      stack allocations. A further reason to employ this flag is
> > > > > > >      portability: MAP_STACK exists (and has an effect) on some
> > > > > > >      other systems (e.g., some of the BSDs).
> > > > > > > 
> > > > > > > The reason to suggest this change is, that on the parisc architecture the
> > > > > > > stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
> > > > > > > work. Note that there exists no MAP_GROWSUP flag.
> > > > > > > By changing the behaviour of MAP_STACK to mark the memory area with the
> > > > > > > VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
> > > > > > > architecture) the MAP_STACK flag does exactly what people would expect on
> > > > > > > all platforms.
> > > > > > > 
> > > > > > > This change should have no negative side-effect, as all code which
> > > > > > > used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
> > > > > > > 
> > > > > > > Signed-off-by: Helge Deller <deller@gmx.de>
> > > > > > > 
> > > > > > > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > > > > > > index bcb201ab7a41..66bc72a0cb19 100644
> > > > > > > --- a/include/linux/mman.h
> > > > > > > +++ b/include/linux/mman.h
> > > > > > > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> > > > > > >         return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> > > > > > >                _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> > > > > > >                _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > > > > > > +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
> > > > > > 
> > > > > > Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
> > > > > > change the user interface to create a vma that will grow.  I'm not
> > > > > > entirely sure this is okay?
> > > > > 
> > > > > AFAICT, I don't see this is a problem. Currently huge page also skips
> > > > > the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
> > > > > __thp_vma_allowable_orders() returns 0 if the vma is a temporary
> > > > > stack.
> > > > 
> > > > If someone is using MAP_STACK to avoid having a huge page, they will
> > > > also get a mapping that grows - which is different than what happens
> > > > today.
> > > > 
> > > > I'm not saying that's right, but someone could be abusing the existing
> > > > flag and this will change the behaviour.
> > > 
> > > Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
> > > Why abusing MAP_STACK for that?
> > 
> > I can think of two answers:
> > 1. An error that has worked without issues so far
> > 2. One less system call
> > 
> > I'm not saying this really is a blocker, but the change is not without
> > risk as it does change behaviour the user could see.
> > 
> > Interestingly enough, the man page is incorrect as it is written because
> > the flag is not strictly a no-op; it ensures no huge pages.  So the
> > feature of applying VM_NOHUGEPAGE with the use of MAP_STACK is not
> > documented today.
> 
> Yes.
> 
> > What happens to call that use the mmap(MAP_GROWSDOWN | MAP_STACK) on
> > parisc today?
> 
> Today, without my patch, on parisc the area is then flagged VM_GROWSDOWN only.
> In that case, stack expansion will not work, as it checks for VM_GROWSUP.
> 
> > How does this change with your VM_STACK change?  Wouldn't this result
> > in failed mappings?
> > VM_GROWSDOWN | VM_GROWSUP would fail in do_mmap(), and these would be> set if you map MAP_STACK to VM_STACK which is defined as VM_GROWSUP?
> 
> Right, with my change, the area will be tagged VM_GROWSUP and VM_GROWSDOWN.
> Due to VM_GROWSUP stack expansion works.
> The mapping doesn't fail in do_mmap(), because stacks are not file-mapped
> or shared or droppable. They should be mapped with MAP_PRIVATE, right?

Oh my, yes.  So now you will have a stack that can expand in either
direction, but it's okay because one direction isn't checked.  I sure
hope the rest of the code is correctly #ifdef'ed for this.

> 
> 
> Another option is to introduce an alias, e.g.:
> #define MAP_STACK_EXPANDABLE	MAP_GROWSDOWN
> and then

I don't like either of these options.

I guess you could also detect the MAP_STACK and MAP_GROWSDOWN and change
the flags for parisc, which I also don't like, but since parisc is the
only arch using this it's hard to justify a change that may cause issues
in other archs.

A quick grep shows we set VM_STACK_DEFAULT_FLAGS in x86 and powerpc,
which could affect what happens here with your change.  I am concerned
about the bleeding of other flags through this change.


> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index bcb201ab7a41..6a7ec3e0078a 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -153,7 +153,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
>  static inline unsigned long
>  calc_vm_flag_bits(unsigned long flags)
>  {
> -       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> +       return _calc_vm_trans(flags, MAP_STACK_EXPANDABLE,  VM_STACK ) |
>                _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>                _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
>                _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> 
> This simply uses the existing MAP_GROWSDOWN flag to mark a stack,
> is independend on the stack growth direction and is fully backwards
> compatible to existing behaviour.
> 
> I prefer my initial proposal to change MAP_STACK though, as it's
> simpler and clearer.
> 
> Helge
>
Helge Deller Sept. 12, 2024, 5:37 p.m. UTC | #13
On 9/12/24 17:43, Liam R. Howlett wrote:
> * Helge Deller <deller@gmx.de> [240911 22:10]:
>> On 9/12/24 03:32, Liam R. Howlett wrote:
>>> * Helge Deller <deller@gmx.de> [240911 20:51]:
>>>> On 9/12/24 01:05, Liam R. Howlett wrote:
>>>>> * Yang Shi <shy828301@gmail.com> [240911 18:16]:
>>>>>> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
>>>>>> <Liam.Howlett@oracle.com> wrote:
>>>>>>>
>>>>>>> * Helge Deller <deller@kernel.org> [240911 15:20]:
>>>>>>>> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
>>>>>>>> sufficient to map memory for usage as stack on all architectures.
>>>>>>>> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
>>>>>>>> has to be used.
>>>>>>>> To clarify, here is the relevant info from the mmap() man page:
>>>>>>>>
>>>>>>>> MAP_GROWSDOWN
>>>>>>>>       This flag is used for stacks. It indicates to the kernel virtual
>>>>>>>>       memory system that the mapping should extend downward in memory.  The
>>>>>>>>       return address is one page lower than the memory area that is
>>>>>>>>       actually created in the process's virtual address space.  Touching an
>>>>>>>>       address in the "guard" page below the mapping will cause the mapping
>>>>>>>>       to grow by a page. This growth can be repeated until the mapping
>>>>>>>>       grows to within a page of the high end of the next lower mapping,
>>>>>>>>       at which point touching the "guard" page will result in a SIGSEGV
>>>>>>>>       signal.
>>>>>>>>
>>>>>>>> MAP_STACK (since Linux 2.6.27)
>>>>>>>>       Allocate the mapping at an address suitable for a process or thread
>>>>>>>>       stack.
>>>>>>>>
>>>>>>>>       This flag is currently a no-op on Linux. However, by employing this
>>>>>>>>       flag, applications can ensure that they transparently obtain support
>>>>>>>>       if the flag is implemented in the future. Thus, it is used in the
>>>>>>>>       glibc threading implementation to allow for the fact that
>>>>>>>>       some architectures may (later) require special treatment for
>>>>>>>>       stack allocations. A further reason to employ this flag is
>>>>>>>>       portability: MAP_STACK exists (and has an effect) on some
>>>>>>>>       other systems (e.g., some of the BSDs).
>>>>>>>>
>>>>>>>> The reason to suggest this change is, that on the parisc architecture the
>>>>>>>> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
>>>>>>>> work. Note that there exists no MAP_GROWSUP flag.
>>>>>>>> By changing the behaviour of MAP_STACK to mark the memory area with the
>>>>>>>> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
>>>>>>>> architecture) the MAP_STACK flag does exactly what people would expect on
>>>>>>>> all platforms.
>>>>>>>>
>>>>>>>> This change should have no negative side-effect, as all code which
>>>>>>>> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
>>>>>>>>
>>>>>>>> Signed-off-by: Helge Deller <deller@gmx.de>
>>>>>>>>
>>>>>>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
>>>>>>>> index bcb201ab7a41..66bc72a0cb19 100644
>>>>>>>> --- a/include/linux/mman.h
>>>>>>>> +++ b/include/linux/mman.h
>>>>>>>> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
>>>>>>>>          return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>>>>>>>                 _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>>>>>>>                 _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
>>>>>>>> +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
>>>>>>>
>>>>>>> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
>>>>>>> change the user interface to create a vma that will grow.  I'm not
>>>>>>> entirely sure this is okay?
>>>>>>
>>>>>> AFAICT, I don't see this is a problem. Currently huge page also skips
>>>>>> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
>>>>>> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
>>>>>> stack.
>>>>>
>>>>> If someone is using MAP_STACK to avoid having a huge page, they will
>>>>> also get a mapping that grows - which is different than what happens
>>>>> today.
>>>>>
>>>>> I'm not saying that's right, but someone could be abusing the existing
>>>>> flag and this will change the behaviour.
>>>>
>>>> Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
>>>> Why abusing MAP_STACK for that?
>>>
>>> I can think of two answers:
>>> 1. An error that has worked without issues so far
>>> 2. One less system call
>>>
>>> I'm not saying this really is a blocker, but the change is not without
>>> risk as it does change behaviour the user could see.
>>>
>>> Interestingly enough, the man page is incorrect as it is written because
>>> the flag is not strictly a no-op; it ensures no huge pages.  So the
>>> feature of applying VM_NOHUGEPAGE with the use of MAP_STACK is not
>>> documented today.
>>
>> Yes.
>>
>>> What happens to call that use the mmap(MAP_GROWSDOWN | MAP_STACK) on
>>> parisc today?
>>
>> Today, without my patch, on parisc the area is then flagged VM_GROWSDOWN only.
>> In that case, stack expansion will not work, as it checks for VM_GROWSUP.
>>
>>> How does this change with your VM_STACK change?  Wouldn't this result
>>> in failed mappings?
>>> VM_GROWSDOWN | VM_GROWSUP would fail in do_mmap(), and these would be> set if you map MAP_STACK to VM_STACK which is defined as VM_GROWSUP?
>>
>> Right, with my change, the area will be tagged VM_GROWSUP and VM_GROWSDOWN.
>> Due to VM_GROWSUP stack expansion works.
>> The mapping doesn't fail in do_mmap(), because stacks are not file-mapped
>> or shared or droppable. They should be mapped with MAP_PRIVATE, right?
>
> Oh my, yes.  So now you will have a stack that can expand in either
> direction, but it's okay because one direction isn't checked.  I sure
> hope the rest of the code is correctly #ifdef'ed for this.

Only one direction will be handled (depending on the architecture),
even if both (VM_GROWSUP and VM_GROWSDOWN) are set.
So it shouldn't be a problem to have both directions set.

>> Another option is to introduce an alias, e.g.:
>> #define MAP_STACK_EXPANDABLE	MAP_GROWSDOWN
>> and then
>
> I don't like either of these options.
>
> I guess you could also detect the MAP_STACK and MAP_GROWSDOWN and change
> the flags for parisc, which I also don't like, but since parisc is the
> only arch using this it's hard to justify a change that may cause issues
> in other archs.

Well, I still don't actually see real issues with my proposal.
That's why I proposed to change MAP_STACK generally.
Just adapting it for parisc was my initial approach which I sent
to the parisc mailing list prior to this patch:
https://patchwork.kernel.org/project/linux-parisc/patch/Zt3yJUiczUBmEC3a@p100/

I'd like to wait for some possible further feedback.
In the end I might end up just changing it for parisc.

> A quick grep shows we set VM_STACK_DEFAULT_FLAGS in x86 and powerpc,
> which could affect what happens here with your change.  I am concerned
> about the bleeding of other flags through this change.

Shouldn't be a problem. The VM_STACK in VM_STACK_FLAGS is already VM_GROWSUP
or VM_GROWSDOWN.
include/linux/mm.h:#define VM_STACK_FLAGS       (VM_STACK | VM_STACK_DEFAULT_FLAGS | VM_ACCOUNT)

Helge
Matthew Wilcox (Oracle) Sept. 12, 2024, 6:12 p.m. UTC | #14
On Wed, Sep 11, 2024 at 09:32:29PM -0400, Liam R. Howlett wrote:
> * Helge Deller <deller@gmx.de> [240911 20:51]:
> > On 9/12/24 01:05, Liam R. Howlett wrote:
> > > If someone is using MAP_STACK to avoid having a huge page, they will
> > > also get a mapping that grows - which is different than what happens
> > > today.
> > > 
> > > I'm not saying that's right, but someone could be abusing the existing
> > > flag and this will change the behaviour.
> > 
> > Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
> > Why abusing MAP_STACK for that?
> 
> I can think of two answers:
> 1. An error that has worked without issues so far
> 2. One less system call
> 
> I'm not saying this really is a blocker, but the change is not without
> risk as it does change behaviour the user could see.
> 
> Interestingly enough, the man page is incorrect as it is written because
> the flag is not strictly a no-op; it ensures no huge pages.  So the
> feature of applying VM_NOHUGEPAGE with the use of MAP_STACK is not
> documented today.

It's a recent change and I don't think it's something we necessarily
want to document.  It was c4608d1bf7c6 which was December 2023.
Helge Deller Sept. 15, 2024, 8:04 p.m. UTC | #15
On 9/12/24 19:37, Helge Deller wrote:
> On 9/12/24 17:43, Liam R. Howlett wrote:
>> * Helge Deller <deller@gmx.de> [240911 22:10]:
>>> On 9/12/24 03:32, Liam R. Howlett wrote:
>>>> * Helge Deller <deller@gmx.de> [240911 20:51]:
>>>>> On 9/12/24 01:05, Liam R. Howlett wrote:
>>>>>> * Yang Shi <shy828301@gmail.com> [240911 18:16]:
>>>>>>> On Wed, Sep 11, 2024 at 12:49 PM Liam R. Howlett
>>>>>>> <Liam.Howlett@oracle.com> wrote:
>>>>>>>>
>>>>>>>> * Helge Deller <deller@kernel.org> [240911 15:20]:
>>>>>>>>> This is a RFC to change the behaviour of mmap(MAP_STACK) to be
>>>>>>>>> sufficient to map memory for usage as stack on all architectures.
>>>>>>>>> Currently MAP_STACK is a no-op on Linux, and instead MAP_GROWSDOWN
>>>>>>>>> has to be used.
>>>>>>>>> To clarify, here is the relevant info from the mmap() man page:
>>>>>>>>>
>>>>>>>>> MAP_GROWSDOWN
>>>>>>>>>       This flag is used for stacks. It indicates to the kernel virtual
>>>>>>>>>       memory system that the mapping should extend downward in memory.  The
>>>>>>>>>       return address is one page lower than the memory area that is
>>>>>>>>>       actually created in the process's virtual address space.  Touching an
>>>>>>>>>       address in the "guard" page below the mapping will cause the mapping
>>>>>>>>>       to grow by a page. This growth can be repeated until the mapping
>>>>>>>>>       grows to within a page of the high end of the next lower mapping,
>>>>>>>>>       at which point touching the "guard" page will result in a SIGSEGV
>>>>>>>>>       signal.
>>>>>>>>>
>>>>>>>>> MAP_STACK (since Linux 2.6.27)
>>>>>>>>>       Allocate the mapping at an address suitable for a process or thread
>>>>>>>>>       stack.
>>>>>>>>>
>>>>>>>>>       This flag is currently a no-op on Linux. However, by employing this
>>>>>>>>>       flag, applications can ensure that they transparently obtain support
>>>>>>>>>       if the flag is implemented in the future. Thus, it is used in the
>>>>>>>>>       glibc threading implementation to allow for the fact that
>>>>>>>>>       some architectures may (later) require special treatment for
>>>>>>>>>       stack allocations. A further reason to employ this flag is
>>>>>>>>>       portability: MAP_STACK exists (and has an effect) on some
>>>>>>>>>       other systems (e.g., some of the BSDs).
>>>>>>>>>
>>>>>>>>> The reason to suggest this change is, that on the parisc architecture the
>>>>>>>>> stack grows upwards. As such, using solely the MAP_GROWSDOWN flag will not
>>>>>>>>> work. Note that there exists no MAP_GROWSUP flag.
>>>>>>>>> By changing the behaviour of MAP_STACK to mark the memory area with the
>>>>>>>>> VM_STACK bit (which is VM_GROWSUP or VM_GROWSDOWN depending on the
>>>>>>>>> architecture) the MAP_STACK flag does exactly what people would expect on
>>>>>>>>> all platforms.
>>>>>>>>>
>>>>>>>>> This change should have no negative side-effect, as all code which
>>>>>>>>> used mmap(MAP_GROWSDOWN | MAP_STACK) still work as before.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Helge Deller <deller@gmx.de>
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/mman.h b/include/linux/mman.h
>>>>>>>>> index bcb201ab7a41..66bc72a0cb19 100644
>>>>>>>>> --- a/include/linux/mman.h
>>>>>>>>> +++ b/include/linux/mman.h
>>>>>>>>> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
>>>>>>>>>          return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>>>>>>>>                 _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>>>>>>>>                 _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
>>>>>>>>> +            _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
>>>>>>>>
>>>>>>>> Right now MAP_STACK can be used to set VM_NOHUGEPAGE, but this will
>>>>>>>> change the user interface to create a vma that will grow.  I'm not
>>>>>>>> entirely sure this is okay?
>>>>>>>
>>>>>>> AFAICT, I don't see this is a problem. Currently huge page also skips
>>>>>>> the VMAs with VM_GROWS* flags set. See vma_is_temporary_stack().
>>>>>>> __thp_vma_allowable_orders() returns 0 if the vma is a temporary
>>>>>>> stack.
>>>>>>
>>>>>> If someone is using MAP_STACK to avoid having a huge page, they will
>>>>>> also get a mapping that grows - which is different than what happens
>>>>>> today.
>>>>>>
>>>>>> I'm not saying that's right, but someone could be abusing the existing
>>>>>> flag and this will change the behaviour.
>>>>>
>>>>> Wouldn't a plain mmap() followed by madvise(MADV_NOHUGEPAGE) do exactly that?
>>>>> Why abusing MAP_STACK for that?
>>>>
>>>> I can think of two answers:
>>>> 1. An error that has worked without issues so far
>>>> 2. One less system call
>>>>
>>>> I'm not saying this really is a blocker, but the change is not without
>>>> risk as it does change behaviour the user could see.
>>>>
>>>> Interestingly enough, the man page is incorrect as it is written because
>>>> the flag is not strictly a no-op; it ensures no huge pages.  So the
>>>> feature of applying VM_NOHUGEPAGE with the use of MAP_STACK is not
>>>> documented today.
>>>
>>> Yes.
>>>
>>>> What happens to call that use the mmap(MAP_GROWSDOWN | MAP_STACK) on
>>>> parisc today?
>>>
>>> Today, without my patch, on parisc the area is then flagged VM_GROWSDOWN only.
>>> In that case, stack expansion will not work, as it checks for VM_GROWSUP.
>>>
>>>> How does this change with your VM_STACK change?  Wouldn't this result
>>>> in failed mappings?
>>>> VM_GROWSDOWN | VM_GROWSUP would fail in do_mmap(), and these would be> set if you map MAP_STACK to VM_STACK which is defined as VM_GROWSUP?
>>>
>>> Right, with my change, the area will be tagged VM_GROWSUP and VM_GROWSDOWN.
>>> Due to VM_GROWSUP stack expansion works.
>>> The mapping doesn't fail in do_mmap(), because stacks are not file-mapped
>>> or shared or droppable. They should be mapped with MAP_PRIVATE, right?
>>
>> Oh my, yes.  So now you will have a stack that can expand in either
>> direction, but it's okay because one direction isn't checked.  I sure
>> hope the rest of the code is correctly #ifdef'ed for this.
>
> Only one direction will be handled (depending on the architecture),
> even if both (VM_GROWSUP and VM_GROWSDOWN) are set.
> So it shouldn't be a problem to have both directions set.
>
>>> Another option is to introduce an alias, e.g.:
>>> #define MAP_STACK_EXPANDABLE    MAP_GROWSDOWN
>>> and then
>>
>> I don't like either of these options.
>>
>> I guess you could also detect the MAP_STACK and MAP_GROWSDOWN and change
>> the flags for parisc, which I also don't like, but since parisc is the
>> only arch using this it's hard to justify a change that may cause issues
>> in other archs.
>
> Well, I still don't actually see real issues with my proposal.
> That's why I proposed to change MAP_STACK generally.
> Just adapting it for parisc was my initial approach which I sent
> to the parisc mailing list prior to this patch:
> https://patchwork.kernel.org/project/linux-parisc/patch/Zt3yJUiczUBmEC3a@p100/
>
> I'd like to wait for some possible further feedback.
> In the end I might end up just changing it for parisc.

No further input, so I will apply a patch which changes the default behavior
on parisc only. Other architectures will stay unchanged.

Thanks!
Helge
diff mbox series

Patch

diff --git a/include/linux/mman.h b/include/linux/mman.h
index bcb201ab7a41..66bc72a0cb19 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -156,6 +156,7 @@  calc_vm_flag_bits(unsigned long flags)
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
 	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
+	       _calc_vm_trans(flags, MAP_STACK,      VM_STACK     ) |
 	       _calc_vm_trans(flags, MAP_STACK,	     VM_NOHUGEPAGE) |
 	       arch_calc_vm_flag_bits(flags);
 }