[RFC,00/14] Dynamic Kernel Stacks

Message ID 20240311164638.2015063-1-pasha.tatashin@soleen.com (mailing list archive)

Message

Pasha Tatashin March 11, 2024, 4:46 p.m. UTC
This is a follow-up to the LSF/MM proposal [1]. Please provide your
thoughts and comments about the dynamic kernel stacks feature. This is
a WIP and has not been tested beyond booting on some machines and
running LKDTM thread exhaust tests. The series also lacks selftests
and documentation.

This feature allows the kernel stack to grow dynamically, from 4KiB up
to THREAD_SIZE. The intent is to save memory on fleet machines. Initial
experiments show that it saves on average 70-75% of the kernel stack
memory.

The average depth of a kernel thread depends on the workload, profiling,
virtualization, compiler optimizations, and driver implementations.
However, the table below shows the amount of kernel stack memory before
vs. after on idling freshly booted machines:

CPU           #Cores #Stacks  BASE(kb) Dynamic(kb)   Saving
AMD Genoa        384    5786    92576       23388    74.74%
Intel Skylake    112    3182    50912       12860    74.74%
AMD Rome         128    3401    54416       14784    72.83%
AMD Rome         256    4908    78528       20876    73.42%
Intel Haswell     72    2644    42304       10624    74.89%

Some workloads that have millions of threads can benefit significantly
from this feature.

[1] https://lore.kernel.org/all/CA+CK2bBYt9RAVqASB2eLyRQxYT5aiL0fGhUu3TumQCyJCNTWvw@mail.gmail.com

Pasha Tatashin (14):
  task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END check
  fork: Clean-up ifdef logic around stack allocation
  fork: Clean-up naming of vm_strack/vm_struct variables in vmap stacks
    code
  fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
  fork: check charging success before zeroing stack
  fork: zero vmap stack using clear_page() instead of memset()
  fork: use the first page in stack to store vm_stack in cached_stacks
  fork: separate vmap stack alloction and free calls
  mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range_noflush()
    public functions
  fork: Dynamic Kernel Stacks
  x86: add support for Dynamic Kernel Stacks
  task_stack.h: Clean-up stack_not_used() implementation
  task_stack.h: Add stack_not_used() support for dynamic stack
  fork: Dynamic Kernel Stack accounting

 arch/Kconfig                     |  33 +++
 arch/x86/Kconfig                 |   1 +
 arch/x86/kernel/traps.c          |   3 +
 arch/x86/mm/fault.c              |   3 +
 include/linux/mmzone.h           |   3 +
 include/linux/sched.h            |   2 +-
 include/linux/sched/task_stack.h |  94 ++++++--
 include/linux/vmalloc.h          |  15 ++
 kernel/fork.c                    | 388 ++++++++++++++++++++++++++-----
 kernel/sched/core.c              |   1 +
 mm/internal.h                    |   9 -
 mm/vmalloc.c                     |  24 ++
 mm/vmstat.c                      |   3 +
 13 files changed, 487 insertions(+), 92 deletions(-)

Comments

Mateusz Guzik March 11, 2024, 5:09 p.m. UTC | #1
On 3/11/24, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> This is follow-up to the LSF/MM proposal [1]. Please provide your
> thoughts and comments about dynamic kernel stacks feature. This is a WIP
> has not been tested beside booting on some machines, and running LKDTM
> thread exhaust tests. The series also lacks selftests, and
> documentations.
>
> This feature allows to grow kernel stack dynamically, from 4KiB and up
> to the THREAD_SIZE. The intend is to save memory on fleet machines. From
> the initial experiments it shows to save on average 70-75% of the kernel
> stack memory.
>

Can you please elaborate how this works? I have trouble figuring it
out from a cursory reading of the patchset and commit messages; that
aside, I would argue this should have been explained in the cover
letter.

For example, say a thread takes a bunch of random locks (most notably
spinlocks) and/or disables preemption, then pushes some stuff onto the
stack which now faults. That is to say the fault can happen in rather
arbitrary context.

If any of the conditions described below are prevented in the first
place it really needs to be described how.

That said, from top of my head:
1. what about faults when the thread holds a bunch of arbitrary locks
or has preemption disabled? is the allocation lockless?
2. what happens if there is no memory from which to map extra pages in
the first place? you may be in position where you can't go off cpu
Pasha Tatashin March 11, 2024, 6:58 p.m. UTC | #2
On Mon, Mar 11, 2024 at 1:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On 3/11/24, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > This is follow-up to the LSF/MM proposal [1]. Please provide your
> > thoughts and comments about dynamic kernel stacks feature. This is a WIP
> > has not been tested beside booting on some machines, and running LKDTM
> > thread exhaust tests. The series also lacks selftests, and
> > documentations.
> >
> > This feature allows to grow kernel stack dynamically, from 4KiB and up
> > to the THREAD_SIZE. The intend is to save memory on fleet machines. From
> > the initial experiments it shows to save on average 70-75% of the kernel
> > stack memory.
> >
>

Hi Mateusz,

> Can you please elaborate how this works? I have trouble figuring it
> out from cursory reading of the patchset and commit messages, that
> aside I would argue this should have been explained in the cover
> letter.

Sure, I answered your questions below.

> For example, say a thread takes a bunch of random locks (most notably
> spinlocks) and/or disables preemption, then pushes some stuff onto the
> stack which now faults. That is to say the fault can happen in rather
> arbitrary context.
>
> If any of the conditions described below are prevented in the first
> place it really needs to be described how.
>
> That said, from top of my head:
> 1. what about faults when the thread holds a bunch of arbitrary locks
> or has preemption disabled? is the allocation lockless?

Each thread has a stack with 4 pages.
Pre-allocated page: This page is always allocated and mapped at thread creation.
Dynamic pages (3): These pages are mapped dynamically upon stack faults.

A per-CPU data structure holds 3 dynamic pages for each CPU. These
pages are used to handle stack faults occurring when a running thread
faults (even within interrupt-disabled contexts). Typically, only one
page is needed, but in the rare case where the thread accesses beyond
that, we might use up to all three pages in a single fault. This
structure allows stack faults to be handled atomically, without
conflicts with other processes. Additionally, the thread's 16K-aligned
virtual address (VA) and guaranteed pre-allocated page mean that no
page table allocation is required during the fault.

When a thread leaves the CPU in normal kernel mode, we check a flag to
see if it has experienced stack faults. If so, we charge the thread
for the new stack pages and refill the per-CPU data structure with any
missing pages.
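
Roughly, I picture the per-CPU part like the sketch below. This is
illustrative only: the structure and function names are invented here
and do not necessarily match what the patchset does.

#include <linux/gfp.h>
#include <linux/percpu.h>
#include <linux/sched.h>

#define DYN_STACK_PAGES	3	/* dynamic pages per stack in this WIP split */

struct dyn_stack_cache {
	struct page *pages[DYN_STACK_PAGES];	/* kept ready for stack faults */
};
static DEFINE_PER_CPU(struct dyn_stack_cache, dyn_stack_cache);

/*
 * Fault path: may run with IRQs disabled.  Map pages from the per-CPU
 * cache into the faulting thread's vmapped stack.  No allocation (and
 * no page-table allocation) happens here, because the stack VA is
 * 16K-aligned and one page is always pre-mapped.
 */
static void dyn_stack_handle_fault(struct task_struct *tsk,
				   unsigned long fault_addr);

/*
 * Off-CPU path: if the outgoing thread faulted, charge it for the
 * consumed pages and refill the per-CPU cache.
 */
static void dyn_stack_refill(void)
{
	struct dyn_stack_cache *c = this_cpu_ptr(&dyn_stack_cache);
	int i;

	for (i = 0; i < DYN_STACK_PAGES; i++)
		if (!c->pages[i])
			/* gfp flags depend on where exactly this is called */
			c->pages[i] = alloc_page(GFP_ATOMIC);
}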

> 2. what happens if there is no memory from which to map extra pages in
> the first place? you may be in position where you can't go off cpu

When the per-CPU data structure cannot be refilled and a thread
faults, we issue a message indicating a critical stack fault. This
triggers a system-wide panic, similar to a guard-page access violation.

Pasha
Mateusz Guzik March 11, 2024, 7:21 p.m. UTC | #3
On 3/11/24, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> On Mon, Mar 11, 2024 at 1:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> 1. what about faults when the thread holds a bunch of arbitrary locks
>> or has preemption disabled? is the allocation lockless?
>
> Each thread has a stack with 4 pages.
> Pre-allocated page: This page is always allocated and mapped at thread
> creation.
> Dynamic pages (3): These pages are mapped dynamically upon stack faults.
>
> A per-CPU data structure holds 3 dynamic pages for each CPU. These
> pages are used to handle stack faults occurring when a running thread
> faults (even within interrupt-disabled contexts). Typically, only one
> page is needed, but in the rare case where the thread accesses beyond
> that, we might use up to all three pages in a single fault. This
> structure allows for atomic handling of stack faults, preventing
> conflicts from other processes. Additionally, the thread's 16K-aligned
> virtual address (VA) and guaranteed pre-allocated page means no page
> table allocation is required during the fault.
>
> When a thread leaves the CPU in normal kernel mode, we check a flag to
> see if it has experienced stack faults. If so, we charge the thread
> for the new stack pages and refill the per-CPU data structure with any
> missing pages.
>

So this also has to happen if the thread holds a bunch of arbitrary
semaphores and goes off cpu with them? Anyhow, see below.

>> 2. what happens if there is no memory from which to map extra pages in
>> the first place? you may be in position where you can't go off cpu
>
> When the per-CPU data structure cannot be refilled, and a new thread
> faults, we issue a message indicating a critical stack fault. This
> triggers a system-wide panic similar to a guard page access violation
>

OOM handling is fundamentally what I was worried about. I'm confident
this failure mode makes the feature unsuitable for general-purpose
deployments.

Now, I have no vote here, it may be this is perfectly fine as an
optional feature, which it is in your patchset. However, if this is to
go in, the option description definitely needs a big fat warning about
possible panics if enabled.

I fully agree something(tm) should be done about stacks, and the
current usage is a massive bummer. I wonder if things would be ok if
they shrank to just 12K? Perhaps that would provide a big enough
saving (of course smaller than the one you are getting now), while
avoiding any of the above.

All that said, it's not my call what to do here. Thank you for the explanation.
Pasha Tatashin March 11, 2024, 7:55 p.m. UTC | #4
On Mon, Mar 11, 2024 at 3:21 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On 3/11/24, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > On Mon, Mar 11, 2024 at 1:09 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> 1. what about faults when the thread holds a bunch of arbitrary locks
> >> or has preemption disabled? is the allocation lockless?
> >
> > Each thread has a stack with 4 pages.
> > Pre-allocated page: This page is always allocated and mapped at thread
> > creation.
> > Dynamic pages (3): These pages are mapped dynamically upon stack faults.
> >
> > A per-CPU data structure holds 3 dynamic pages for each CPU. These
> > pages are used to handle stack faults occurring when a running thread
> > faults (even within interrupt-disabled contexts). Typically, only one
> > page is needed, but in the rare case where the thread accesses beyond
> > that, we might use up to all three pages in a single fault. This
> > structure allows for atomic handling of stack faults, preventing
> > conflicts from other processes. Additionally, the thread's 16K-aligned
> > virtual address (VA) and guaranteed pre-allocated page means no page
> > table allocation is required during the fault.
> >
> > When a thread leaves the CPU in normal kernel mode, we check a flag to
> > see if it has experienced stack faults. If so, we charge the thread
> > for the new stack pages and refill the per-CPU data structure with any
> > missing pages.
> >
>
> So this also has to happen if the thread holds a bunch of arbitrary
> semaphores and goes off cpu with them? Anyhow, see below.

Yes, this is all right; if a thread is allowed to sleep, it should not
hold any alloc_pages() locks.

> >> 2. what happens if there is no memory from which to map extra pages in
> >> the first place? you may be in position where you can't go off cpu
> >
> > When the per-CPU data structure cannot be refilled, and a new thread
> > faults, we issue a message indicating a critical stack fault. This
> > triggers a system-wide panic similar to a guard page access violation
> >
>
> OOM handling is fundamentally what I was worried about. I'm confident
> this failure mode makes the feature unsuitable for general-purpose
> deployments.

The primary goal of this series is to enhance system safety, not
introduce additional risks. Memory saving is a welcome side effect.
Please see below for explanations.

>
> Now, I have no vote here, it may be this is perfectly fine as an
> optional feature, which it is in your patchset. However, if this is to
> go in, the option description definitely needs a big fat warning about
> possible panics if enabled.
>
> I fully agree something(tm) should be done about stacks and the
> current usage is a massive bummer. I wonder if things would be ok if
> they shrinked to just 12K? Perhaps that would provide big enough


The current setting of 1 pre-allocated page and 3 dynamic pages is
just WIP; we can very well change it to 2 pre-allocated and 2 dynamic
pages, 3/1, etc.

At Google, we still utilize 8K stacks (we did not increase them to 16K
when upstream did in 2014) and are only now encountering extreme cases
where the 8K limit is reached. Consequently, we plan to increase the
limit to 16K. Dynamic kernel stacks allow us to maintain an 8K
pre-allocated stack while handling page faults only in exceptionally
rare circumstances.

Another example is to increase THREAD_SIZE to 32K and keep 16K
pre-allocated. This is the same as what upstream has today, but it
avoids panics on guard-page accesses, thus making systems safer for
everyone.

Pasha
H. Peter Anvin March 12, 2024, 5:18 p.m. UTC | #5
On 3/11/24 09:46, Pasha Tatashin wrote:
> This is follow-up to the LSF/MM proposal [1]. Please provide your
> thoughts and comments about dynamic kernel stacks feature. This is a WIP
> has not been tested beside booting on some machines, and running LKDTM
> thread exhaust tests. The series also lacks selftests, and
> documentations.
> 
> This feature allows to grow kernel stack dynamically, from 4KiB and up
> to the THREAD_SIZE. The intend is to save memory on fleet machines. From
> the initial experiments it shows to save on average 70-75% of the kernel
> stack memory.
> 
> The average depth of a kernel thread depends on the workload, profiling,
> virtualization, compiler optimizations, and driver implementations.
> However, the table below shows the amount of kernel stack memory before
> vs. after on idling freshly booted machines:
> 
> CPU           #Cores #Stacks  BASE(kb) Dynamic(kb)   Saving
> AMD Genoa        384    5786    92576       23388    74.74%
> Intel Skylake    112    3182    50912       12860    74.74%
> AMD Rome         128    3401    54416       14784    72.83%
> AMD Rome         256    4908    78528       20876    73.42%
> Intel Haswell     72    2644    42304       10624    74.89%
> 
> Some workloads with that have millions of threads would can benefit
> significantly from this feature.
> 

Ok, first of all, talking about "kernel memory" here is misleading. 
Unless your threads are spending nearly all their time sleeping, the 
threads will occupy stack and TLS memory in user space as well.

Second, non-dynamic kernel memory is one of the core design decisions
in Linux from early on. This means there are a lot of deeply embedded
assumptions which would have to be untangled.

Linus would, of course, be the real authority on this, but if someone 
would ask me what the fundamental design philosophies of the Linux 
kernel are -- the design decisions which make Linux Linux, if you will 
-- I would say:

	1. Non-dynamic kernel memory
	2. Permanent mapping of physical memory
	3. Kernel API modeled closely after the POSIX API
	   (no complicated user space layers)
	4. Fast system call entry/exit (a necessity for a
	   kernel API based on simple system calls)
	5. Monolithic (but modular) kernel environment
	   (not cross-privilege, coroutine or message passing)

Third, *IF* this is something that should be done (and I personally 
strongly suspect it should not), at least on x86-64 it probably should 
be for FRED hardware only. With FRED, it is possible to set the #PF 
event stack level to 1, which will cause an automatic stack switch for 
#PF in kernel space (only). However, even in kernel space, #PF can sleep 
if it references a user space page, in which case it would have to be 
demoted back onto the ring 0 stack (there are multiple ways of doing 
that, but it does entail an overhead.)

	-hpa
Pasha Tatashin March 12, 2024, 7:45 p.m. UTC | #6
On Tue, Mar 12, 2024 at 1:19 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
>
>
> On 3/11/24 09:46, Pasha Tatashin wrote:
> > This is follow-up to the LSF/MM proposal [1]. Please provide your
> > thoughts and comments about dynamic kernel stacks feature. This is a WIP
> > has not been tested beside booting on some machines, and running LKDTM
> > thread exhaust tests. The series also lacks selftests, and
> > documentations.
> >
> > This feature allows to grow kernel stack dynamically, from 4KiB and up
> > to the THREAD_SIZE. The intend is to save memory on fleet machines. From
> > the initial experiments it shows to save on average 70-75% of the kernel
> > stack memory.
> >
> > The average depth of a kernel thread depends on the workload, profiling,
> > virtualization, compiler optimizations, and driver implementations.
> > However, the table below shows the amount of kernel stack memory before
> > vs. after on idling freshly booted machines:
> >
> > CPU           #Cores #Stacks  BASE(kb) Dynamic(kb)   Saving
> > AMD Genoa        384    5786    92576       23388    74.74%
> > Intel Skylake    112    3182    50912       12860    74.74%
> > AMD Rome         128    3401    54416       14784    72.83%
> > AMD Rome         256    4908    78528       20876    73.42%
> > Intel Haswell     72    2644    42304       10624    74.89%
> >
> > Some workloads with that have millions of threads would can benefit
> > significantly from this feature.
> >
>
> Ok, first of all, talking about "kernel memory" here is misleading.

Hi Peter,

I re-read my cover letter, and I do not see where "kernel memory" is
mentioned. We are talking about kernel stack overhead that is
proportional to the user workload, as every active thread has an
associated kernel stack. The idea is to save memory by not
pre-allocating all pages of kernel stacks, but instead to use them as
a safeguard for when a stack actually becomes deep; that is, to come
up with a solution that handles the rare deeper stacks only when
needed. This could be done through faulting on the supported hardware
(as proposed in this series), or via pre-mapping on every schedule
event and checking the access when the thread goes off CPU (as
proposed by Andy Lutomirski to avoid double faults on x86).

In other words, this feature is only about one very specific type of
kernel memory that is not even directly mapped (the feature requires
vmapped stacks).

> Unless your threads are spending nearly all their time sleeping, the
> threads will occupy stack and TLS memory in user space as well.

Can you please elaborate: what data is contained in the kernel stack
when a thread is in user space? My series requires that thread_info
not be in the stack, by depending on THREAD_INFO_IN_TASK.

> Second, non-dynamic kernel memory is one of the core design decisions in
> Linux from early on. This means there are lot of deeply embedded
> assumptions which would have to be untangled.
>
> Linus would, of course, be the real authority on this, but if someone
> would ask me what the fundamental design philosophies of the Linux
> kernel are -- the design decisions which make Linux Linux, if you will
> -- I would say:
>
>         1. Non-dynamic kernel memory
>         2. Permanent mapping of physical memory

One and two are correlated. Given that all memory is directly mapped,
the kernel core cannot be relocatable, swappable, faultable, etc.

>         3. Kernel API modeled closely after the POSIX API
>            (no complicated user space layers)
>         4. Fast system call entry/exit (a necessity for a
>            kernel API based on simple system calls)
>         5. Monolithic (but modular) kernel environment
>            (not cross-privilege, coroutine or message passing)
>
> Third, *IF* this is something that should be done (and I personally
> strongly suspect it should not), at least on x86-64 it probably should
> be for FRED hardware only. With FRED, it is possible to set the #PF
> event stack level to 1, which will cause an automatic stack switch for
> #PF in kernel space (only). However, even in kernel space, #PF can sleep
> if it references a user space page, in which case it would have to be
> demoted back onto the ring 0 stack (there are multiple ways of doing
> that, but it does entail an overhead.)

My understanding is that with the proposed approach, only double
faults cannot be used. Pre-map/check-access could still work, even
though it would add some cost to context switching.

Thank you,
Pasha
H. Peter Anvin March 12, 2024, 9:36 p.m. UTC | #7
On 3/12/24 12:45, Pasha Tatashin wrote:
>>
>> Ok, first of all, talking about "kernel memory" here is misleading.
> 
> Hi Peter,
> 
> I re-read my cover letter, and I do not see where "kernel memory" is
> mentioned. We are talking about kernel stacks overhead that is
> proportional to the user workload, as every active thread has an
> associated kernel stack. The idea is to save memory by not
> pre-allocating all pages of kernel-stacks, but instead use it as a
> safeguard when a stack actually becomes deep. Come-up with a solution
> that can handle rare deeper stacks only when needed. This could be
> done through faulting on the supported hardware (as proposed in this
> series), or via pre-map on every schedule event, and checking the
> access when thread goes off cpu (as proposed by Andy Lutomirski to
> avoid double faults on x86) .
> 
> In other words, this feature is only about one very specific type of
> kernel memory that is not even directly mapped (the feature required
> vmapped stacks).
> 
>> Unless your threads are spending nearly all their time sleeping, the
>> threads will occupy stack and TLS memory in user space as well.
> 
> Can you please elaborate, what data is contained in the kernel stack
> when thread is in user space? My series requires thread_info not to be
> in the stack by depending on THREAD_INFO_IN_TASK.
> 

My point is that what matters is total memory use, not just memory used 
in the kernel. Amdahl's law.

	-hpa
David Laight March 12, 2024, 10:18 p.m. UTC | #8
...
> I re-read my cover letter, and I do not see where "kernel memory" is
> mentioned. We are talking about kernel stacks overhead that is
> proportional to the user workload, as every active thread has an
> associated kernel stack. The idea is to save memory by not
> pre-allocating all pages of kernel-stacks, but instead use it as a
> safeguard when a stack actually becomes deep. Come-up with a solution
> that can handle rare deeper stacks only when needed. This could be
> done through faulting on the supported hardware (as proposed in this
> series), or via pre-map on every schedule event, and checking the
> access when thread goes off cpu (as proposed by Andy Lutomirski to
> avoid double faults on x86) .
> 
> In other words, this feature is only about one very specific type of
> kernel memory that is not even directly mapped (the feature required
> vmapped stacks).

Just for interest, how big does the register save area get?
In the 'good old days' it could be allocated from the low end of the
stack memory. But AVX512 starts making it large - never mind some
other things that (IIRC) might get to 8k.
Even the task area is probably non-trivial since far fewer things
can be shared than one might hope.

I'm sure I remember someone contemplating not allocating stacks to
each thread. I think that requires waking up with a system call
restart for some system calls - plausibly possible for futex() and poll().

Another option is to do a proper static analysis of stack usage, fix
the paths that have deep stacks, and remove all recursion.
I'm pretty sure objtool knows the stack offsets of every call instruction.
The indirect call hashes (FineIBT?) should allow indirect calls to be
handled as well as direct calls.
Processing the 'A calls B at offset n' data to generate a max depth
is just a SMOP.
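
As a toy illustration of that computation (invented data structures;
this is offline tooling, not kernel code, and it assumes the resolved
call graph is acyclic):

struct func {
	unsigned int	frame_size;	/* stack bytes used by this frame   */
	unsigned int	max_depth;	/* memoized result, 0 = unknown     */
	unsigned int	ncallees;
	struct func	**callees;	/* direct + resolved indirect calls */
};

static unsigned int max_stack_depth(struct func *f)
{
	unsigned int i, worst = 0;

	if (f->max_depth)
		return f->max_depth;
	for (i = 0; i < f->ncallees; i++) {
		unsigned int d = max_stack_depth(f->callees[i]);

		if (d > worst)
			worst = d;
	}
	f->max_depth = f->frame_size + worst;
	return f->max_depth;
}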

At the moment I think all 'void (*)(void *)' functions have the same hash?
So the compiler would need a function attribute to seed the hash.

With that you might be able to remove all the code paths that actually
use a lot of stack - instead of just guessing and limiting individual
stack frames.

My 'gut feel' from calculating the stack use that way for an embedded
system back in the early 1980s is that the max use will be inside
printk() inside an obscure error path and if you actually hit it
things will explode.
(We didn't have enough memory to allocate big enough stacks!)

	David

Kent Overstreet March 14, 2024, 7:05 p.m. UTC | #9
On Tue, Mar 12, 2024 at 02:36:27PM -0700, H. Peter Anvin wrote:
> On 3/12/24 12:45, Pasha Tatashin wrote:
> > > 
> > > Ok, first of all, talking about "kernel memory" here is misleading.
> > 
> > Hi Peter,
> > 
> > I re-read my cover letter, and I do not see where "kernel memory" is
> > mentioned. We are talking about kernel stacks overhead that is
> > proportional to the user workload, as every active thread has an
> > associated kernel stack. The idea is to save memory by not
> > pre-allocating all pages of kernel-stacks, but instead use it as a
> > safeguard when a stack actually becomes deep. Come-up with a solution
> > that can handle rare deeper stacks only when needed. This could be
> > done through faulting on the supported hardware (as proposed in this
> > series), or via pre-map on every schedule event, and checking the
> > access when thread goes off cpu (as proposed by Andy Lutomirski to
> > avoid double faults on x86) .
> > 
> > In other words, this feature is only about one very specific type of
> > kernel memory that is not even directly mapped (the feature required
> > vmapped stacks).
> > 
> > > Unless your threads are spending nearly all their time sleeping, the
> > > threads will occupy stack and TLS memory in user space as well.
> > 
> > Can you please elaborate, what data is contained in the kernel stack
> > when thread is in user space? My series requires thread_info not to be
> > in the stack by depending on THREAD_INFO_IN_TASK.
> > 
> 
> My point is that what matters is total memory use, not just memory used in
> the kernel. Amdahl's law.

If userspace is running a few processes with many threads and the
userspace stacks are small, kernel stacks could end up dominating.

I'd like to see some numbers though.
Pasha Tatashin March 14, 2024, 7:23 p.m. UTC | #10
> >
> > My point is that what matters is total memory use, not just memory used in
> > the kernel. Amdahl's law.
>
> If userspace is running a few processes with many threads and the
> userspace stacks are small, kernel stacks could end up dominating.
>
> I'd like to see some numbers though.

The unused kernel stack pages occupy petabytes of memory across the fleet [1].

I also submitted a patch [2] that can help visualize the maximum stack
page access distribution.

[1] https://lore.kernel.org/all/CA+CK2bBYt9RAVqASB2eLyRQxYT5aiL0fGhUu3TumQCyJCNTWvw@mail.gmail.com
[2] https://lore.kernel.org/all/20240314145457.1106299-1-pasha.tatashin@soleen.com
Kent Overstreet March 14, 2024, 7:28 p.m. UTC | #11
On Thu, Mar 14, 2024 at 03:23:08PM -0400, Pasha Tatashin wrote:
> > >
> > > My point is that what matters is total memory use, not just memory used in
> > > the kernel. Amdahl's law.
> >
> > If userspace is running a few processes with many threads and the
> > userspace stacks are small, kernel stacks could end up dominating.
> >
> > I'd like to see some numbers though.
> 
> The unused kernel stack pages occupy petabytes of memory across the fleet [1].

Raw number doesn't mean much here (I know how many machines Google has,
of course it's going to be petabytes ;), percentage of system memory
would be better.

What I'd _really_ like to see is raw output from memory allocation
profiling, so we can see how much memory is going to kernel stacks vs.
other kernel allocations.

Number of kernel threads vs. number of user threads would also be good
to know - I've been seeing ps output lately where we've got a lot more
workqueue workers than we should, perhaps that's something that could be
addressed.
Pasha Tatashin March 14, 2024, 7:34 p.m. UTC | #12
On Thu, Mar 14, 2024 at 3:29 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, Mar 14, 2024 at 03:23:08PM -0400, Pasha Tatashin wrote:
> > > >
> > > > My point is that what matters is total memory use, not just memory used in
> > > > the kernel. Amdahl's law.
> > >
> > > If userspace is running a few processes with many threads and the
> > > userspace stacks are small, kernel stacks could end up dominating.
> > >
> > > I'd like to see some numbers though.
> >
> > The unused kernel stack pages occupy petabytes of memory across the fleet [1].
>
> Raw number doesn't mean much here (I know how many machines Google has,
> of course it's going to be petabytes ;), percentage of system memory
> would be better.
>
> What I'd _really_ like to see is raw output from memory allocation
> profiling, so we can see how much memory is going to kernel stacks vs.
> other kernel allocations.

I've heard there is memory allocation profiling work that can help with that...

While I do not have the data you are asking for, data on the other
kernel allocations might be useful. However, this particular project
is targeted at reducing overhead where the memory is not used, or is
used only in extremely rare cases.

> Number of kernel threads vs. number of user threads would also be good
> to know - I've been seeing ps output lately where we've got a lot more
> workqueue workers than we should, perhaps that's something that could be
> addressed.

Yes, doing other optimizations makes sense; reducing the total number
of kernel threads, where possible, might help as well. I will also
look into how many user threads vs. kernel threads we have.
Matthew Wilcox March 14, 2024, 7:43 p.m. UTC | #13
On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> Second, non-dynamic kernel memory is one of the core design decisions in
> Linux from early on. This means there are lot of deeply embedded assumptions
> which would have to be untangled.

I think there are other ways of getting the benefit that Pasha is seeking
without moving to dynamically allocated kernel memory.  One icky thing
that XFS does is punt work over to a kernel thread in order to use more
stack!  That breaks a number of things including lockdep (because the
kernel thread doesn't own the lock, the thread waiting for the kernel
thread owns the lock).

If we had segmented stacks, XFS could say "I need at least 6kB of stack",
and if less than that was available, we could allocate a temporary
stack and switch to it.  I suspect Google would also be able to use this
API for their rare cases when they need more than 8kB of kernel stack.
Who knows, we might all be able to use such a thing.

I'd been thinking about this from the point of view of allocating more
stack elsewhere in kernel space, but combining what Pasha has done here
with this idea might lead to a hybrid approach that works better; allocate
32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
rely on people using this "I need more stack" API correctly, and free the
excess pages on return to userspace.  No complicated "switch stacks" API
needed, just an "ensure we have at least N bytes of stack remaining" API.
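
For concreteness, I imagine that API looking roughly like the
hypothetical sketch below; stack_lowest_mapped() and stack_map_more()
are invented names standing in for whatever bookkeeping the real
series would use, not an existing interface.

/*
 * Ensure at least @bytes of stack below the current stack pointer are
 * mapped, mapping in more of the vmap-backed stack if necessary.
 */
static inline void stack_reserve(size_t bytes)
{
	unsigned long sp = current_stack_pointer;	/* arch-provided */

	if (sp - bytes < stack_lowest_mapped(current))
		stack_map_more(current, sp - bytes);
}

XFS could then call stack_reserve(6 * 1024) before starting a deep
call chain, and anything it calls would be guaranteed that much headroom.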
Kent Overstreet March 14, 2024, 7:49 p.m. UTC | #14
On Thu, Mar 14, 2024 at 03:34:03PM -0400, Pasha Tatashin wrote:
> On Thu, Mar 14, 2024 at 3:29 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Thu, Mar 14, 2024 at 03:23:08PM -0400, Pasha Tatashin wrote:
> > > > >
> > > > > My point is that what matters is total memory use, not just memory used in
> > > > > the kernel. Amdahl's law.
> > > >
> > > > If userspace is running a few processes with many threads and the
> > > > userspace stacks are small, kernel stacks could end up dominating.
> > > >
> > > > I'd like to see some numbers though.
> > >
> > > The unused kernel stack pages occupy petabytes of memory across the fleet [1].
> >
> > Raw number doesn't mean much here (I know how many machines Google has,
> > of course it's going to be petabytes ;), percentage of system memory
> > would be better.
> >
> > What I'd _really_ like to see is raw output from memory allocation
> > profiling, so we can see how much memory is going to kernel stacks vs.
> > other kernel allocations.
> 
> I've heard there is memory profiling working that can help with that...

I heard you've tried it out, too :)

> While I do not have the data you are asking for, the other kernel
> allocations might be useful, but this particular project is targeted
> to help with reducing overhead where the memory is not used, or used
> in very extreme rare cases.

Well, do you think you could gather it? We shouldn't be blindly applying
performance optimizations; we need to know where to focus our efforts.

e.g. on my laptop I've currently got 356 processes for < 6M of kernel
stack out of 32G total ram, so clearly this isn't much use to me. If the
ratio is similar on your servers - nah, don't want it. I expect the
ratio is not similar and you are burning proportionally more memory on
kernel stacks, but we still need to gather the data and do the math :)

> 
> > Number of kernel threads vs. number of user threads would also be good
> > to know - I've been seeing ps output lately where we've got a lot more
> > workqueue workers than we should, perhaps that's something that could be
> > addressed.
> 
> Yes, doing other optimizations make sense, reducing the total number
> kernel threads if possible might help as well. I will look into this
> as well to see how many user threads vs kernel threads we have.

Great, that will help too.
Kent Overstreet March 14, 2024, 7:53 p.m. UTC | #15
On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > Second, non-dynamic kernel memory is one of the core design decisions in
> > Linux from early on. This means there are lot of deeply embedded assumptions
> > which would have to be untangled.
> 
> I think there are other ways of getting the benefit that Pasha is seeking
> without moving to dynamically allocated kernel memory.  One icky thing
> that XFS does is punt work over to a kernel thread in order to use more
> stack!  That breaks a number of things including lockdep (because the
> kernel thread doesn't own the lock, the thread waiting for the kernel
> thread owns the lock).
> 
> If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> and if less than that was available, we could allocate a temporary
> stack and switch to it.  I suspect Google would also be able to use this
> API for their rare cases when they need more than 8kB of kernel stack.
> Who knows, we might all be able to use such a thing.
> 
> I'd been thinking about this from the point of view of allocating more
> stack elsewhere in kernel space, but combining what Pasha has done here
> with this idea might lead to a hybrid approach that works better; allocate
> 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> rely on people using this "I need more stack" API correctly, and free the
> excess pages on return to userspace.  No complicated "switch stacks" API
> needed, just an "ensure we have at least N bytes of stack remaining" API.

Why would we need an "I need more stack" API? Pasha's approach seems
like everything we need for what you're talking about.
Matthew Wilcox March 14, 2024, 7:57 p.m. UTC | #16
On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > which would have to be untangled.
> > 
> > I think there are other ways of getting the benefit that Pasha is seeking
> > without moving to dynamically allocated kernel memory.  One icky thing
> > that XFS does is punt work over to a kernel thread in order to use more
> > stack!  That breaks a number of things including lockdep (because the
> > kernel thread doesn't own the lock, the thread waiting for the kernel
> > thread owns the lock).
> > 
> > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > and if less than that was available, we could allocate a temporary
> > stack and switch to it.  I suspect Google would also be able to use this
> > API for their rare cases when they need more than 8kB of kernel stack.
> > Who knows, we might all be able to use such a thing.
> > 
> > I'd been thinking about this from the point of view of allocating more
> > stack elsewhere in kernel space, but combining what Pasha has done here
> > with this idea might lead to a hybrid approach that works better; allocate
> > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > rely on people using this "I need more stack" API correctly, and free the
> > excess pages on return to userspace.  No complicated "switch stacks" API
> > needed, just an "ensure we have at least N bytes of stack remaining" API.
> 
> Why would we need an "I need more stack" API? Pasha's approach seems
> like everything we need for what you're talking about.

Because double faults are hard, possibly impossible, and the FRED approach
Peter described has extra overhead?  This was all described up-thread.
Kent Overstreet March 14, 2024, 7:58 p.m. UTC | #17
On Thu, Mar 14, 2024 at 07:57:22PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > which would have to be untangled.
> > > 
> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > that XFS does is punt work over to a kernel thread in order to use more
> > > stack!  That breaks a number of things including lockdep (because the
> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > thread owns the lock).
> > > 
> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > and if less than that was available, we could allocate a temporary
> > > stack and switch to it.  I suspect Google would also be able to use this
> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > Who knows, we might all be able to use such a thing.
> > > 
> > > I'd been thinking about this from the point of view of allocating more
> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > with this idea might lead to a hybrid approach that works better; allocate
> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > rely on people using this "I need more stack" API correctly, and free the
> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > 
> > Why would we need an "I need more stack" API? Pasha's approach seems
> > like everything we need for what you're talking about.
> 
> Because double faults are hard, possibly impossible, and the FRED approach
> Peter described has extra overhead?  This was all described up-thread.

*nod*
Pasha Tatashin March 15, 2024, 3:13 a.m. UTC | #18
On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > which would have to be untangled.
> > >
> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > that XFS does is punt work over to a kernel thread in order to use more
> > > stack!  That breaks a number of things including lockdep (because the
> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > thread owns the lock).
> > >
> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > and if less than that was available, we could allocate a temporary
> > > stack and switch to it.  I suspect Google would also be able to use this
> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > Who knows, we might all be able to use such a thing.
> > >
> > > I'd been thinking about this from the point of view of allocating more
> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > with this idea might lead to a hybrid approach that works better; allocate
> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > rely on people using this "I need more stack" API correctly, and free the
> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > needed, just an "ensure we have at least N bytes of stack remaining" API.

I like this approach! I think we could also consider having permanent
big stacks for some kernel-only threads, like kvm-vcpu. A cooperative
stack increase framework could work well and wouldn't negatively
impact the performance of context switching. However, thorough
analysis would be necessary to proactively identify potential stack
overflow situations.

> > Why would we need an "I need more stack" API? Pasha's approach seems
> > like everything we need for what you're talking about.
>
> Because double faults are hard, possibly impossible, and the FRED approach
> Peter described has extra overhead?  This was all described up-thread.

Handling faults in #DF is possible. It requires code inspection to
handle race conditions such as the one shown by tglx. However, as
Andy pointed out, this is not supported by the SDM, as #DF is an abort
context (yet we return from it because of ESPFIX64, so return is
possible).

My question, however, is this: if we ignore memory savings and
consider only the reliability aspect of this feature, what is better:
unconditionally crashing the machine because a guard page was reached,
or printing a huge warning with backtrace information about the
offending stack, handling the fault, and surviving? I know that
historically Linus preferred WARN() to BUG() [1], but this is a
somewhat different scenario compared to a simple BUG vs. WARN.

Pasha

[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
H. Peter Anvin March 15, 2024, 3:39 a.m. UTC | #19
On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
>> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
>> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> > > > Second, non-dynamic kernel memory is one of the core design decisions in
>> > > > Linux from early on. This means there are lot of deeply embedded assumptions
>> > > > which would have to be untangled.
>> > >
>> > > I think there are other ways of getting the benefit that Pasha is seeking
>> > > without moving to dynamically allocated kernel memory.  One icky thing
>> > > that XFS does is punt work over to a kernel thread in order to use more
>> > > stack!  That breaks a number of things including lockdep (because the
>> > > kernel thread doesn't own the lock, the thread waiting for the kernel
>> > > thread owns the lock).
>> > >
>> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>> > > and if less than that was available, we could allocate a temporary
>> > > stack and switch to it.  I suspect Google would also be able to use this
>> > > API for their rare cases when they need more than 8kB of kernel stack.
>> > > Who knows, we might all be able to use such a thing.
>> > >
>> > > I'd been thinking about this from the point of view of allocating more
>> > > stack elsewhere in kernel space, but combining what Pasha has done here
>> > > with this idea might lead to a hybrid approach that works better; allocate
>> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>> > > rely on people using this "I need more stack" API correctly, and free the
>> > > excess pages on return to userspace.  No complicated "switch stacks" API
>> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
>
>I like this approach! I think we could also consider having permanent
>big stacks for some kernel only threads like kvm-vcpu. A cooperative
>stack increase framework could work well and wouldn't negatively
>impact the performance of context switching. However, thorough
>analysis would be necessary to proactively identify potential stack
>overflow situations.
>
>> > Why would we need an "I need more stack" API? Pasha's approach seems
>> > like everything we need for what you're talking about.
>>
>> Because double faults are hard, possibly impossible, and the FRED approach
>> Peter described has extra overhead?  This was all described up-thread.
>
>Handling faults in #DF is possible. It requires code inspection to
>handle race conditions such as what was shown by tglx. However, as
>Andy pointed out, this is not supported by SDM as it is an abort
>context (yet we return from it because of ESPFIX64, so return is
>possible).
>
>My question, however, if we ignore memory savings and only consider
>reliability aspect of this feature.  What is better unconditionally
>crashing the machine because a guard page was reached, or printing a
>huge warning with a backtracing information about the offending stack,
>handling the fault, and survive? I know that historically Linus
>preferred WARN() to BUG() [1]. But, this is a somewhat different
>scenario compared to simple BUG vs WARN.
>
>Pasha
>
>[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
>

The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
H. Peter Anvin March 15, 2024, 4:17 a.m. UTC | #20
On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
>> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
>> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> > > > Second, non-dynamic kernel memory is one of the core design decisions in
>> > > > Linux from early on. This means there are lot of deeply embedded assumptions
>> > > > which would have to be untangled.
>> > >
>> > > I think there are other ways of getting the benefit that Pasha is seeking
>> > > without moving to dynamically allocated kernel memory.  One icky thing
>> > > that XFS does is punt work over to a kernel thread in order to use more
>> > > stack!  That breaks a number of things including lockdep (because the
>> > > kernel thread doesn't own the lock, the thread waiting for the kernel
>> > > thread owns the lock).
>> > >
>> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>> > > and if less than that was available, we could allocate a temporary
>> > > stack and switch to it.  I suspect Google would also be able to use this
>> > > API for their rare cases when they need more than 8kB of kernel stack.
>> > > Who knows, we might all be able to use such a thing.
>> > >
>> > > I'd been thinking about this from the point of view of allocating more
>> > > stack elsewhere in kernel space, but combining what Pasha has done here
>> > > with this idea might lead to a hybrid approach that works better; allocate
>> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>> > > rely on people using this "I need more stack" API correctly, and free the
>> > > excess pages on return to userspace.  No complicated "switch stacks" API
>> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
>
>I like this approach! I think we could also consider having permanent
>big stacks for some kernel only threads like kvm-vcpu. A cooperative
>stack increase framework could work well and wouldn't negatively
>impact the performance of context switching. However, thorough
>analysis would be necessary to proactively identify potential stack
>overflow situations.
>
>> > Why would we need an "I need more stack" API? Pasha's approach seems
>> > like everything we need for what you're talking about.
>>
>> Because double faults are hard, possibly impossible, and the FRED approach
>> Peter described has extra overhead?  This was all described up-thread.
>
>Handling faults in #DF is possible. It requires code inspection to
>handle race conditions such as what was shown by tglx. However, as
>Andy pointed out, this is not supported by SDM as it is an abort
>context (yet we return from it because of ESPFIX64, so return is
>possible).
>
>My question, however, if we ignore memory savings and only consider
>reliability aspect of this feature.  What is better unconditionally
>crashing the machine because a guard page was reached, or printing a
>huge warning with a backtracing information about the offending stack,
>handling the fault, and survive? I know that historically Linus
>preferred WARN() to BUG() [1]. But, this is a somewhat different
>scenario compared to simple BUG vs WARN.
>
>Pasha
>
>[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
>

From a reliability point of view it is better to die than to proceed with possible data loss. The latter is extremely serious.

However, the one way that this could be made to work would be with stack probes, which could be compiler-inserted. The point is that you touch an offset below the stack pointer that is large enough to cover not only the maximum amount of stack the function needs but also an additional margin, so that you can safely take the #PF on the remaining stack.
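
Hand-written illustration only (not actual compiler output;
PROBE_MARGIN is an arbitrary number here):

/*
 * Probe for a function that needs frame_size bytes: touch an address
 * below the stack pointer that covers the frame plus a margin, so any
 * #PF is taken here, while enough mapped stack remains to handle it.
 */
#define PROBE_MARGIN	512

static __always_inline void stack_probe(unsigned long frame_size)
{
	volatile char *p;

	p = (volatile char *)(current_stack_pointer - frame_size - PROBE_MARGIN);
	(void)*p;	/* the read faults the page in if it is unmapped */
}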
Pasha Tatashin March 16, 2024, 7:17 p.m. UTC | #21
On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>
> >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> >> > > > which would have to be untangled.
> >> > >
> >> > > I think there are other ways of getting the benefit that Pasha is seeking
> >> > > without moving to dynamically allocated kernel memory.  One icky thing
> >> > > that XFS does is punt work over to a kernel thread in order to use more
> >> > > stack!  That breaks a number of things including lockdep (because the
> >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> >> > > thread owns the lock).
> >> > >
> >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> >> > > and if less than that was available, we could allocate a temporary
> >> > > stack and switch to it.  I suspect Google would also be able to use this
> >> > > API for their rare cases when they need more than 8kB of kernel stack.
> >> > > Who knows, we might all be able to use such a thing.
> >> > >
> >> > > I'd been thinking about this from the point of view of allocating more
> >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> >> > > with this idea might lead to a hybrid approach that works better; allocate
> >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> >> > > rely on people using this "I need more stack" API correctly, and free the
> >> > > excess pages on return to userspace.  No complicated "switch stacks" API
> >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> >
> >I like this approach! I think we could also consider having permanent
> >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> >stack increase framework could work well and wouldn't negatively
> >impact the performance of context switching. However, thorough
> >analysis would be necessary to proactively identify potential stack
> >overflow situations.
> >
> >> > Why would we need an "I need more stack" API? Pasha's approach seems
> >> > like everything we need for what you're talking about.
> >>
> >> Because double faults are hard, possibly impossible, and the FRED approach
> >> Peter described has extra overhead?  This was all described up-thread.
> >
> >Handling faults in #DF is possible. It requires code inspection to
> >handle race conditions such as what was shown by tglx. However, as
> >Andy pointed out, this is not supported by SDM as it is an abort
> >context (yet we return from it because of ESPFIX64, so return is
> >possible).
> >
> >My question, however, if we ignore memory savings and only consider
> >reliability aspect of this feature.  What is better unconditionally
> >crashing the machine because a guard page was reached, or printing a
> >huge warning with a backtracing information about the offending stack,
> >handling the fault, and survive? I know that historically Linus
> >preferred WARN() to BUG() [1]. But, this is a somewhat different
> >scenario compared to simple BUG vs WARN.
> >
> >Pasha
> >
> >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> >
>
> The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.

Got it. So, using a #DF handler for stack page faults isn't feasible.
I suppose the only way for this to work would be to use a dedicated
Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
that might introduce other complications.

Expanding on Matthew's idea of an interface for dynamic kernel stack
sizes, here's what I'm thinking:

- Kernel Threads: Create all kernel threads with a fully populated
THREAD_SIZE stack.  (i.e. 16K)
- User Threads: Create all user threads with THREAD_SIZE kernel stack
but only the top page mapped. (i.e. 4K)
- In enter_from_user_mode(): Expand the thread stack to 16K by mapping
three additional pages from the per-CPU stack cache. This function is
called early in kernel entry points.
- exit_to_user_mode(): Unmap the extra three pages and return them to
the per-CPU cache. This function is called late in the kernel exit
path.

Both of the above hooks are called with IRQs disabled on all kernel
entries, whether through interrupts or syscalls, and they are called
early/late enough that 4K is sufficient to handle the rest of entry/exit.
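
Roughly (sketch only: enter_from_user_mode()/exit_to_user_mode() are
the existing generic entry hooks, but the dynamic_stack_*() helpers
below are invented names for illustration):

noinstr void enter_from_user_mode(struct pt_regs *regs)
{
	/* ... existing entry work, IRQs still disabled ... */

	if (dynamic_stack_enabled())
		/* map 3 pages from the per-CPU cache below the top page */
		dynamic_stack_expand(current);
}

noinstr void exit_to_user_mode(void)
{
	if (dynamic_stack_enabled())
		/* unmap the extra pages, return them to the per-CPU cache */
		dynamic_stack_shrink(current);

	/* ... existing exit work ... */
}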

Pasha
Matthew Wilcox March 17, 2024, 12:41 a.m. UTC | #22
On Sat, Mar 16, 2024 at 03:17:57PM -0400, Pasha Tatashin wrote:
> Expanding on Mathew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
> 
> - Kernel Threads: Create all kernel threads with a fully populated
> THREAD_SIZE stack.  (i.e. 16K)
> - User Threads: Create all user threads with THREAD_SIZE kernel stack
> but only the top page mapped. (i.e. 4K)
> - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> three additional pages from the per-CPU stack cache. This function is
> called early in kernel entry points.
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.
> 
> Both of the above hooks are called with IRQ disabled on all kernel
> entries whether through interrupts and syscalls, and they are called
> early/late enough that 4K is enough to handle the rest of entry/exit.

At what point do we replenish the per-CPU stash of pages?  If we're
12kB deep in the stack and call mutex_lock(), we can be scheduled out,
and then the new thread can make a syscall.  Do we just assume that
get_free_page() can sleep at kernel entry (seems reasonable)?  I don't
think this is an infeasible problem, I'd just like it to be described.
H. Peter Anvin March 17, 2024, 12:47 a.m. UTC | #23
On March 14, 2024 12:43:06 PM PDT, Matthew Wilcox <willy@infradead.org> wrote:
>On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> Second, non-dynamic kernel memory is one of the core design decisions in
>> Linux from early on. This means there are lot of deeply embedded assumptions
>> which would have to be untangled.
>
>I think there are other ways of getting the benefit that Pasha is seeking
>without moving to dynamically allocated kernel memory.  One icky thing
>that XFS does is punt work over to a kernel thread in order to use more
>stack!  That breaks a number of things including lockdep (because the
>kernel thread doesn't own the lock, the thread waiting for the kernel
>thread owns the lock).
>
>If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>and if less than that was available, we could allocate a temporary
>stack and switch to it.  I suspect Google would also be able to use this
>API for their rare cases when they need more than 8kB of kernel stack.
>Who knows, we might all be able to use such a thing.
>
>I'd been thinking about this from the point of view of allocating more
>stack elsewhere in kernel space, but combining what Pasha has done here
>with this idea might lead to a hybrid approach that works better; allocate
>32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>rely on people using this "I need more stack" API correctly, and free the
>excess pages on return to userspace.  No complicated "switch stacks" API
>needed, just an "ensure we have at least N bytes of stack remaining" API.

This is basically what stack probes do. They provide a very cheap "API" that goes via the #PF (not #DF!) path in the slow case, synchronously at a well-defined point, and are virtually free in the common case. As a side benefit, they can be compiler-generated, as some operating systems require them.
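
For readers unfamiliar with the term, a stack probe is just a deliberate
touch of the stack some distance below the current stack pointer, so any
fault is taken right there.  A minimal, purely illustrative sketch (the
function name is made up; current_stack_pointer is the x86 helper):

/* Illustrative only: force any stack fault to happen here, at a
 * well-defined point, instead of at an arbitrary deeper call. */
static __always_inline void stack_probe(unsigned long bytes)
{
	volatile char *probe = (char *)current_stack_pointer - bytes;

	(void)*probe;	/* read; faults via #PF if the page is unmapped */
}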
Kent Overstreet March 17, 2024, 1:32 a.m. UTC | #24
On Sun, Mar 17, 2024 at 12:41:33AM +0000, Matthew Wilcox wrote:
> On Sat, Mar 16, 2024 at 03:17:57PM -0400, Pasha Tatashin wrote:
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> > 
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack.  (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
> > 
> > Both of the above hooks are called with IRQ disabled on all kernel
> > entries whether through interrupts and syscalls, and they are called
> > early/late enough that 4K is enough to handle the rest of entry/exit.
> 
> At what point do we replenish the per-CPU stash of pages?  If we're
> 12kB deep in the stack and call mutex_lock(), we can be scheduled out,
> and then the new thread can make a syscall.  Do we just assume that
> get_free_page() can sleep at kernel entry (seems reasonable)?  I don't
> think this is an infeasible problem, I'd just like it to be described.

schedule() or return to userspace, I believe was mentioned
Pasha Tatashin March 17, 2024, 2:19 p.m. UTC | #25
On Sat, Mar 16, 2024 at 8:41 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Mar 16, 2024 at 03:17:57PM -0400, Pasha Tatashin wrote:
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack.  (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
> >
> > Both of the above hooks are called with IRQ disabled on all kernel
> > entries whether through interrupts and syscalls, and they are called
> > early/late enough that 4K is enough to handle the rest of entry/exit.
>
> At what point do we replenish the per-CPU stash of pages?  If we're
> 12kB deep in the stack and call mutex_lock(), we can be scheduled out,
> and then the new thread can make a syscall.  Do we just assume that
> get_free_page() can sleep at kernel entry (seems reasonable)?  I don't
> think this is an infeasible problem, I'd just like it to be described.

Once IRQs are enabled it is perfectly OK to sleep and wait for the stack
pages to become available.

The following user entries enable interrupts:
do_user_addr_fault()
   local_irq_enable()

do_syscall_64()
  syscall_enter_from_user_mode()
    local_irq_enable()

__do_fast_syscall_32()
  syscall_enter_from_user_mode_prepare()
    local_irq_enable()

exc_debug_user()
  local_irq_enable()

do_int3_user()
  cond_local_irq_enable()

With those it is perfectly OK to sleep and wait for a page to become
available when the per-CPU cache is empty and alloc_page(GFP_NOWAIT)
does not succeed.

The other interrupts from userland never enable IRQs. We can keep
3 pages per CPU reserved specifically for those IRQs-never-enabled
cases, as no more than one such entry can ever be in flight at a time.
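
Roughly the fallback order I have in mind (the function is hypothetical,
and struct stack_cache is the same made-up per-CPU cache as in the
earlier sketch; whether sleeping is allowed depends on the entry point,
per the list above):

/* Hypothetical sketch of getting one stack page on kernel entry. */
static struct page *dyn_stack_get_page(bool may_sleep)
{
	struct stack_cache *sc = this_cpu_ptr(&stack_cache);
	struct page *page;

	if (sc->nr)					/* 1. per-CPU cache */
		return sc->pages[--sc->nr];

	page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);	/* 2. opportunistic */
	if (page || !may_sleep)
		return page;

	return alloc_page(GFP_KERNEL);			/* 3. sleep, IRQs on */
}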

Pasha
Brian Gerst March 17, 2024, 2:43 p.m. UTC | #26
On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >>
> > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > >> > > > which would have to be untangled.
> > >> > >
> > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > >> > > without moving to dynamically allocated kernel memory.  One icky thing
> > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > >> > > stack!  That breaks a number of things including lockdep (because the
> > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > >> > > thread owns the lock).
> > >> > >
> > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > >> > > and if less than that was available, we could allocate a temporary
> > >> > > stack and switch to it.  I suspect Google would also be able to use this
> > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > >> > > Who knows, we might all be able to use such a thing.
> > >> > >
> > >> > > I'd been thinking about this from the point of view of allocating more
> > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > >> > > rely on people using this "I need more stack" API correctly, and free the
> > >> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > >
> > >I like this approach! I think we could also consider having permanent
> > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > >stack increase framework could work well and wouldn't negatively
> > >impact the performance of context switching. However, thorough
> > >analysis would be necessary to proactively identify potential stack
> > >overflow situations.
> > >
> > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > >> > like everything we need for what you're talking about.
> > >>
> > >> Because double faults are hard, possibly impossible, and the FRED approach
> > >> Peter described has extra overhead?  This was all described up-thread.
> > >
> > >Handling faults in #DF is possible. It requires code inspection to
> > >handle race conditions such as what was shown by tglx. However, as
> > >Andy pointed out, this is not supported by SDM as it is an abort
> > >context (yet we return from it because of ESPFIX64, so return is
> > >possible).
> > >
> > >My question, however, if we ignore memory savings and only consider
> > >reliability aspect of this feature.  What is better unconditionally
> > >crashing the machine because a guard page was reached, or printing a
> > >huge warning with a backtracing information about the offending stack,
> > >handling the fault, and survive? I know that historically Linus
> > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > >scenario compared to simple BUG vs WARN.
> > >
> > >Pasha
> > >
> > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > >
> >
> > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
>
> Got it. So, using a #DF handler for stack page faults isn't feasible.
> I suppose the only way for this to work would be to use a dedicated
> Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> that might introduce other complications.
>
> Expanding on Mathew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
>
> - Kernel Threads: Create all kernel threads with a fully populated
> THREAD_SIZE stack.  (i.e. 16K)
> - User Threads: Create all user threads with THREAD_SIZE kernel stack
> but only the top page mapped. (i.e. 4K)
> - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> three additional pages from the per-CPU stack cache. This function is
> called early in kernel entry points.
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.
>
> Both of the above hooks are called with IRQ disabled on all kernel
> entries whether through interrupts and syscalls, and they are called
> early/late enough that 4K is enough to handle the rest of entry/exit.

This proposal will not have the memory savings that you are looking
for, since sleeping tasks would still have a fully allocated stack.
This also would add extra overhead to each entry and exit (including
syscalls) that can happen multiple times before a context switch.  It
also doesn't make much sense because a task running in user mode will
quickly need those stack pages back when it returns to kernel mode.
Even if it doesn't make a syscall, the timer interrupt will kick it
out of user mode.

What should happen is that the unused stack is reclaimed when a task
goes to sleep.  The kernel does not use a red zone, so any stack pages
below the saved stack pointer of a sleeping task (task->thread.sp) can
be safely discarded.  Before context switching to a task, fully
populate its task stack.  After context switching from a task, reclaim
its unused stack.  This way, the task stack in use is always fully
allocated and we don't have to deal with page faults.

To make this happen, __switch_to() would have to be split into two
parts, to cleanly separate what happens before and after the stack
switch.  The first part saves processor context for the previous task,
and prepares the next task.  Populating the next task's stack would
happen here.  Then it would return to the assembly code to do the
stack switch.  The second part then loads the context of the next
task, and finalizes any work for the previous task.  Reclaiming the
unused stack pages of the previous task would happen here.
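
Roughly this shape; the names are illustrative, the two halves would be
carved out of today's arch __switch_to()/__switch_to_asm(), and
dyn_stack_populate()/dyn_stack_reclaim() are hypothetical helpers:

/* Runs on the old stack, before the assembly stack switch. */
void __switch_to_prepare(struct task_struct *prev, struct task_struct *next)
{
	/* ...save prev's processor context (first half of __switch_to)... */
	dyn_stack_populate(next);	/* next's stack fully mapped before use */
}

/* Runs on the new stack, after the assembly stack switch. */
void __switch_to_finish(struct task_struct *prev, struct task_struct *next)
{
	/* ...load next's context, finish prev's bookkeeping... */

	/* prev->thread.sp is final now; no red zone, so everything below
	 * its page boundary is guaranteed unused while prev sleeps. */
	dyn_stack_reclaim(prev, ALIGN_DOWN(prev->thread.sp, PAGE_SIZE));
}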


Brian Gerst
Pasha Tatashin March 17, 2024, 4:15 p.m. UTC | #27
On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <brgerst@gmail.com> wrote:
>
> On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > >
> > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >>
> > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > >> > > > which would have to be untangled.
> > > >> > >
> > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > >> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > >> > > stack!  That breaks a number of things including lockdep (because the
> > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > >> > > thread owns the lock).
> > > >> > >
> > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > >> > > and if less than that was available, we could allocate a temporary
> > > >> > > stack and switch to it.  I suspect Google would also be able to use this
> > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > >> > > Who knows, we might all be able to use such a thing.
> > > >> > >
> > > >> > > I'd been thinking about this from the point of view of allocating more
> > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > >> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > >
> > > >I like this approach! I think we could also consider having permanent
> > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > >stack increase framework could work well and wouldn't negatively
> > > >impact the performance of context switching. However, thorough
> > > >analysis would be necessary to proactively identify potential stack
> > > >overflow situations.
> > > >
> > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > >> > like everything we need for what you're talking about.
> > > >>
> > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > >> Peter described has extra overhead?  This was all described up-thread.
> > > >
> > > >Handling faults in #DF is possible. It requires code inspection to
> > > >handle race conditions such as what was shown by tglx. However, as
> > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > >context (yet we return from it because of ESPFIX64, so return is
> > > >possible).
> > > >
> > > >My question, however, if we ignore memory savings and only consider
> > > >reliability aspect of this feature.  What is better unconditionally
> > > >crashing the machine because a guard page was reached, or printing a
> > > >huge warning with a backtracing information about the offending stack,
> > > >handling the fault, and survive? I know that historically Linus
> > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > >scenario compared to simple BUG vs WARN.
> > > >
> > > >Pasha
> > > >
> > > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > > >
> > >
> > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> >
> > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > I suppose the only way for this to work would be to use a dedicated
> > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > that might introduce other complications.
> >
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack.  (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
> >
> > Both of the above hooks are called with IRQ disabled on all kernel
> > entries whether through interrupts and syscalls, and they are called
> > early/late enough that 4K is enough to handle the rest of entry/exit.

Hi Brian,

> This proposal will not have the memory savings that you are looking
> for, since sleeping tasks would still have a fully allocated stack.

The tasks that were descheduled while running in user mode should not
increase their stack. The potential saving is greater than with the
original proposal, because in the original proposal we never shrink
stacks after faults.

> This also would add extra overhead to each entry and exit (including
> syscalls) that can happen multiple times before a context switch.  It
> also doesn't make much sense because a task running in user mode will
> quickly need those stack pages back when it returns to kernel mode.
> Even if it doesn't make a syscall, the timer interrupt will kick it
> out of user mode.
>
> What should happen is that the unused stack is reclaimed when a task
> goes to sleep.  The kernel does not use a red zone, so any stack pages
> below the saved stack pointer of a sleeping task (task->thread.sp) can
> be safely discarded.  Before context switching to a task, fully

Excellent observation; this makes Andy Lutomirski's per-map proposal
[1] usable without tracking dirty/accessed bits. It is more reliable,
and also platform independent.

> populate its task stack.  After context switching from a task, reclaim
> its unused stack.  This way, the task stack in use is always fully
> allocated and we don't have to deal with page faults.
>
> To make this happen, __switch_to() would have to be split into two
> parts, to cleanly separate what happens before and after the stack
> switch.  The first part saves processor context for the previous task,
> and prepares the next task.

If we know the stack requirements of __switch_to(), can't we actually
do all of that in common code, in context_switch() right before
__switch_to()? We would make an arch-specific call to get the
__switch_to() stack requirement, and use it to adjust task->thread.sp
so we know where the stack is going to be while the task sleeps. At
that point we can unmap the stack pages of the previous task, and map
the pages for the next task.

> Populating the next task's stack would
> happen here.  Then it would return to the assembly code to do the
> stack switch.  The second part then loads the context of the next
> task, and finalizes any work for the previous task.  Reclaiming the
> unused stack pages of the previous task would happen here.

The problem with this (and Andy's original approach) is that we
cannot sleep here. What happens if the per-CPU stack cache gets
exhausted because several threads sleep while having deep stacks? How
can we schedule the next task? This is probably a corner case, but it
needs a proper handling solution. One solution is that while in
schedule(), and while interrupts are still enabled before going to
switch_to(), we pre-allocate the 3 pages in the per-CPU cache. However,
what if the pre-allocation itself calls cond_resched() because it
enters the page allocator slowpath?

Other than the above concern, I concur, this approach looks to be the
best so far. I will think more about it.

Thank you,
Pasha

[1] https://lore.kernel.org/all/3e180c07-53db-4acb-a75c-1a33447d81af@app.fastmail.com
David Laight March 17, 2024, 6:57 p.m. UTC | #28
From: Pasha Tatashin
> Sent: 16 March 2024 19:18
...
> Expanding on Mathew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
> 
> - Kernel Threads: Create all kernel threads with a fully populated
> THREAD_SIZE stack.  (i.e. 16K)
> - User Threads: Create all user threads with THREAD_SIZE kernel stack
> but only the top page mapped. (i.e. 4K)
> - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> three additional pages from the per-CPU stack cache. This function is
> called early in kernel entry points.
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.

Isn't that entirely horrid for TLB use and so will require a lot of IPIs?

Remember, if a thread sleeps in 'extra stack' and is then rescheduled
on a different cpu, the extra pages get 'pumped' from one cpu to
another.

I also suspect a stack_probe() is likely to end up being a cache miss
and also slow???
So you wouldn't want one on all calls.
I'm not sure you'd want a conditional branch either.

The explicit request for 'more stack' can be required to be allowed
to sleep - removing a lot of issues.
It would also be portable to all architectures.
I'd also suspect that any thread that needs extra stack is likely
to need it again.
So while the memory could be recovered, I'd bet it isn't worth
doing except under memory pressure.
The call could also return 'no' - perhaps useful for (broken) code
that insists on being recursive.
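
Purely as an illustration of the shape of such an API (nothing below
exists today; the name and the 6kB number are arbitrary):

/*
 * Ensure at least @bytes of stack are mapped below the current stack
 * pointer.  May sleep.  Returns 0 on success, -ENOMEM if the request
 * cannot be satisfied, letting (broken) recursive code back off.
 */
int kstack_ensure(size_t bytes);

/* e.g. in a deep filesystem path: */
static int deep_helper(void)
{
	if (kstack_ensure(6 * 1024))
		return -ENOMEM;	/* refuse rather than run off the stack */
	/* ...proceed knowing 6kB of stack is available... */
	return 0;
}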

	David

Brian Gerst March 17, 2024, 9:30 p.m. UTC | #29
On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <brgerst@gmail.com> wrote:
> >
> > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > >
> > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >>
> > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > >> > > > which would have to be untangled.
> > > > >> > >
> > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > >> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > >> > > stack!  That breaks a number of things including lockdep (because the
> > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > >> > > thread owns the lock).
> > > > >> > >
> > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > >> > > and if less than that was available, we could allocate a temporary
> > > > >> > > stack and switch to it.  I suspect Google would also be able to use this
> > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > >> > > Who knows, we might all be able to use such a thing.
> > > > >> > >
> > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > >> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > >
> > > > >I like this approach! I think we could also consider having permanent
> > > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > > >stack increase framework could work well and wouldn't negatively
> > > > >impact the performance of context switching. However, thorough
> > > > >analysis would be necessary to proactively identify potential stack
> > > > >overflow situations.
> > > > >
> > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > >> > like everything we need for what you're talking about.
> > > > >>
> > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > >> Peter described has extra overhead?  This was all described up-thread.
> > > > >
> > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > >handle race conditions such as what was shown by tglx. However, as
> > > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > >possible).
> > > > >
> > > > >My question, however, if we ignore memory savings and only consider
> > > > >reliability aspect of this feature.  What is better unconditionally
> > > > >crashing the machine because a guard page was reached, or printing a
> > > > >huge warning with a backtracing information about the offending stack,
> > > > >handling the fault, and survive? I know that historically Linus
> > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > >scenario compared to simple BUG vs WARN.
> > > > >
> > > > >Pasha
> > > > >
> > > > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > > > >
> > > >
> > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > >
> > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > I suppose the only way for this to work would be to use a dedicated
> > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > that might introduce other complications.
> > >
> > > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel Threads: Create all kernel threads with a fully populated
> > > THREAD_SIZE stack.  (i.e. 16K)
> > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > but only the top page mapped. (i.e. 4K)
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > three additional pages from the per-CPU stack cache. This function is
> > > called early in kernel entry points.
> > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > the per-CPU cache. This function is called late in the kernel exit
> > > path.
> > >
> > > Both of the above hooks are called with IRQ disabled on all kernel
> > > entries whether through interrupts and syscalls, and they are called
> > > early/late enough that 4K is enough to handle the rest of entry/exit.
>
> Hi Brian,
>
> > This proposal will not have the memory savings that you are looking
> > for, since sleeping tasks would still have a fully allocated stack.
>
> The tasks that were descheduled while running in user mode should not
> increase their stack. The potential saving is greater than the
> origianl proposal, because in the origianl proposal we never shrink
> stacks after faults.

A task has to enter kernel mode in order to be rescheduled.  If it
doesn't make a syscall or hit an exception, then the timer interrupt
will eventually kick it out of user mode.  At some point schedule() is
called, the task is put to sleep and context is switched to the next
task.  A sleeping task will always be using some amount of kernel
stack.  How much depends a lot on what caused the task to sleep.  If
the timeslice expired it could switch right before the return to user
mode.  A page fault could go deep into filesystem and device code
waiting on an I/O operation.

> > This also would add extra overhead to each entry and exit (including
> > syscalls) that can happen multiple times before a context switch.  It
> > also doesn't make much sense because a task running in user mode will
> > quickly need those stack pages back when it returns to kernel mode.
> > Even if it doesn't make a syscall, the timer interrupt will kick it
> > out of user mode.
> >
> > What should happen is that the unused stack is reclaimed when a task
> > goes to sleep.  The kernel does not use a red zone, so any stack pages
> > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > be safely discarded.  Before context switching to a task, fully
>
> Excellent observation, this makes Andy Lutomirski per-map proposal [1]
> usable without tracking dirty/accessed bits. More reliable, and also
> platform independent.

This is x86-specific.  Other architectures will likely have differences.

> > populate its task stack.  After context switching from a task, reclaim
> > its unused stack.  This way, the task stack in use is always fully
> > allocated and we don't have to deal with page faults.
> >
> > To make this happen, __switch_to() would have to be split into two
> > parts, to cleanly separate what happens before and after the stack
> > switch.  The first part saves processor context for the previous task,
> > and prepares the next task.
>
> By knowing the stack requirements of __switch_to(), can't we actually
> do all that in the common code in context_switch() right before
> __switch_to()? We would do an arch specific call to get the
> __switch_to() stack requirement, and use that to change the value of
> task->thread.sp to know where the stack is going to be while sleeping.
> At this time we can do the unmapping of the stack pages from the
> previous task, and mapping the pages to the next task.

task->thread.sp is set in __switch_to_asm(), and is pretty much the
last thing done in the context of the previous task.  Trying to
predict that value ahead of time is way too fragile.  Also, the key
point I was trying to make is that you cannot safely shrink the active
stack.  It can only be done after the stack switch to the new task.

> > Populating the next task's stack would
> > happen here.  Then it would return to the assembly code to do the
> > stack switch.  The second part then loads the context of the next
> > task, and finalizes any work for the previous task.  Reclaiming the
> > unused stack pages of the previous task would happen here.
>
> The problem with this (and the origianl Andy's approach), is that we
> cannot sleep here. What happens if we get per-cpu stack cache
> exhausted because several threads sleep while having deep stacks? How
> can we schedule the next task? This is probably a corner case, but it
> needs to have a proper handling solution. One solution is while in
> schedule() and while interrupts are still enabled before going to
> switch_to() we must pre-allocate 3-page in the per-cpu. However, what
> if the pre-allocation itself calls cond_resched() because it enters
> page allocator slowpath?

You would have to keep extra pages in reserve for allocation failures.
mempool could probably help with that.
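
For reference, a sketch using the existing page-pool flavour of mempool
(the reserve size is arbitrary; with GFP_NOWAIT the allocation can still
fail if the reserve happens to be empty, so this is a mitigation rather
than a guarantee):

#include <linux/init.h>
#include <linux/mempool.h>

static mempool_t *stack_page_pool;

static int __init stack_pool_init(void)
{
	/* Keep a reserve of 64 order-0 pages for stack refills. */
	stack_page_pool = mempool_create_page_pool(64, 0);
	return stack_page_pool ? 0 : -ENOMEM;
}
core_initcall(stack_pool_init);

/* Refill attempt that must not sleep (e.g. with IRQs disabled);
 * may still return NULL if the reserve is temporarily exhausted. */
static struct page *stack_page_get_atomic(void)
{
	return mempool_alloc(stack_page_pool, GFP_NOWAIT);
}

static void stack_page_put(struct page *page)
{
	mempool_free(page, stack_page_pool);
}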

Brian Gerst
Pasha Tatashin March 18, 2024, 2:59 p.m. UTC | #30
On Sun, Mar 17, 2024 at 5:30 PM Brian Gerst <brgerst@gmail.com> wrote:
>
> On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <brgerst@gmail.com> wrote:
> > >
> > > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> > > <pasha.tatashin@soleen.com> wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > > >
> > > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >>
> > > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > > >> > > > which would have to be untangled.
> > > > > >> > >
> > > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > > >> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > > >> > > stack!  That breaks a number of things including lockdep (because the
> > > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > > >> > > thread owns the lock).
> > > > > >> > >
> > > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > > >> > > and if less than that was available, we could allocate a temporary
> > > > > >> > > stack and switch to it.  I suspect Google would also be able to use this
> > > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > > >> > > Who knows, we might all be able to use such a thing.
> > > > > >> > >
> > > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > > >> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > > >
> > > > > >I like this approach! I think we could also consider having permanent
> > > > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > > > >stack increase framework could work well and wouldn't negatively
> > > > > >impact the performance of context switching. However, thorough
> > > > > >analysis would be necessary to proactively identify potential stack
> > > > > >overflow situations.
> > > > > >
> > > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > > >> > like everything we need for what you're talking about.
> > > > > >>
> > > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > > >> Peter described has extra overhead?  This was all described up-thread.
> > > > > >
> > > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > > >handle race conditions such as what was shown by tglx. However, as
> > > > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > > >possible).
> > > > > >
> > > > > >My question, however, if we ignore memory savings and only consider
> > > > > >reliability aspect of this feature.  What is better unconditionally
> > > > > >crashing the machine because a guard page was reached, or printing a
> > > > > >huge warning with a backtracing information about the offending stack,
> > > > > >handling the fault, and survive? I know that historically Linus
> > > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > > >scenario compared to simple BUG vs WARN.
> > > > > >
> > > > > >Pasha
> > > > > >
> > > > > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > > > > >
> > > > >
> > > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > > >
> > > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > > I suppose the only way for this to work would be to use a dedicated
> > > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > > that might introduce other complications.
> > > >
> > > > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > > > sizes, here's what I'm thinking:
> > > >
> > > > - Kernel Threads: Create all kernel threads with a fully populated
> > > > THREAD_SIZE stack.  (i.e. 16K)
> > > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > > but only the top page mapped. (i.e. 4K)
> > > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > > three additional pages from the per-CPU stack cache. This function is
> > > > called early in kernel entry points.
> > > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > > the per-CPU cache. This function is called late in the kernel exit
> > > > path.
> > > >
> > > > Both of the above hooks are called with IRQ disabled on all kernel
> > > > entries whether through interrupts and syscalls, and they are called
> > > > early/late enough that 4K is enough to handle the rest of entry/exit.
> >
> > Hi Brian,
> >
> > > This proposal will not have the memory savings that you are looking
> > > for, since sleeping tasks would still have a fully allocated stack.
> >
> > The tasks that were descheduled while running in user mode should not
> > increase their stack. The potential saving is greater than the
> > origianl proposal, because in the origianl proposal we never shrink
> > stacks after faults.
>
> A task has to enter kernel mode in order to be rescheduled.  If it
> doesn't make a syscall or hit an exception, then the timer interrupt
> will eventually kick it out of user mode.  At some point schedule() is
> called, the task is put to sleep and context is switched to the next
> task.  A sleeping task will always be using some amount of kernel
> stack.  How much depends a lot on what caused the task to sleep.  If
> the timeslice expired it could switch right before the return to user
> mode.  A page fault could go deep into filesystem and device code
> waiting on an I/O operation.
>
> > > This also would add extra overhead to each entry and exit (including
> > > syscalls) that can happen multiple times before a context switch.  It
> > > also doesn't make much sense because a task running in user mode will
> > > quickly need those stack pages back when it returns to kernel mode.
> > > Even if it doesn't make a syscall, the timer interrupt will kick it
> > > out of user mode.
> > >
> > > What should happen is that the unused stack is reclaimed when a task
> > > goes to sleep.  The kernel does not use a red zone, so any stack pages
> > > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > > be safely discarded.  Before context switching to a task, fully
> >
> > Excellent observation, this makes Andy Lutomirski per-map proposal [1]
> > usable without tracking dirty/accessed bits. More reliable, and also
> > platform independent.
>
> This is x86-specific.  Other architectures will likely have differences.
>
> > > populate its task stack.  After context switching from a task, reclaim
> > > its unused stack.  This way, the task stack in use is always fully
> > > allocated and we don't have to deal with page faults.
> > >
> > > To make this happen, __switch_to() would have to be split into two
> > > parts, to cleanly separate what happens before and after the stack
> > > switch.  The first part saves processor context for the previous task,
> > > and prepares the next task.
> >
> > By knowing the stack requirements of __switch_to(), can't we actually
> > do all that in the common code in context_switch() right before
> > __switch_to()? We would do an arch specific call to get the
> > __switch_to() stack requirement, and use that to change the value of
> > task->thread.sp to know where the stack is going to be while sleeping.
> > At this time we can do the unmapping of the stack pages from the
> > previous task, and mapping the pages to the next task.
>
> task->thread.sp is set in __switch_to_asm(), and is pretty much the
> last thing done in the context of the previous task.  Trying to
> predict that value ahead of time is way too fragile.

We don't require an exact value, but rather an approximate upper
limit. To illustrate: subtract 1K from the current .sp, then round down
to the containing page to decide how many pages need unmapping. The
primary advantage is that we can avoid platform-specific ifdefs for
DYNAMIC_STACKS within the arch-specific switch_to() function. Instead,
each platform can provide an appropriate upper bound for switch_to()
operations. We know how much information these routines are going to
store on the stack, and since interrupts are disabled the stack is not
used for anything else there, so I do not see a problem with
determining a reasonable upper bound.
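
Something along these lines, where SWITCH_TO_STACK_RESERVE is a
hypothetical per-arch constant (the ~1K above), current_stack_pointer
is the x86 helper, and we are still running on prev's stack when this
is computed:

#define SWITCH_TO_STACK_RESERVE	1024	/* hypothetical, per-arch */

/* Sketch: lowest address of prev's stack that may still be in use
 * while it sleeps; every mapped page strictly below this can go. */
static unsigned long sleeping_stack_floor(struct task_struct *prev)
{
	unsigned long sp = current_stack_pointer - SWITCH_TO_STACK_RESERVE;
	unsigned long base = (unsigned long)task_stack_page(prev);

	return max(ALIGN_DOWN(sp, PAGE_SIZE), base);
}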

>  Also, the key
> point I was trying to make is that you cannot safely shrink the active
> stack.  It can only be done after the stack switch to the new task.

Can you please elaborate on why this is so? If the lowest pages are not
used, and interrupts are disabled, what is unsafe about removing them
from the page table?

I am not against the idea of unmapping in __switch_to(); I just want
to understand why a more generic, but perhaps less precise, approach
would not work.

> > > Populating the next task's stack would
> > > happen here.  Then it would return to the assembly code to do the
> > > stack switch.  The second part then loads the context of the next
> > > task, and finalizes any work for the previous task.  Reclaiming the
> > > unused stack pages of the previous task would happen here.
> >
> > The problem with this (and the origianl Andy's approach), is that we
> > cannot sleep here. What happens if we get per-cpu stack cache
> > exhausted because several threads sleep while having deep stacks? How
> > can we schedule the next task? This is probably a corner case, but it
> > needs to have a proper handling solution. One solution is while in
> > schedule() and while interrupts are still enabled before going to
> > switch_to() we must pre-allocate 3-page in the per-cpu. However, what
> > if the pre-allocation itself calls cond_resched() because it enters
> > page allocator slowpath?
>
> You would have to keep extra pages in reserve for allocation failures.
> mempool could probably help with that.

Right. Mempools do not work when interrupts are disabled, but perhaps
we can use one to keep the per-CPU cache filled from a separate
thread. I will think about it.

Thanks,
Pasha
Pasha Tatashin March 18, 2024, 3:09 p.m. UTC | #31
On Sun, Mar 17, 2024 at 2:58 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Pasha Tatashin
> > Sent: 16 March 2024 19:18
> ...
> > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack.  (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
>
> Isn't that entirely horrid for TLB use and so will require a lot of IPI?

The TLB load is going to be exactly the same as today; we already use
small pages for vmapped stacks. We won't need extra flushing either:
the mappings are in kernel space, and once pages are removed from the
page table, no one is going to access that VA space until that thread
enters the kernel again. We will need to invalidate the VA range only
when the pages are mapped, and only on the local cpu.

> Remember, if a thread sleeps in 'extra stack' and is then resheduled
> on a different cpu the extra pages get 'pumped' from one cpu to
> another.

Yes, the per-cpu cache can get unbalanced this way; we can remember
the original CPU from which we acquired the pages and return them to
the same place.

> I also suspect a stack_probe() is likely to end up being a cache miss
> and also slow???

Can you please elaborate on this point? I am not aware of
stack_probe() and how it is used.

> So you wouldn't want one on all calls.
> I'm not sure you'd want a conditional branch either.
>
> The explicit request for 'more stack' can be required to be allowed
> to sleep - removing a lot of issues.
> It would also be portable to all architectures.
> I'd also suspect that any thread that needs extra stack is likely
> to need to again.
> So while the memory could be recovered, I'd bet is isn't worth
> doing except under memory pressure.
> The call could also return 'no' - perhaps useful for (broken) code
> that insists on being recursive.

The approach currently being discussed is somewhat different from an
explicit more-stack request API. I am investigating how feasible it is
to multiplex kernel stack pages, so the same pages can be re-used by
many threads only while they are actually needed. If the multiplexing
approach doesn't work out, I will come back to the explicit more-stack
API.

Pasha Tatashin March 18, 2024, 3:13 p.m. UTC | #32
On Mon, Mar 18, 2024 at 11:09 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Sun, Mar 17, 2024 at 2:58 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Pasha Tatashin
> > > Sent: 16 March 2024 19:18
> > ...
> > > Expanding on Mathew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel Threads: Create all kernel threads with a fully populated
> > > THREAD_SIZE stack.  (i.e. 16K)
> > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > but only the top page mapped. (i.e. 4K)
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > three additional pages from the per-CPU stack cache. This function is
> > > called early in kernel entry points.
> > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > the per-CPU cache. This function is called late in the kernel exit
> > > path.
> >
> > Isn't that entirely horrid for TLB use and so will require a lot of IPI?
>
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

The TLB miss rate is going to increase, but only very slightly: stacks
are small (4 pages, of which only 3 are dynamic), so there are at most
2-3 new misses per syscall, and only for the complicated deep
syscalls. I therefore suspect it won't affect real-world performance.

> > Remember, if a thread sleeps in 'extra stack' and is then resheduled
> > on a different cpu the extra pages get 'pumped' from one cpu to
> > another.
>
> Yes, the per-cpu cache can get unbalanced this way, we can remember
> the original CPU where we acquired the pages to return to the same
> place.
>
> > I also suspect a stack_probe() is likely to end up being a cache miss
> > and also slow???
>
> Can you please elaborate on this point. I am not aware of
> stack_probe() and how it is used.
>
> > So you wouldn't want one on all calls.
> > I'm not sure you'd want a conditional branch either.
> >
> > The explicit request for 'more stack' can be required to be allowed
> > to sleep - removing a lot of issues.
> > It would also be portable to all architectures.
> > I'd also suspect that any thread that needs extra stack is likely
> > to need to again.
> > So while the memory could be recovered, I'd bet is isn't worth
> > doing except under memory pressure.
> > The call could also return 'no' - perhaps useful for (broken) code
> > that insists on being recursive.
>
> The current approach discussed is somewhat different from explicit
> more stack requests API. I am investigating how feasible it is to use
> kernel stack multiplexing, so the same pages can be re-used by many
> threads when they are actually used. If the multiplexing approach
> won't work, I will come back to the explicit more stack API.
>
Matthew Wilcox March 18, 2024, 3:19 p.m. UTC | #33
On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

No; we can pass pointers to our kernel stack to other threads.  The
obvious one is a mutex; we put a mutex_waiter on our own stack and
add its list_head to the mutex's waiter list.  I'm sure you can
think of many other places we do this (eg wait queues, poll(), select(),
etc).
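
For instance, in the completely ordinary pattern below (generic code,
nothing specific to this series), the wait entry lives on the sleeping
task's kernel stack, and the waker, possibly on another CPU, walks the
wait queue and writes to that stack memory:

#include <linux/wait.h>
#include <linux/sched/signal.h>

static int wait_for_flag(wait_queue_head_t *wq, bool *flag)
{
	DEFINE_WAIT(wait);	/* lives on this task's kernel stack */
	int ret = 0;

	for (;;) {
		prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);
		if (READ_ONCE(*flag))
			break;
		if (signal_pending(current)) {
			ret = -ERESTARTSYS;
			break;
		}
		schedule();	/* a waker on another CPU touches &wait */
	}
	finish_wait(wq, &wait);
	return ret;
}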
Pasha Tatashin March 18, 2024, 3:30 p.m. UTC | #34
On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > The TLB load is going to be exactly the same as today, we already use
> > small pages for VMA mapped stacks. We won't need to have extra
> > flushing either, the mappings are in the kernel space, and once pages
> > are removed from the page table, no one is going to access that VA
> > space until that thread enters the kernel again. We will need to
> > invalidate the VA range only when the pages are mapped, and only on
> > the local cpu.
>
> No; we can pass pointers to our kernel stack to other threads.  The
> obvious one is a mutex; we put a mutex_waiter on our own stack and
> add its list_head to the mutex's waiter list.  I'm sure you can
> think of many other places we do this (eg wait queues, poll(), select(),
> etc).

Hm, it means that a task sleeping in kernel space has its stack pages
mapped and invalidated on the local CPU, but access to those stack
pages from a remote CPU would be problematic.

I think we still won't need IPIs, but VA-range invalidation is actually
needed on unmap, and should happen during context switch, i.e. every
time we go off-cpu. Therefore, what Brian/Andy have suggested makes
more sense than the kernel entry/exit paths.

Pasha
David Laight March 18, 2024, 3:38 p.m. UTC | #35
...
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.

Why bother?
The number of tasks running in user_mode is limited to the number
of cpu. So the most you save is a few pages per cpu.

Plausibly a context switch from an interrupt (eg timer tick)
could suspend a task without saving anything on its kernel stack.
But how common is that in reality?
In a well behaved system most user threads will be sleeping on
some event - so with an active kernel stack.

I can also imagine that something like sys_epoll() actually
sleeps with not (that much) stack allocated.
But the calls into all the drivers to check the status
could easily go into another page.
You really wouldn't want to keep allocating and deallocating
physical pages (which I'm sure has TLB flushing costs)
all the time for those processes.

Perhaps a 'garbage collection' activity that reclaims stack
pages from processes that have been asleep 'for a while' or
haven't used a lot of stack recently (if hw 'page accessed'
bit can be used) might make more sense.

Have you done any instrumentation to see which system calls
are actually using more than (say) 8k of stack?
And how often the user threads that make those calls do so?

	David

David Laight March 18, 2024, 3:53 p.m. UTC | #36
From: Pasha Tatashin
> Sent: 18 March 2024 15:31
> 
> On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > > The TLB load is going to be exactly the same as today, we already use
> > > small pages for VMA mapped stacks. We won't need to have extra
> > > flushing either, the mappings are in the kernel space, and once pages
> > > are removed from the page table, no one is going to access that VA
> > > space until that thread enters the kernel again. We will need to
> > > invalidate the VA range only when the pages are mapped, and only on
> > > the local cpu.
> >
> > No; we can pass pointers to our kernel stack to other threads.  The
> > obvious one is a mutex; we put a mutex_waiter on our own stack and
> > add its list_head to the mutex's waiter list.  I'm sure you can
> > think of many other places we do this (eg wait queues, poll(), select(),
> > etc).
> 
> Hm, it means that stack is sleeping in the kernel space, and has its
> stack pages mapped and invalidated on the local CPU, but access from
> the remote CPU to that stack pages would be problematic.
> 
> I think we still won't need IPI, but VA-range invalidation is actually
> needed on unmaps, and should happen during context switch so every
> time we go off-cpu. Therefore, what Brian/Andy have suggested makes
> more sense instead of kernel/enter/exit paths.

I think you'll need to broadcast an invalidate.
Consider:
CPU A: task allocates extra pages and adds something to some list.
CPU B: accesses that data and maybe modifies it.
	Some page-table walk sets up the TLB.
CPU A: task detects the modify, removes the item from the list,
	collapses back the stack and sleeps.
	Stack pages freed.
CPU A: task wakes up (on the same cpu for simplicity).
	Goes down a deep stack and puts an item on a list.
	Different physical pages are allocated.
CPU B: accesses the associated KVA.
	It better not have a cached TLB.

Doesn't that need an IPI?

Freeing the pages is much harder than allocating them.
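
For illustration, a minimal sketch of what the unmap side would have to
do, assuming the stack is an ordinary vmapped range; stack_shrink() and
its keep_above argument are made-up names, while vunmap_range() is an
existing helper:

#include <linux/sched/task_stack.h>
#include <linux/vmalloc.h>

static void stack_shrink(struct task_struct *tsk, unsigned long keep_above)
{
        unsigned long lowest = (unsigned long)task_stack_page(tsk);

        /*
         * vunmap_range() finishes with flush_tlb_kernel_range(), i.e. a
         * broadcast flush: every CPU that may have cached a translation
         * for this range (mutex waiters, poll, wait queues) has to be
         * told, and that is where the cost is.
         */
        if (keep_above > lowest)
                vunmap_range(lowest, keep_above);
}

Mapping new pages is the easy direction: at least on x86 the TLB does
not cache not-present PTEs, so growing the stack needs no flush at all.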

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Pasha Tatashin March 18, 2024, 4:57 p.m. UTC | #37
> I think you'll need to broadcast an invalidate.
> Consider:
> CPU A: task allocates extra pages and adds something to some list.
> CPU B: accesses that data and maybe modifies it.
>         Some page-table walk sets up the TLB.
> CPU A: task detects the modify, removes the item from the list,
>         collapses back the stack and sleeps.
>         Stack pages freed.
> CPU A: task wakes up (on the same cpu for simplicity).
>         Goes down a deep stack and puts an item on a list.
>         Different physical pages are allocated.
> CPU B: accesses the associated KVA.
>         It better not have a cached TLB.
>
> Doesn't that need an IPI?

Yes, this is annoying. If we share a stack with another CPU, then get
a new stack, and share it again with another CPU, we get in trouble.
Yet, an IPI during context switch would kill the performance :-\

I wonder if there is a way to optimize this scenario like doing IPI
invalidation only after stack sharing?

Pasha
Pasha Tatashin March 18, 2024, 5 p.m. UTC | #38
On Mon, Mar 18, 2024 at 11:39 AM David Laight <David.Laight@aculab.com> wrote:
>
> ...
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
>
> Why bother?
> The number of tasks running in user_mode is limited to the number
> of cpu. So the most you save is a few pages per cpu.
>
> Plausibly a context switch from an interrupt (eg timer tick)
> could suspend a task without saving anything on its kernel stack.
> But how common is that in reality?
> In a well behaved system most user threads will be sleeping on
> some event - so with an active kernel stack.
>
> I can also imagine that something like sys_epoll() actually
> sleeps with not (that much) stack allocated.
> But the calls into all the drivers to check the status
> could easily go into another page.
> You really wouldn't want to keep allocating and deallocating
> physical pages (which I'm sure has TLB flushing costs)
> all the time for those processes.
>
> Perhaps a 'garbage collection' activity that reclaims stack
> pages from processes that have been asleep 'for a while' or
> haven't used a lot of stack recently (if hw 'page accessed'
> bit can be used) might make more sense.
>
> Have you done any instrumentation to see which system calls
> are actually using more than (say) 8k of stack?
> And how often the user threads that make those calls do so?

None of our syscalls, AFAIK.

Pasha

>
>         David
>
Pasha Tatashin March 18, 2024, 5:37 p.m. UTC | #39
> > Perhaps a 'garbage collection' activity that reclaims stack
> > pages from processes that have been asleep 'for a while' or
> > haven't used a lot of stack recently (if hw 'page accessed'
> > bit can be used) might make more sense.

Interesting approach: we could take Andy's original suggestion of using
the accessed bit to know which stack pages were never used and unmap
them during context switch, and, as an extra optimization, have a
"garbage collector" that unmaps stacks of long-sleeping, rarely used
threads.  I will think about this.
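
Roughly, something like the following untested x86-only sketch:
dynamic_stack_unmap_page() is a made-up helper, thread.sp and
lookup_address() are x86-specific, and all the locking needed to keep a
task from waking up while we poke at its stack is ignored here.

#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/sched/signal.h>
#include <linux/sched/task_stack.h>
#include <linux/pgtable.h>

static int stack_gc(void *unused)
{
        struct task_struct *g, *t;

        while (!kthread_should_stop()) {
                rcu_read_lock();
                for_each_process_thread(g, t) {
                        unsigned long base = (unsigned long)task_stack_page(t);
                        /* no red zone: pages below the saved sp hold no live data */
                        unsigned long low = ALIGN_DOWN(t->thread.sp, PAGE_SIZE);
                        unsigned long addr;

                        if (task_is_running(t))
                                continue;       /* only consider sleeping tasks */

                        for (addr = base; addr < low; addr += PAGE_SIZE) {
                                unsigned int level;
                                pte_t *pte = lookup_address(addr, &level);

                                if (!pte || !pte_present(*pte))
                                        continue;
                                if (pte_young(*pte)) {
                                        /* used recently: age it for the next pass */
                                        set_pte(pte, pte_mkold(*pte));
                                } else {
                                        dynamic_stack_unmap_page(t, addr); /* made up */
                                }
                        }
                }
                rcu_read_unlock();
                schedule_timeout_interruptible(10 * HZ);
        }
        return 0;
}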

Thanks,
Pasha
Brian Gerst March 18, 2024, 9:02 p.m. UTC | #40
On Mon, Mar 18, 2024 at 11:00 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Sun, Mar 17, 2024 at 5:30 PM Brian Gerst <brgerst@gmail.com> wrote:
> >
> > On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <brgerst@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> > > > <pasha.tatashin@soleen.com> wrote:
> > > > >
> > > > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > > > >
> > > > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > > > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >>
> > > > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > > > >> > > > which would have to be untangled.
> > > > > > >> > >
> > > > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > > > >> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > > > >> > > stack!  That breaks a number of things including lockdep (because the
> > > > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > > > >> > > thread owns the lock).
> > > > > > >> > >
> > > > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > > > >> > > and if less than that was available, we could allocate a temporary
> > > > > > >> > > stack and switch to it.  I suspect Google would also be able to use this
> > > > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > > > >> > > Who knows, we might all be able to use such a thing.
> > > > > > >> > >
> > > > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > > > >> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > > > >
> > > > > > >I like this approach! I think we could also consider having permanent
> > > > > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > > > > >stack increase framework could work well and wouldn't negatively
> > > > > > >impact the performance of context switching. However, thorough
> > > > > > >analysis would be necessary to proactively identify potential stack
> > > > > > >overflow situations.
> > > > > > >
> > > > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > > > >> > like everything we need for what you're talking about.
> > > > > > >>
> > > > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > > > >> Peter described has extra overhead?  This was all described up-thread.
> > > > > > >
> > > > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > > > >handle race conditions such as what was shown by tglx. However, as
> > > > > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > > > >possible).
> > > > > > >
> > > > > > >My question, however, if we ignore memory savings and only consider
> > > > > > >reliability aspect of this feature.  What is better unconditionally
> > > > > > >crashing the machine because a guard page was reached, or printing a
> > > > > > >huge warning with a backtracing information about the offending stack,
> > > > > > >handling the fault, and survive? I know that historically Linus
> > > > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > > > >scenario compared to simple BUG vs WARN.
> > > > > > >
> > > > > > >Pasha
> > > > > > >
> > > > > > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > > > > > >
> > > > > >
> > > > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > > > >
> > > > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > > > I suppose the only way for this to work would be to use a dedicated
> > > > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > > > that might introduce other complications.
> > > > >
> > > > > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > > > > sizes, here's what I'm thinking:
> > > > >
> > > > > - Kernel Threads: Create all kernel threads with a fully populated
> > > > > THREAD_SIZE stack.  (i.e. 16K)
> > > > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > > > but only the top page mapped. (i.e. 4K)
> > > > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > > > three additional pages from the per-CPU stack cache. This function is
> > > > > called early in kernel entry points.
> > > > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > > > the per-CPU cache. This function is called late in the kernel exit
> > > > > path.
> > > > >
> > > > > Both of the above hooks are called with IRQ disabled on all kernel
> > > > > entries whether through interrupts and syscalls, and they are called
> > > > > early/late enough that 4K is enough to handle the rest of entry/exit.
> > >
> > > Hi Brian,
> > >
> > > > This proposal will not have the memory savings that you are looking
> > > > for, since sleeping tasks would still have a fully allocated stack.
> > >
> > > The tasks that were descheduled while running in user mode should not
> > > increase their stack. The potential saving is greater than the
> > > original proposal, because in the original proposal we never shrink
> > > stacks after faults.
> >
> > A task has to enter kernel mode in order to be rescheduled.  If it
> > doesn't make a syscall or hit an exception, then the timer interrupt
> > will eventually kick it out of user mode.  At some point schedule() is
> > called, the task is put to sleep and context is switched to the next
> > task.  A sleeping task will always be using some amount of kernel
> > stack.  How much depends a lot on what caused the task to sleep.  If
> > the timeslice expired it could switch right before the return to user
> > mode.  A page fault could go deep into filesystem and device code
> > waiting on an I/O operation.
> >
> > > > This also would add extra overhead to each entry and exit (including
> > > > syscalls) that can happen multiple times before a context switch.  It
> > > > also doesn't make much sense because a task running in user mode will
> > > > quickly need those stack pages back when it returns to kernel mode.
> > > > Even if it doesn't make a syscall, the timer interrupt will kick it
> > > > out of user mode.
> > > >
> > > > What should happen is that the unused stack is reclaimed when a task
> > > > goes to sleep.  The kernel does not use a red zone, so any stack pages
> > > > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > > > be safely discarded.  Before context switching to a task, fully
> > >
> > > Excellent observation, this makes Andy Lutomirski per-map proposal [1]
> > > usable without tracking dirty/accessed bits. More reliable, and also
> > > platform independent.
> >
> > This is x86-specific.  Other architectures will likely have differences.
> >
> > > > populate its task stack.  After context switching from a task, reclaim
> > > > its unused stack.  This way, the task stack in use is always fully
> > > > allocated and we don't have to deal with page faults.
> > > >
> > > > To make this happen, __switch_to() would have to be split into two
> > > > parts, to cleanly separate what happens before and after the stack
> > > > switch.  The first part saves processor context for the previous task,
> > > > and prepares the next task.
> > >
> > > By knowing the stack requirements of __switch_to(), can't we actually
> > > do all that in the common code in context_switch() right before
> > > __switch_to()? We would do an arch specific call to get the
> > > __switch_to() stack requirement, and use that to change the value of
> > > task->thread.sp to know where the stack is going to be while sleeping.
> > > At this time we can do the unmapping of the stack pages from the
> > > previous task, and mapping the pages to the next task.
> >
> > task->thread.sp is set in __switch_to_asm(), and is pretty much the
> > last thing done in the context of the previous task.  Trying to
> > predict that value ahead of time is way too fragile.
>
> We don't require an exact value, but rather an approximate upper
> limit. To illustrate, subtract 1K from the current .sp, then determine
> the corresponding page to decide the number of pages needing
> unmapping. The primary advantage is that we can avoid
> platform-specific ifdefs for DYNAMIC_STACKS within the arch-specific
> switch_to() function. Instead, each platform can provide an
> appropriate upper bound for switch_to() operations. We know the amount
> of information is going to be stored on the stack by the routines, and
> also since interrupts are disabled stacks are not used for anything
> else there, so I do not see a problem with determining a reasonable
> upper bound.

The stack usage will vary depending on compiler version and
optimization settings.  Making an educated guess is possible, but may
not be enough in the future.

What would be nice is to get some actual data on stack usage under
various workloads, both maximum depth and depth at context switch.
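
For what it's worth, a rough sketch of how such data could be sampled,
assuming CONFIG_DEBUG_STACK_USAGE for stack_not_used(), the x86
current_stack_pointer helper, and a call site in context_switch() right
before switch_to():

#include <linux/kernel.h>
#include <linux/sched/task_stack.h>
#include <asm/asm.h>            /* current_stack_pointer on x86 */

/* Called with prev == current, just before switching away from it. */
static inline void record_stack_depth(struct task_struct *prev)
{
        unsigned long top = (unsigned long)task_stack_page(prev) + THREAD_SIZE;
        unsigned long depth_now = top - current_stack_pointer;
        unsigned long depth_max = THREAD_SIZE - stack_not_used(prev);

        trace_printk("%s: depth at switch %lu, max depth %lu\n",
                     prev->comm, depth_now, depth_max);
}

A histogram over the resulting trace would answer both questions.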

> >  Also, the key
> > point I was trying to make is that you cannot safely shrink the active
> > stack.  It can only be done after the stack switch to the new task.
>
> Can you please elaborate why this is so? If the lowest pages are not
> used, and interrupts are disabled, what is not safe about removing them
> from the page table?
>
> I am not against the idea of unmapping in __switch_to(), I just want
> to understand the reasons why a more generic but perhaps not as precise
> approach would not work.

As long as a wide buffer is given, it would probably be safe.  But it
would still be safer and more precise if done after the switch.
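
For illustration, a minimal sketch of the after-the-switch variant,
hung off finish_task_switch(), where we are already running on the next
task's stack.  vunmap_range() only stands in for whatever helper the
series ends up using, and the cross-CPU TLB question discussed earlier
in the thread still applies:

#include <linux/kernel.h>
#include <linux/sched/task_stack.h>
#include <linux/vmalloc.h>

/*
 * Called from finish_task_switch(prev): we now run on @current's stack,
 * so @prev cannot grow its own stack under us, and with no kernel red
 * zone everything below its saved stack pointer is dead.
 */
static void dynamic_stack_shrink(struct task_struct *prev)
{
        unsigned long base = (unsigned long)task_stack_page(prev);
        unsigned long low = ALIGN_DOWN(prev->thread.sp, PAGE_SIZE); /* x86 */

        if (prev->flags & PF_KTHREAD)
                return;         /* kernel threads keep their full stack */

        if (low > base)
                vunmap_range(base, low); /* includes a broadcast TLB flush */
}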



Brian Gerst
Pasha Tatashin March 19, 2024, 2:56 p.m. UTC | #41
On Mon, Mar 18, 2024 at 5:02 PM Brian Gerst <brgerst@gmail.com> wrote:
>
> On Mon, Mar 18, 2024 at 11:00 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Sun, Mar 17, 2024 at 5:30 PM Brian Gerst <brgerst@gmail.com> wrote:
> > >
> > > On Sun, Mar 17, 2024 at 12:15 PM Pasha Tatashin
> > > <pasha.tatashin@soleen.com> wrote:
> > > >
> > > > On Sun, Mar 17, 2024 at 10:43 AM Brian Gerst <brgerst@gmail.com> wrote:
> > > > >
> > > > > On Sat, Mar 16, 2024 at 3:18 PM Pasha Tatashin
> > > > > <pasha.tatashin@soleen.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 14, 2024 at 11:40 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > > > > >
> > > > > > > On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> > > > > > > >On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >>
> > > > > > > >> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> > > > > > > >> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > > > > > > >> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > > > > > >> > > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > > > > > >> > > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > > > > > >> > > > which would have to be untangled.
> > > > > > > >> > >
> > > > > > > >> > > I think there are other ways of getting the benefit that Pasha is seeking
> > > > > > > >> > > without moving to dynamically allocated kernel memory.  One icky thing
> > > > > > > >> > > that XFS does is punt work over to a kernel thread in order to use more
> > > > > > > >> > > stack!  That breaks a number of things including lockdep (because the
> > > > > > > >> > > kernel thread doesn't own the lock, the thread waiting for the kernel
> > > > > > > >> > > thread owns the lock).
> > > > > > > >> > >
> > > > > > > >> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > > > > > > >> > > and if less than that was available, we could allocate a temporary
> > > > > > > >> > > stack and switch to it.  I suspect Google would also be able to use this
> > > > > > > >> > > API for their rare cases when they need more than 8kB of kernel stack.
> > > > > > > >> > > Who knows, we might all be able to use such a thing.
> > > > > > > >> > >
> > > > > > > >> > > I'd been thinking about this from the point of view of allocating more
> > > > > > > >> > > stack elsewhere in kernel space, but combining what Pasha has done here
> > > > > > > >> > > with this idea might lead to a hybrid approach that works better; allocate
> > > > > > > >> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > > > > > > >> > > rely on people using this "I need more stack" API correctly, and free the
> > > > > > > >> > > excess pages on return to userspace.  No complicated "switch stacks" API
> > > > > > > >> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
> > > > > > > >
> > > > > > > >I like this approach! I think we could also consider having permanent
> > > > > > > >big stacks for some kernel only threads like kvm-vcpu. A cooperative
> > > > > > > >stack increase framework could work well and wouldn't negatively
> > > > > > > >impact the performance of context switching. However, thorough
> > > > > > > >analysis would be necessary to proactively identify potential stack
> > > > > > > >overflow situations.
> > > > > > > >
> > > > > > > >> > Why would we need an "I need more stack" API? Pasha's approach seems
> > > > > > > >> > like everything we need for what you're talking about.
> > > > > > > >>
> > > > > > > >> Because double faults are hard, possibly impossible, and the FRED approach
> > > > > > > >> Peter described has extra overhead?  This was all described up-thread.
> > > > > > > >
> > > > > > > >Handling faults in #DF is possible. It requires code inspection to
> > > > > > > >handle race conditions such as what was shown by tglx. However, as
> > > > > > > >Andy pointed out, this is not supported by SDM as it is an abort
> > > > > > > >context (yet we return from it because of ESPFIX64, so return is
> > > > > > > >possible).
> > > > > > > >
> > > > > > > >My question, however, if we ignore memory savings and only consider
> > > > > > > >reliability aspect of this feature.  What is better unconditionally
> > > > > > > >crashing the machine because a guard page was reached, or printing a
> > > > > > > >huge warning with a backtracing information about the offending stack,
> > > > > > > >handling the fault, and survive? I know that historically Linus
> > > > > > > >preferred WARN() to BUG() [1]. But, this is a somewhat different
> > > > > > > >scenario compared to simple BUG vs WARN.
> > > > > > > >
> > > > > > > >Pasha
> > > > > > > >
> > > > > > > >[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com
> > > > > > > >
> > > > > > >
> > > > > > > The real issue with using #DF is that if the event that caused it was asynchronous, you could lose the event.
> > > > > >
> > > > > > Got it. So, using a #DF handler for stack page faults isn't feasible.
> > > > > > I suppose the only way for this to work would be to use a dedicated
> > > > > > Interrupt Stack Table (IST) entry for page faults (#PF), but I suspect
> > > > > > that might introduce other complications.
> > > > > >
> > > > > > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > > > > > sizes, here's what I'm thinking:
> > > > > >
> > > > > > - Kernel Threads: Create all kernel threads with a fully populated
> > > > > > THREAD_SIZE stack.  (i.e. 16K)
> > > > > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > > > > but only the top page mapped. (i.e. 4K)
> > > > > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > > > > three additional pages from the per-CPU stack cache. This function is
> > > > > > called early in kernel entry points.
> > > > > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > > > > the per-CPU cache. This function is called late in the kernel exit
> > > > > > path.
> > > > > >
> > > > > > Both of the above hooks are called with IRQ disabled on all kernel
> > > > > > entries whether through interrupts and syscalls, and they are called
> > > > > > early/late enough that 4K is enough to handle the rest of entry/exit.
> > > >
> > > > Hi Brian,
> > > >
> > > > > This proposal will not have the memory savings that you are looking
> > > > > for, since sleeping tasks would still have a fully allocated stack.
> > > >
> > > > The tasks that were descheduled while running in user mode should not
> > > > increase their stack. The potential saving is greater than the
> > > > original proposal, because in the original proposal we never shrink
> > > > stacks after faults.
> > >
> > > A task has to enter kernel mode in order to be rescheduled.  If it
> > > doesn't make a syscall or hit an exception, then the timer interrupt
> > > will eventually kick it out of user mode.  At some point schedule() is
> > > called, the task is put to sleep and context is switched to the next
> > > task.  A sleeping task will always be using some amount of kernel
> > > stack.  How much depends a lot on what caused the task to sleep.  If
> > > the timeslice expired it could switch right before the return to user
> > > mode.  A page fault could go deep into filesystem and device code
> > > waiting on an I/O operation.
> > >
> > > > > This also would add extra overhead to each entry and exit (including
> > > > > syscalls) that can happen multiple times before a context switch.  It
> > > > > also doesn't make much sense because a task running in user mode will
> > > > > quickly need those stack pages back when it returns to kernel mode.
> > > > > Even if it doesn't make a syscall, the timer interrupt will kick it
> > > > > out of user mode.
> > > > >
> > > > > What should happen is that the unused stack is reclaimed when a task
> > > > > goes to sleep.  The kernel does not use a red zone, so any stack pages
> > > > > below the saved stack pointer of a sleeping task (task->thread.sp) can
> > > > > be safely discarded.  Before context switching to a task, fully
> > > >
> > > > Excellent observation, this makes Andy Lutomirski per-map proposal [1]
> > > > usable without tracking dirty/accessed bits. More reliable, and also
> > > > platform independent.
> > >
> > > This is x86-specific.  Other architectures will likely have differences.
> > >
> > > > > populate its task stack.  After context switching from a task, reclaim
> > > > > its unused stack.  This way, the task stack in use is always fully
> > > > > allocated and we don't have to deal with page faults.
> > > > >
> > > > > To make this happen, __switch_to() would have to be split into two
> > > > > parts, to cleanly separate what happens before and after the stack
> > > > > switch.  The first part saves processor context for the previous task,
> > > > > and prepares the next task.
> > > >
> > > > By knowing the stack requirements of __switch_to(), can't we actually
> > > > do all that in the common code in context_switch() right before
> > > > __switch_to()? We would do an arch specific call to get the
> > > > __switch_to() stack requirement, and use that to change the value of
> > > > task->thread.sp to know where the stack is going to be while sleeping.
> > > > At this time we can do the unmapping of the stack pages from the
> > > > previous task, and mapping the pages to the next task.
> > >
> > > task->thread.sp is set in __switch_to_asm(), and is pretty much the
> > > last thing done in the context of the previous task.  Trying to
> > > predict that value ahead of time is way too fragile.
> >
> > We don't require an exact value, but rather an approximate upper
> > limit. To illustrate, subtract 1K from the current .sp, then determine
> > the corresponding page to decide the number of pages needing
> > unmapping. The primary advantage is that we can avoid
> > platform-specific ifdefs for DYNAMIC_STACKS within the arch-specific
> > switch_to() function. Instead, each platform can provide an
> > appropriate upper bound for switch_to() operations. We know the amount
> > of information is going to be stored on the stack by the routines, and
> > also since interrupts are disabled stacks are not used for anything
> > else there, so I do not see a problem with determining a reasonable
> > upper bound.
>
> The stack usage will vary depending on compiler version and
> optimization settings.  Making an educated guess is possible, but may
> not be enough in the future.
>
> What would be nice is to get some actual data on stack usage under
> various workloads, both maximum depth and depth at context switch.
>
> > >  Also, the key
> > > point I was trying to make is that you cannot safely shrink the active
> > > stack.  It can only be done after the stack switch to the new task.
> >
> > Can you please elaborate why this is so? If the lowest pages are not
> > used, and interrupts are disabled, what is not safe about removing them
> > from the page table?
> >
> > I am not against the idea of unmapping in __switch_to(), I just want
> > to understand the reasons why a more generic but perhaps not as precise
> > approach would not work.
>
> As long as a wide buffer is given, it would probably be safe.  But it
> would still be safer and more precise if done after the switch.

Makes sense. Looks like using task->thread.sp during context switch is
not possible because the pages might have been shared with another CPU.
We would need to do an IPI TLB invalidation, which would be too
expensive for the context switch. Therefore, using the PTE accessed bit
is more reliable for determining which pages can be unmapped. However,
we could still use task->thread.sp in a garbage collector.

Pasha