[v2] vmstat: Keep count of the maximum page reached by the kernel stack

Message ID 20240314145457.1106299-1-pasha.tatashin@soleen.com (mailing list archive)
State New

Commit Message

Pasha Tatashin March 14, 2024, 2:54 p.m. UTC
CONFIG_DEBUG_STACK_USAGE provides a mechanism to determine the minimum
amount of memory left in a stack. Every time a new low-memory record is
reached, a message is printed to the console.

However, this doesn't reveal how many pages within each stack were
actually used. Introduce a mechanism that keeps count of the number of
times each of the stack's pages was reached:

	$ grep kstack /proc/vmstat
	kstack_page_1 19974
	kstack_page_2 94
	kstack_page_3 0
	kstack_page_4 0

In the above example, out of 20,068 threads that exited on this
machine, only 94 reached the second page of their stack, and none
touched pages three or four.
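
As a rough illustration of the arithmetic above, the counters can be
tallied in userspace. This is only a sketch and not part of the patch:
the helper kstack_summarize() and struct kstack_summary are hypothetical
names, and the input is assumed to be text in the "<name> <value>"
format of /proc/vmstat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical summary type, introduced only for this illustration. */
struct kstack_summary {
	unsigned long long total;	/* all threads counted */
	unsigned long long beyond1;	/* threads whose deepest page was > 1 */
};

/*
 * Hypothetical helper: sum every kstack_page_* counter found in a
 * /proc/vmstat-style buffer. Lines that do not match are ignored.
 */
static struct kstack_summary kstack_summarize(const char *vmstat_text)
{
	struct kstack_summary s = { 0, 0 };
	const char *p = vmstat_text;

	while (*p) {
		size_t len = strcspn(p, "\n");	/* length of the current line */
		char buf[128];
		char name[64];
		unsigned long long val;

		if (len < sizeof(buf)) {
			memcpy(buf, p, len);
			buf[len] = '\0';
			/* Each vmstat line has the form "<name> <value>". */
			if (sscanf(buf, "%63s %llu", name, &val) == 2 &&
			    strncmp(name, "kstack_page_", 12) == 0) {
				s.total += val;
				if (strcmp(name, "kstack_page_1") != 0)
					s.beyond1 += val;
			}
		}
		p += len;
		if (*p == '\n')
			p++;
	}
	return s;
}
```

Fed the sample output above, this gives total = 20068 and beyond1 = 94,
i.e. roughly 0.47% of the counted threads went past the first page.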

In fleet environments with millions of machines, this data can help
optimize kernel stack sizes.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
Changelog:
v2:
- Fixed enum name KSTACK_PAGE_5 -> KSTACK_PAGE_REST.
- Improved commit message based on Christophe Leroy's
  comment.

 include/linux/sched/task_stack.h | 40 ++++++++++++++++++++++++++++++--
 include/linux/vm_event_item.h    | 29 +++++++++++++++++++++++
 include/linux/vmstat.h           | 16 -------------
 mm/vmstat.c                      | 11 +++++++++
 4 files changed, 78 insertions(+), 18 deletions(-)

Comments

Andrew Morton March 18, 2024, 8:40 p.m. UTC | #1
On Thu, 14 Mar 2024 14:54:57 +0000 Pasha Tatashin <pasha.tatashin@soleen.com> wrote:

> CONFIG_DEBUG_STACK_USAGE provides a mechanism to determine the minimum
> amount of memory left in a stack. Every time a new low-memory record is
> reached, a message is printed to the console.
> 
> However, this doesn't reveal how many pages within each stack were
> actually used. Introduce a mechanism that keeps count of the number of
> times each of the stack's pages was reached:
> 
> 	$ grep kstack /proc/vmstat
> 	kstack_page_1 19974
> 	kstack_page_2 94
> 	kstack_page_3 0
> 	kstack_page_4 0
> 
> In the above example, out of 20,068 threads that exited on this
> machine, only 94 reached the second page of their stack, and none
> touched pages three or four.
> 
> In fleet environments with millions of machines, this data can help
> optimize kernel stack sizes.

We really should have somewhere to document vmstat things.

> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -153,10 +153,39 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		VMA_LOCK_ABORT,
>  		VMA_LOCK_RETRY,
>  		VMA_LOCK_MISS,
> +#endif
> +#ifdef CONFIG_DEBUG_STACK_USAGE
> +		KSTACK_PAGE_1,
> +		KSTACK_PAGE_2,
> +#if THREAD_SIZE >= (4 * PAGE_SIZE)
> +		KSTACK_PAGE_3,
> +		KSTACK_PAGE_4,
> +#endif
> +#if THREAD_SIZE > (4 * PAGE_SIZE)
> +		KSTACK_PAGE_REST,
> +#endif
>  #endif
>  		NR_VM_EVENT_ITEMS
>  };

This seems a rather cumbersome way to produce a kind of histogram.  I
wonder if there should be a separate pseudo file for this.

And there may be a call for extending this.  For example I can foresee
people wanting to know "hey, which process did that", in which case
we'll want to record additional info.

Pasha Tatashin March 19, 2024, 2:23 p.m. UTC | #2
On Mon, Mar 18, 2024 at 4:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 14 Mar 2024 14:54:57 +0000 Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> > CONFIG_DEBUG_STACK_USAGE provides a mechanism to determine the minimum
> > amount of memory left in a stack. Every time a new low-memory record is
> > reached, a message is printed to the console.
> >
> > However, this doesn't reveal how many pages within each stack were
> > actually used. Introduce a mechanism that keeps count of the number of
> > times each of the stack's pages was reached:
> >
> >       $ grep kstack /proc/vmstat
> >       kstack_page_1 19974
> >       kstack_page_2 94
> >       kstack_page_3 0
> >       kstack_page_4 0
> >
> > In the above example, out of 20,068 threads that exited on this
> > machine, only 94 reached the second page of their stack, and none
> > touched pages three or four.
> >
> > In fleet environments with millions of machines, this data can help
> > optimize kernel stack sizes.
>
> We really should have somewhere to document vmstat things.

We really should have documentation for both the procfs and sysfs
versions of these files:

$ wc -l /proc/vmstat
177 /proc/vmstat

$ wc -l /sys/devices/system/node/node0/vmstat
61 /sys/devices/system/node/node0/vmstat

Some of the counters are shared between the two (where procfs contains
the machine-global view), and some, such as vm_event, are only part of
/proc/vmstat. All of that requires documentation somewhere under
Documentation/mm/vmstat.rst. We must explain that this is not a stable
API, as these counters depend on the kernel's internal implementation.
However, there are so many of them that it will take some effort to
write the initial explanation of all of them.

> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -153,10 +153,39 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >               VMA_LOCK_ABORT,
> >               VMA_LOCK_RETRY,
> >               VMA_LOCK_MISS,
> > +#endif
> > +#ifdef CONFIG_DEBUG_STACK_USAGE
> > +             KSTACK_PAGE_1,
> > +             KSTACK_PAGE_2,
> > +#if THREAD_SIZE >= (4 * PAGE_SIZE)
> > +             KSTACK_PAGE_3,
> > +             KSTACK_PAGE_4,
> > +#endif
> > +#if THREAD_SIZE > (4 * PAGE_SIZE)
> > +             KSTACK_PAGE_REST,
> > +#endif
> >  #endif
> >               NR_VM_EVENT_ITEMS
> >  };
>
> This seems a rather cumbersome way to produce a kind of histogram.  I
> wonder if there should be a separate pseudo file for this.

If you would like, the #if checks for stack size can be removed. I added
them so that kstack_page_3 and kstack_page_4 are not printed on order-1
stacks, where there are only two pages. This series shows the frequency of
reaching each of these pages by the existing threads.
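
To make the effect of those #if checks concrete, here is a small runtime
model of the same conditions. It is only a sketch: the function name
kstack_counters_exposed() is hypothetical, and it simply mirrors the
two preprocessor tests from the patch for a given stack and page size.

```c
#include <assert.h>

/*
 * Hypothetical model of the patch's preprocessor conditions: returns how
 * many kstack_* lines would be present in /proc/vmstat for a given
 * THREAD_SIZE and PAGE_SIZE (both in bytes).
 */
static unsigned int kstack_counters_exposed(unsigned long thread_size,
					    unsigned long page_size)
{
	unsigned int n = 2;		/* kstack_page_1, kstack_page_2 */

	/* Mirrors: #if THREAD_SIZE >= (4 * PAGE_SIZE) */
	if (thread_size >= 4 * page_size)
		n += 2;			/* kstack_page_3, kstack_page_4 */

	/* Mirrors: #if THREAD_SIZE > (4 * PAGE_SIZE) */
	if (thread_size > 4 * page_size)
		n += 1;			/* kstack_page_rest */

	return n;
}
```

With 4K pages this yields 2 counters for an 8K stack, 4 for a 16K stack,
and 5 for a 32K stack. In this model a three-page stack would also get
only two counters.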

> And there may be a call for extending this.  For example I can foresee
> people wanting to know "hey, which process did that", in which case

Which process did that is (to some extent) what CONFIG_DEBUG_STACK_USAGE
already prints when it finds a new record stack depth.
Other than that, we could use tracing to find the callers that cause
these deep stacks.

However, the information provided by this patch can help to figure out
if tracing is needed or not. For example, if it is known that the
third or fourth pages are extremely rarely used, say 0.00001% of the time,
a special API could be provided for deep stacks, saving half of the
stack memory in the fleet.

Thank you,
Pasha

Patch

diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index ccd72b978e1f..09e6874c2ced 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -95,9 +95,42 @@  static inline int object_is_on_stack(const void *obj)
 extern void thread_stack_cache_init(void);
 
 #ifdef CONFIG_DEBUG_STACK_USAGE
+#ifdef CONFIG_VM_EVENT_COUNTERS
+#include <linux/vm_event_item.h>
+
+/* Count the maximum pages reached in kernel stacks */
+static inline void count_kstack_page(int stack_max_page)
+{
+	switch (stack_max_page) {
+	case 1:
+		this_cpu_inc(vm_event_states.event[KSTACK_PAGE_1]);
+		break;
+	case 2:
+		this_cpu_inc(vm_event_states.event[KSTACK_PAGE_2]);
+		break;
+#if THREAD_SIZE >= (4 * PAGE_SIZE)
+	case 3:
+		this_cpu_inc(vm_event_states.event[KSTACK_PAGE_3]);
+		break;
+	case 4:
+		this_cpu_inc(vm_event_states.event[KSTACK_PAGE_4]);
+		break;
+#endif
+#if THREAD_SIZE > (4 * PAGE_SIZE)
+	default:
+		this_cpu_inc(vm_event_states.event[KSTACK_PAGE_REST]);
+		break;
+#endif
+	}
+}
+#else /* !CONFIG_VM_EVENT_COUNTERS */
+static inline void count_kstack_page(int stack_max_page) {}
+#endif /* CONFIG_VM_EVENT_COUNTERS */
+
 static inline unsigned long stack_not_used(struct task_struct *p)
 {
 	unsigned long *n = end_of_stack(p);
+	unsigned long unused_stack;
 
 	do { 	/* Skip over canary */
 # ifdef CONFIG_STACK_GROWSUP
@@ -108,10 +141,13 @@  static inline unsigned long stack_not_used(struct task_struct *p)
 	} while (!*n);
 
 # ifdef CONFIG_STACK_GROWSUP
-	return (unsigned long)end_of_stack(p) - (unsigned long)n;
+	unused_stack = (unsigned long)end_of_stack(p) - (unsigned long)n;
 # else
-	return (unsigned long)n - (unsigned long)end_of_stack(p);
+	unused_stack = (unsigned long)n - (unsigned long)end_of_stack(p);
 # endif
+	count_kstack_page(((THREAD_SIZE - unused_stack) >> PAGE_SHIFT) + 1);
+
+	return unused_stack;
 }
 #endif
 extern void set_task_stack_end_magic(struct task_struct *tsk);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943bc8cc2..1dbfe47ff048 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -153,10 +153,39 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		VMA_LOCK_ABORT,
 		VMA_LOCK_RETRY,
 		VMA_LOCK_MISS,
+#endif
+#ifdef CONFIG_DEBUG_STACK_USAGE
+		KSTACK_PAGE_1,
+		KSTACK_PAGE_2,
+#if THREAD_SIZE >= (4 * PAGE_SIZE)
+		KSTACK_PAGE_3,
+		KSTACK_PAGE_4,
+#endif
+#if THREAD_SIZE > (4 * PAGE_SIZE)
+		KSTACK_PAGE_REST,
+#endif
 #endif
 		NR_VM_EVENT_ITEMS
 };
 
+#ifdef CONFIG_VM_EVENT_COUNTERS
+/*
+ * Light weight per cpu counter implementation.
+ *
+ * Counters should only be incremented and no critical kernel component
+ * should rely on the counter values.
+ *
+ * Counters are handled completely inline. On many platforms the code
+ * generated will simply be the increment of a global address.
+ */
+
+struct vm_event_state {
+	unsigned long event[NR_VM_EVENT_ITEMS];
+};
+
+DECLARE_PER_CPU(struct vm_event_state, vm_event_states);
+#endif
+
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
 #define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; })
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 343906a98d6e..18d4a97d3afd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -41,22 +41,6 @@  enum writeback_stat_item {
 };
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
-/*
- * Light weight per cpu counter implementation.
- *
- * Counters should only be incremented and no critical kernel component
- * should rely on the counter values.
- *
- * Counters are handled completely inline. On many platforms the code
- * generated will simply be the increment of a global address.
- */
-
-struct vm_event_state {
-	unsigned long event[NR_VM_EVENT_ITEMS];
-};
-
-DECLARE_PER_CPU(struct vm_event_state, vm_event_states);
-
 /*
  * vm counters are allowed to be racy. Use raw_cpu_ops to avoid the
  * local_irq_disable overhead.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..737c85689251 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1413,6 +1413,17 @@  const char * const vmstat_text[] = {
 	"vma_lock_retry",
 	"vma_lock_miss",
 #endif
+#ifdef CONFIG_DEBUG_STACK_USAGE
+	"kstack_page_1",
+	"kstack_page_2",
+#if THREAD_SIZE >= (4 * PAGE_SIZE)
+	"kstack_page_3",
+	"kstack_page_4",
+#endif
+#if THREAD_SIZE > (4 * PAGE_SIZE)
+	"kstack_page_rest",
+#endif
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
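
Note on the page index computed in stack_not_used(): the expression
passed to count_kstack_page() can be modelled in userspace as below.
This is a sketch under stated assumptions, not kernel code: the
DEMO_* constants (4 KiB pages, 16 KiB stacks) and the function name
demo_stack_max_page() are made up for the illustration.

```c
#include <assert.h>

/* Assumed values for illustration only: 4 KiB pages, 16 KiB stacks. */
#define DEMO_PAGE_SHIFT		12
#define DEMO_THREAD_SIZE	(4UL << DEMO_PAGE_SHIFT)

/*
 * Mirrors ((THREAD_SIZE - unused_stack) >> PAGE_SHIFT) + 1 from the
 * patch: the deepest stack page reached, numbered from 1, given the
 * number of stack bytes that were never written.
 */
static unsigned long demo_stack_max_page(unsigned long unused_stack)
{
	return ((DEMO_THREAD_SIZE - unused_stack) >> DEMO_PAGE_SHIFT) + 1;
}
```

Under these assumptions, 100 bytes of use maps to page 1, and exactly
4096 bytes of use already maps to page 2. A fully consumed stack
(unused_stack == 0) would map to 5 in this model; whether that value can
occur in the kernel depends on the canary-skipping loop, which is only
partially visible in the hunk above.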