[RFC,2/2] mm: kmem: add direct objcg pointer to task_struct

Message ID 20221220182745.1903540-3-roman.gushchin@linux.dev (mailing list archive)
State New
Series mm: kmem: optimize obj_cgroup pointer retrieval

Commit Message

Roman Gushchin Dec. 20, 2022, 6:27 p.m. UTC
To charge a freshly allocated kernel object to a memory cgroup, the
kernel needs to obtain an objcg pointer. Currently it does this
indirectly: it obtains the memcg pointer first and then calls
__get_obj_cgroup_from_memcg().

Usually tasks spend their entire life belonging to the same object
cgroup, so it makes sense to save the objcg pointer on task_struct
directly, where it can be obtained faster. This requires some work on
the fork, exit and cgroup migration paths, but those paths are far colder.

The old indirect way is still used for remote memcg charging.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/sched.h |  4 +++
 mm/memcontrol.c       | 84 +++++++++++++++++++++++++++++++++++++------
 2 files changed, 77 insertions(+), 11 deletions(-)
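
For readers who stop at the commit message, the hot path this change introduces
boils down to the following loop (excerpted from the patch below); the retry
guards against a concurrent cgroup migration re-pointing current->objcg between
the read and the reference grab:

	if (current->objcg) {
		rcu_read_lock();
		do {
			/* re-read: migration may swap current->objcg under us */
			objcg = READ_ONCE(current->objcg);
		} while (objcg && !obj_cgroup_tryget(objcg));
		rcu_read_unlock();
	}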

Comments

Shakeel Butt Dec. 20, 2022, 8:44 p.m. UTC | #1
On Tue, Dec 20, 2022 at 10:27:45AM -0800, Roman Gushchin wrote:
> To charge a freshly allocated kernel object to a memory cgroup, the
> kernel needs to obtain an objcg pointer. Currently it does it
> indirectly by obtaining the memcg pointer first and then calling to
> __get_obj_cgroup_from_memcg().
> 
> Usually tasks spend their entire life belonging to the same object
> cgroup. So it makes sense to save the objcg pointer on task_struct
> directly, so it can be obtained faster. It requires some work on fork,
> exit and cgroup migrate paths, but these paths are way colder.
> 
> The old indirect way is still used for remote memcg charging.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

This looks good too. A few comments below:

[...]
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset)
> +{
> +	struct task_struct *task;
> +	struct cgroup_subsys_state *css;
> +
> +	cgroup_taskset_for_each(task, css, tset) {
> +		struct mem_cgroup *memcg;
> +
> +		if (task->objcg)
> +			obj_cgroup_put(task->objcg);
> +
> +		rcu_read_lock();
> +		memcg = container_of(css, struct mem_cgroup, css);
> +		task->objcg = __get_obj_cgroup_from_memcg(memcg);
> +		rcu_read_unlock();
> +	}
> +}
> +#else
> +static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) {}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
> +#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_KMEM)

I think you want CONFIG_LRU_GEN in the above check.
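
(Presumably the intended guard is the one below; this is a sketch of the
suggested correction, not code from the patch:

	#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_LRU_GEN)

i.e. build the combined attach callback whenever either sub-handler is
non-trivial.)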

>  static void mem_cgroup_attach(struct cgroup_taskset *tset)
>  {
> +	mem_cgroup_lru_gen_attach(tset);
> +	mem_cgroup_kmem_attach(tset);
>  }
> -#endif /* CONFIG_LRU_GEN */
> +#endif
>  
>  static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
>  {
> @@ -6816,9 +6872,15 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.css_reset = mem_cgroup_css_reset,
>  	.css_rstat_flush = mem_cgroup_css_rstat_flush,
>  	.can_attach = mem_cgroup_can_attach,
> +#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_KMEM)

Same here.

>  	.attach = mem_cgroup_attach,
> +#endif
>  	.cancel_attach = mem_cgroup_cancel_attach,
>  	.post_attach = mem_cgroup_move_task,
> +#ifdef CONFIG_MEMCG_KMEM
> +	.fork = mem_cgroup_fork,
> +	.exit = mem_cgroup_exit,
> +#endif
>  	.dfl_cftypes = memory_files,
>  	.legacy_cftypes = mem_cgroup_legacy_files,
>  	.early_init = 0,
> -- 
> 2.39.0
>
Michal Koutný Dec. 22, 2022, 1:50 p.m. UTC | #2
On Tue, Dec 20, 2022 at 10:27:45AM -0800, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> To charge a freshly allocated kernel object to a memory cgroup, the
> kernel needs to obtain an objcg pointer. Currently it does it
> indirectly by obtaining the memcg pointer first and then calling to
> __get_obj_cgroup_from_memcg().

Jinx [1].

You report an additional 7% improvement with this patch (focused on
allocations only). I didn't see impressive numbers (with a different benchmark
in [1]), so it looked like a micro-optimization without much benefit to me.

My 0.02€ to RFC,
Michal


[1] https://bugzilla.kernel.org/show_bug.cgi?id=216038#c5
Roman Gushchin Dec. 22, 2022, 4:21 p.m. UTC | #3
On Thu, Dec 22, 2022 at 02:50:44PM +0100, Michal Koutný wrote:
> On Tue, Dec 20, 2022 at 10:27:45AM -0800, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > To charge a freshly allocated kernel object to a memory cgroup, the
> > kernel needs to obtain an objcg pointer. Currently it does it
> > indirectly by obtaining the memcg pointer first and then calling to
> > __get_obj_cgroup_from_memcg().
> 
> Jinx [1].
> 
> You report additional 7% improvement with this patch (focused on
> allocations only). I didn't see impressive numbers (different benchmark
> in [1]), so it looked as a microoptimization without big benefit to me.

Hi Michal!

Thank you for taking a look.
Do you have any numbers to share?

In general, I agree that it's a micro-optimization, but:
1) some people periodically complain that accounted allocations are slow
   compared to non-accounted ones, and slower than they were with page-based
   accounting,
2) I don't see any particular hot spot or obviously non-optimal place on the
   allocation path.

So if we want to make it faster, we have to micro-optimize it here and there;
there is no other way. It's basically a question of how many cache lines we touch.
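
To make the cache-line argument concrete, here is a condensed side-by-side of
the two lookup paths, with the RCU and reference-counting details elided (the
full versions are in the diff at the bottom):

	/* old, indirect path (removed by the patch): */
	memcg = mem_cgroup_from_task(current);		/* current -> css -> memcg */
	objcg = __get_obj_cgroup_from_memcg(memcg);	/* memcg chain -> objcg */

	/* new, direct path (added by the patch): */
	objcg = READ_ONCE(current->objcg);		/* one pointer on task_struct */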

Btw, I'm working on a patch 3 for this series, which in early tests brings
an additional ~25% improvement in my benchmark; I hope to post it soon as
part of v1.

Thanks!
Michal Koutný Jan. 2, 2023, 4:09 p.m. UTC | #4
Hello.

On Thu, Dec 22, 2022 at 08:21:49AM -0800, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> Do you have any numbers to share?

The numbers are in bko#216038, let me explain them here a bit.
I used the will-it-scale benchmark that repeatedly locks/unlocks a file
and runs in parallel.

The final numbers were:
  sample                       metric        δ (vs no acct)   δ_cg (vs cg)
  no accounting implemented    32,307,750      0 %              n/a
  accounting in cg             24,957,700    -23 %              0 %
  accounting in cg + cache     25,164,200    -22 %             +1 %
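
(For completeness, the δ columns are plain relative differences against the
raw metric values above; my own arithmetic, rounded as in the table:
	(24957700 - 32307750) / 32307750 ~ -22.8 %  ->  -23 %   cg vs. no accounting
	(25164200 - 32307750) / 32307750 ~ -22.1 %  ->  -22 %   cg + cache vs. no accounting
	(25164200 - 24957700) / 24957700 ~  +0.8 %  ->   +1 %   cg + cache vs. cg)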

Hence my result was only a 1% improvement.

(But it was a very simple try, not delving into any of the CPU cache
statistics.)

Question: Were your measurements multi-threaded?

> 1) some people periodically complain that accounted allocations are slow
>    in comparison to non-accounted and slower than they were with page-based
>    accounting,

My result above would likely not satisfy the complainers I know about.
But if your additional changes are better, the additional code complexity
may be justified in the end.


> Btw, I'm working on a patch 3 for this series, which in early tests brings
> additional ~25% improvement in my benchmark, hopefully will post it soon as
> a part of v1.

Please send it with more details about your benchmark to put the numbers
into context.


Michal

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..e17be609cbcb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1435,6 +1435,10 @@  struct task_struct {
 	struct mem_cgroup		*active_memcg;
 #endif
 
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup		*objcg;
+#endif
+
 #ifdef CONFIG_BLK_CGROUP
 	struct request_queue		*throttle_queue;
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 82828c51d2ea..e0547b224f40 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3001,23 +3001,29 @@  static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
 __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 {
 	struct mem_cgroup *memcg;
-	struct obj_cgroup *objcg;
+	struct obj_cgroup *objcg = NULL;
 
 	if (in_task()) {
 		memcg = current->active_memcg;
-
-		/* Memcg to charge can't be determined. */
-		if (likely(!memcg) && (!current->mm || (current->flags & PF_KTHREAD)))
-			return NULL;
+		if (unlikely(memcg))
+			goto from_memcg;
+
+		if (current->objcg) {
+			rcu_read_lock();
+			do {
+				objcg = READ_ONCE(current->objcg);
+			} while (objcg && !obj_cgroup_tryget(objcg));
+			rcu_read_unlock();
+		}
 	} else {
 		memcg = this_cpu_read(int_active_memcg);
-		if (likely(!memcg))
-			return NULL;
+		if (unlikely(memcg))
+			goto from_memcg;
 	}
+	return objcg;
 
+from_memcg:
 	rcu_read_lock();
-	if (!memcg)
-		memcg = mem_cgroup_from_task(current);
 	objcg = __get_obj_cgroup_from_memcg(memcg);
 	rcu_read_unlock();
 	return objcg;
@@ -6303,6 +6309,28 @@  static void mem_cgroup_move_task(void)
 		mem_cgroup_clear_mc();
 	}
 }
+
+#ifdef CONFIG_MEMCG_KMEM
+static void mem_cgroup_fork(struct task_struct *task)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(task);
+	if (!memcg || mem_cgroup_is_root(memcg))
+		task->objcg = NULL;
+	else
+		task->objcg = __get_obj_cgroup_from_memcg(memcg);
+	rcu_read_unlock();
+}
+
+static void mem_cgroup_exit(struct task_struct *task)
+{
+	if (task->objcg)
+		obj_cgroup_put(task->objcg);
+}
+#endif
+
 #else	/* !CONFIG_MMU */
 static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
 {
@@ -6317,7 +6345,7 @@  static void mem_cgroup_move_task(void)
 #endif
 
 #ifdef CONFIG_LRU_GEN
-static void mem_cgroup_attach(struct cgroup_taskset *tset)
+static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
 	struct cgroup_subsys_state *css;
@@ -6335,10 +6363,38 @@  static void mem_cgroup_attach(struct cgroup_taskset *tset)
 	task_unlock(task);
 }
 #else
+static void mem_cgroup_lru_gen_attach(struct cgroup_taskset *tset) {}
+#endif /* CONFIG_LRU_GEN */
+
+#ifdef CONFIG_MEMCG_KMEM
+static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *css;
+
+	cgroup_taskset_for_each(task, css, tset) {
+		struct mem_cgroup *memcg;
+
+		if (task->objcg)
+			obj_cgroup_put(task->objcg);
+
+		rcu_read_lock();
+		memcg = container_of(css, struct mem_cgroup, css);
+		task->objcg = __get_obj_cgroup_from_memcg(memcg);
+		rcu_read_unlock();
+	}
+}
+#else
+static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset) {}
+#endif /* CONFIG_MEMCG_KMEM */
+
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_KMEM)
 static void mem_cgroup_attach(struct cgroup_taskset *tset)
 {
+	mem_cgroup_lru_gen_attach(tset);
+	mem_cgroup_kmem_attach(tset);
 }
-#endif /* CONFIG_LRU_GEN */
+#endif
 
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
@@ -6816,9 +6872,15 @@  struct cgroup_subsys memory_cgrp_subsys = {
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_KMEM)
 	.attach = mem_cgroup_attach,
+#endif
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
+#ifdef CONFIG_MEMCG_KMEM
+	.fork = mem_cgroup_fork,
+	.exit = mem_cgroup_exit,
+#endif
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,