mm: memcontrol: remove page_memcg()

Message ID	20240521131556.142176-1-wangkefeng.wang@huawei.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Kefeng Wang <wangkefeng.wang@huawei.com> To: Andrew Morton <akpm@linux-foundation.org> CC: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeel.butt@linux.dev>, Muchun Song <muchun.song@linux.dev>, <linux-mm@kvack.org>, <cgroups@vger.kernel.org>, Kefeng Wang <wangkefeng.wang@huawei.com> Subject: [PATCH] mm: memcontrol: remove page_memcg() Date: Tue, 21 May 2024 21:15:56 +0800 Message-ID: <20240521131556.142176-1-wangkefeng.wang@huawei.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: memcontrol: remove page_memcg() \| expand mm: memcontrol: remove page_memcg()

Kefeng Wang May 21, 2024, 1:15 p.m. UTC

The page_memcg() only called by mod_memcg_page_state(), so squash it to
cleanup page_memcg().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 include/linux/memcontrol.h | 14 ++------------
 mm/memcontrol.c            |  2 +-
 2 files changed, 3 insertions(+), 13 deletions(-)

Michal Hocko May 21, 2024, 1:30 p.m. UTC | #1

On Tue 21-05-24 21:15:56, Kefeng Wang wrote:
> The page_memcg() only called by mod_memcg_page_state(), so squash it to
> cleanup page_memcg().
> 
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/memcontrol.h | 14 ++------------
>  mm/memcontrol.c            |  2 +-
>  2 files changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 030d34e9d117..8abc70cc7219 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -443,11 +443,6 @@ static inline struct mem_cgroup *folio_memcg(struct folio *folio)
>  	return __folio_memcg(folio);
>  }
>  
> -static inline struct mem_cgroup *page_memcg(struct page *page)
> -{
> -	return folio_memcg(page_folio(page));
> -}
> -
>  /**
>   * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio.
>   * @folio: Pointer to the folio.
> @@ -1014,7 +1009,7 @@ static inline void mod_memcg_page_state(struct page *page,
>  		return;
>  
>  	rcu_read_lock();
> -	memcg = page_memcg(page);
> +	memcg = folio_memcg(page_folio(page));
>  	if (memcg)
>  		mod_memcg_state(memcg, idx, val);
>  	rcu_read_unlock();
> @@ -1133,11 +1128,6 @@ static inline struct mem_cgroup *folio_memcg(struct folio *folio)
>  	return NULL;
>  }
>  
> -static inline struct mem_cgroup *page_memcg(struct page *page)
> -{
> -	return NULL;
> -}
> -
>  static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
>  {
>  	WARN_ON_ONCE(!rcu_read_lock_held());
> @@ -1636,7 +1626,7 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
>  	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
>  }
>  
> -/* Test requires a stable page->memcg binding, see page_memcg() */
> +/* Test requires a stable page->memcg binding, see folio_memcg() */
>  static inline bool folio_matches_lruvec(struct folio *folio,
>  		struct lruvec *lruvec)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54070687aad2..72833f6f0944 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3811,7 +3811,7 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  /*
> - * Because page_memcg(head) is not set on tails, set it now.
> + * Because folio_memcg(head) is not set on tails, set it now.
>   */
>  void split_page_memcg(struct page *head, int old_order, int new_order)
>  {
> -- 
> 2.41.0

Matthew Wilcox May 21, 2024, 2:21 p.m. UTC | #2

On Tue, May 21, 2024 at 09:15:56PM +0800, Kefeng Wang wrote:
> -/* Test requires a stable page->memcg binding, see page_memcg() */
> +/* Test requires a stable page->memcg binding, see folio_memcg() */

folio->memcg, not page->memcg

Matthew Wilcox May 21, 2024, 2:44 p.m. UTC | #3

On Tue, May 21, 2024 at 09:15:56PM +0800, Kefeng Wang wrote:
> The page_memcg() only called by mod_memcg_page_state(), so squash it to
> cleanup page_memcg().

This isn't wrong, except that the entire usage of memcg is wrong in the
only two callers of mod_memcg_page_state():

$ git grep mod_memcg_page_state
include/linux/memcontrol.h:static inline void mod_memcg_page_state(struct page *page,
include/linux/memcontrol.h:static inline void mod_memcg_page_state(struct page *page,
mm/vmalloc.c:           mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
mm/vmalloc.c:                   mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);

The memcg should not be attached to the individual pages that make up a
vmalloc allocation.  Rather, it should be managed by the vmalloc
allocation itself.  I don't have the knowledge to poke around inside
vmalloc right now, but maybe somebody else could take that on.

Michal Hocko May 21, 2024, 4:03 p.m. UTC | #4

On Tue 21-05-24 15:44:21, Matthew Wilcox wrote:
> On Tue, May 21, 2024 at 09:15:56PM +0800, Kefeng Wang wrote:
> > The page_memcg() only called by mod_memcg_page_state(), so squash it to
> > cleanup page_memcg().
> 
> This isn't wrong, except that the entire usage of memcg is wrong in the
> only two callers of mod_memcg_page_state():
> 
> $ git grep mod_memcg_page_state
> include/linux/memcontrol.h:static inline void mod_memcg_page_state(struct page *page,
> include/linux/memcontrol.h:static inline void mod_memcg_page_state(struct page *page,
> mm/vmalloc.c:           mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
> mm/vmalloc.c:                   mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);
> 
> The memcg should not be attached to the individual pages that make up a
> vmalloc allocation.  Rather, it should be managed by the vmalloc
> allocation itself.  I don't have the knowledge to poke around inside
> vmalloc right now, but maybe somebody else could take that on.

This would make sense as a follow up.

Shakeel Butt May 21, 2024, 7:29 p.m. UTC | #5

On Tue, May 21, 2024 at 03:44:21PM +0100, Matthew Wilcox wrote:
> On Tue, May 21, 2024 at 09:15:56PM +0800, Kefeng Wang wrote:
> > The page_memcg() only called by mod_memcg_page_state(), so squash it to
> > cleanup page_memcg().
> 
> This isn't wrong, except that the entire usage of memcg is wrong in the
> only two callers of mod_memcg_page_state():
> 
> $ git grep mod_memcg_page_state
> include/linux/memcontrol.h:static inline void mod_memcg_page_state(struct page *page,
> include/linux/memcontrol.h:static inline void mod_memcg_page_state(struct page *page,
> mm/vmalloc.c:           mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
> mm/vmalloc.c:                   mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);
> 
> The memcg should not be attached to the individual pages that make up a
> vmalloc allocation.  Rather, it should be managed by the vmalloc
> allocation itself.  I don't have the knowledge to poke around inside
> vmalloc right now, but maybe somebody else could take that on.

Are you concerned about accessing just memcg or any field of the
sub-page? There are drivers accessing fields of pages allocated through
vmalloc. Some details at 3b8000ae185c ("mm/vmalloc: huge vmalloc backing
pages should be split rather than compound").

Kefeng Wang May 23, 2024, 8:57 a.m. UTC | #6

On 2024/5/22 3:29, Shakeel Butt wrote:
> On Tue, May 21, 2024 at 03:44:21PM +0100, Matthew Wilcox wrote:
>> On Tue, May 21, 2024 at 09:15:56PM +0800, Kefeng Wang wrote:
>>> The page_memcg() only called by mod_memcg_page_state(), so squash it to
>>> cleanup page_memcg().
>>
>> This isn't wrong, except that the entire usage of memcg is wrong in the
>> only two callers of mod_memcg_page_state():
>>
>> $ git grep mod_memcg_page_state
>> include/linux/memcontrol.h:static inline void mod_memcg_page_state(struct page *page,
>> include/linux/memcontrol.h:static inline void mod_memcg_page_state(struct page *page,
>> mm/vmalloc.c:           mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
>> mm/vmalloc.c:                   mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);
>>
>> The memcg should not be attached to the individual pages that make up a
>> vmalloc allocation.  Rather, it should be managed by the vmalloc
>> allocation itself.  I don't have the knowledge to poke around inside
>> vmalloc right now, but maybe somebody else could take that on.
> 
> Are you concerned about accessing just memcg or any field of the
> sub-page? There are drivers accessing fields of pages allocated through
> vmalloc. Some details at 3b8000ae185c ("mm/vmalloc: huge vmalloc backing
> pages should be split rather than compound").

Maybe Matthew want something shown below, move the memcg MEMCG_VMALLOC 
stat update from per-page to per-vmalloc-allocation? It should be speed 
up the statistic after conversion.

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e4a631ec430b..89f115623124 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -55,6 +55,9 @@ struct vm_struct {
  	unsigned long		size;
  	unsigned long		flags;
  	struct page		**pages;
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup	*objcg;
+#endif
  #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
  	unsigned int		page_order;
  #endif
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5d3aa2dc88a8..3e28c382f604 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3001,6 +3001,49 @@ static inline void set_vm_area_page_order(struct 
vm_struct *vm, unsigned int ord
  #endif
  }

+#ifdef CONFIG_MEMCG_KMEM
+static void vmalloc_memcg_alloc_hook(struct vm_struct *area, gfp_t gfp,
+				     int nr_pages)
+{
+	struct obj_cgroup *objcg;
+
+	if (!memcg_kmem_online() || !(gfp & __GFP_ACCOUNT))
+		return;
+
+	objcg = get_obj_cgroup_from_current();
+	if (objcg)
+		return;
+
+	area->objcg = objcg;
+
+	rcu_read_lock();
+	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_VMALLOC, nr_pages);
+	rcu_read_unlock();
+}
+
+static void vmalloc_memcg_free_hook(struct vm_struct *area)
+{
+	struct obj_cgroup *objcg = area->objcg;
+
+	if (!objcg)
+		return;
+
+	rcu_read_lock();
+	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_VMALLOC, -area->nr_pages);
+	rcu_read_unlock();
+
+	obj_cgroup_put(objcg);
+}
+#else
+static void vmalloc_memcg_alloc_hook(struct vm_struct *area, gfp_t gfp,
+				     int nr_pages)
+{
+}
+static void vmalloc_memcg_free_hook(struct vm_struct *area)
+{
+}
+#endif
+
  /**
   * vm_area_add_early - add vmap area early during boot
   * @vm: vm_struct to add
@@ -3338,7 +3381,6 @@ void vfree(const void *addr)
  		struct page *page = vm->pages[i];

  		BUG_ON(!page);
-		mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
  		/*
  		 * High-order allocs for huge vmallocs are split, so
  		 * can be freed as an array of order-0 allocations
@@ -3347,6 +3389,7 @@ void vfree(const void *addr)
  		cond_resched();
  	}
  	atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
+	vmalloc_memcg_free_hook(vm);
  	kvfree(vm->pages);
  	kfree(vm);
  }
@@ -3643,12 +3686,7 @@ static void *__vmalloc_area_node(struct vm_struct 
*area, gfp_t gfp_mask,
  		node, page_order, nr_small_pages, area->pages);

  	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
-	if (gfp_mask & __GFP_ACCOUNT) {
-		int i;
-
-		for (i = 0; i < area->nr_pages; i++)
-			mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);
-	}
+	vmalloc_memcg_alloc_hook(area, gfp_mask, area->nr_pages);

  	/*
  	 * If not enough pages were obtained to accomplish an

Kefeng Wang May 23, 2024, 9:43 a.m. UTC | #7

On 2024/5/21 22:21, Matthew Wilcox wrote:
> On Tue, May 21, 2024 at 09:15:56PM +0800, Kefeng Wang wrote:
>> -/* Test requires a stable page->memcg binding, see page_memcg() */
>> +/* Test requires a stable page->memcg binding, see folio_memcg() */
> 
> folio->memcg, not page->memcg

OK, will fix.

Matthew Wilcox May 23, 2024, 1:31 p.m. UTC | #8

On Tue, May 21, 2024 at 12:29:39PM -0700, Shakeel Butt wrote:
> On Tue, May 21, 2024 at 03:44:21PM +0100, Matthew Wilcox wrote:
> > The memcg should not be attached to the individual pages that make up a
> > vmalloc allocation.  Rather, it should be managed by the vmalloc
> > allocation itself.  I don't have the knowledge to poke around inside
> > vmalloc right now, but maybe somebody else could take that on.
> 
> Are you concerned about accessing just memcg or any field of the
> sub-page? There are drivers accessing fields of pages allocated through
> vmalloc. Some details at 3b8000ae185c ("mm/vmalloc: huge vmalloc backing
> pages should be split rather than compound").

Thanks for the pointer, and fb_deferred_io_fault() is already on my
hitlist for abusing struct page.

My primary concern is that we should track the entire allocation as a
single object rather than tracking each page individually.  That means
assigning the vmalloc allocation to a memcg rather than assigning each
page to a memcg.  It's a lot less overhead to increment the counter once
per allocation rather than once per page in the allocation!

But secondarily, yes, pages allocated by vmalloc probably don't need
any per-page state, other than tracking the vmalloc allocation they're
assigned to.  We'll see how that theory turns out.

Shakeel Butt May 23, 2024, 3:41 p.m. UTC | #9

On Thu, May 23, 2024 at 02:31:05PM +0100, Matthew Wilcox wrote:
> On Tue, May 21, 2024 at 12:29:39PM -0700, Shakeel Butt wrote:
> > On Tue, May 21, 2024 at 03:44:21PM +0100, Matthew Wilcox wrote:
> > > The memcg should not be attached to the individual pages that make up a
> > > vmalloc allocation.  Rather, it should be managed by the vmalloc
> > > allocation itself.  I don't have the knowledge to poke around inside
> > > vmalloc right now, but maybe somebody else could take that on.
> > 
> > Are you concerned about accessing just memcg or any field of the
> > sub-page? There are drivers accessing fields of pages allocated through
> > vmalloc. Some details at 3b8000ae185c ("mm/vmalloc: huge vmalloc backing
> > pages should be split rather than compound").
> 
> Thanks for the pointer, and fb_deferred_io_fault() is already on my
> hitlist for abusing struct page.
> 
> My primary concern is that we should track the entire allocation as a
> single object rather than tracking each page individually.  That means
> assigning the vmalloc allocation to a memcg rather than assigning each
> page to a memcg.  It's a lot less overhead to increment the counter once
> per allocation rather than once per page in the allocation!
> 
> But secondarily, yes, pages allocated by vmalloc probably don't need
> any per-page state, other than tracking the vmalloc allocation they're
> assigned to.  We'll see how that theory turns out.

I think the tricky part would be vmalloc having pages spanning multiple
nodes which is not an issue for MEMCG_VMALLOC stat but the vmap based
kernel stack (CONFIG_VMAP_STACK) metric NR_KERNEL_STACK_KB cares about
that information.

Matthew Wilcox May 23, 2024, 4:34 p.m. UTC | #10

On Thu, May 23, 2024 at 08:41:25AM -0700, Shakeel Butt wrote:
> On Thu, May 23, 2024 at 02:31:05PM +0100, Matthew Wilcox wrote:
> > On Tue, May 21, 2024 at 12:29:39PM -0700, Shakeel Butt wrote:
> > > On Tue, May 21, 2024 at 03:44:21PM +0100, Matthew Wilcox wrote:
> > > > The memcg should not be attached to the individual pages that make up a
> > > > vmalloc allocation.  Rather, it should be managed by the vmalloc
> > > > allocation itself.  I don't have the knowledge to poke around inside
> > > > vmalloc right now, but maybe somebody else could take that on.
> > > 
> > > Are you concerned about accessing just memcg or any field of the
> > > sub-page? There are drivers accessing fields of pages allocated through
> > > vmalloc. Some details at 3b8000ae185c ("mm/vmalloc: huge vmalloc backing
> > > pages should be split rather than compound").
> > 
> > Thanks for the pointer, and fb_deferred_io_fault() is already on my
> > hitlist for abusing struct page.
> > 
> > My primary concern is that we should track the entire allocation as a
> > single object rather than tracking each page individually.  That means
> > assigning the vmalloc allocation to a memcg rather than assigning each
> > page to a memcg.  It's a lot less overhead to increment the counter once
> > per allocation rather than once per page in the allocation!
> > 
> > But secondarily, yes, pages allocated by vmalloc probably don't need
> > any per-page state, other than tracking the vmalloc allocation they're
> > assigned to.  We'll see how that theory turns out.
> 
> I think the tricky part would be vmalloc having pages spanning multiple
> nodes which is not an issue for MEMCG_VMALLOC stat but the vmap based
> kernel stack (CONFIG_VMAP_STACK) metric NR_KERNEL_STACK_KB cares about
> that information.

Yes, we'll have to handle mod_lruvec_page_state() differently since that
stat is tracked per node.  Or we could stop tracking that stat per node.
Is it useful to track it per node?  Why is it useful to track kernel
stacks per node, but not track vmalloc allocations per node?

Shakeel Butt May 31, 2024, 10:51 p.m. UTC | #11

On Thu, May 23, 2024 at 05:34:27PM GMT, Matthew Wilcox wrote:
> On Thu, May 23, 2024 at 08:41:25AM -0700, Shakeel Butt wrote:
> > On Thu, May 23, 2024 at 02:31:05PM +0100, Matthew Wilcox wrote:
> > > On Tue, May 21, 2024 at 12:29:39PM -0700, Shakeel Butt wrote:
> > > > On Tue, May 21, 2024 at 03:44:21PM +0100, Matthew Wilcox wrote:
> > > > > The memcg should not be attached to the individual pages that make up a
> > > > > vmalloc allocation.  Rather, it should be managed by the vmalloc
> > > > > allocation itself.  I don't have the knowledge to poke around inside
> > > > > vmalloc right now, but maybe somebody else could take that on.
> > > > 
> > > > Are you concerned about accessing just memcg or any field of the
> > > > sub-page? There are drivers accessing fields of pages allocated through
> > > > vmalloc. Some details at 3b8000ae185c ("mm/vmalloc: huge vmalloc backing
> > > > pages should be split rather than compound").
> > > 
> > > Thanks for the pointer, and fb_deferred_io_fault() is already on my
> > > hitlist for abusing struct page.
> > > 
> > > My primary concern is that we should track the entire allocation as a
> > > single object rather than tracking each page individually.  That means
> > > assigning the vmalloc allocation to a memcg rather than assigning each
> > > page to a memcg.  It's a lot less overhead to increment the counter once
> > > per allocation rather than once per page in the allocation!
> > > 
> > > But secondarily, yes, pages allocated by vmalloc probably don't need
> > > any per-page state, other than tracking the vmalloc allocation they're
> > > assigned to.  We'll see how that theory turns out.
> > 
> > I think the tricky part would be vmalloc having pages spanning multiple
> > nodes which is not an issue for MEMCG_VMALLOC stat but the vmap based
> > kernel stack (CONFIG_VMAP_STACK) metric NR_KERNEL_STACK_KB cares about
> > that information.
> 
> Yes, we'll have to handle mod_lruvec_page_state() differently since that
> stat is tracked per node.  Or we could stop tracking that stat per node.
> Is it useful to track it per node?  Why is it useful to track kernel
> stacks per node, but not track vmalloc allocations per node?

This is a good question and other than that there are user visible APIs
(per numa meminfo & memory.numa_stat), I don't have a good answer.

mm: memcontrol: remove page_memcg()

Commit Message

Comments

Patch