Message ID | 20241211043252.3295947-2-willy@infradead.org (mailing list archive)
---|---
State | New
Series | [1/2] vmalloc: Fix accounting of VmallocUsed with i915
On Wed, Dec 11, 2024 at 04:32:50AM +0000, Matthew Wilcox (Oracle) wrote:
[...]
> +int obj_cgroup_charge_vmalloc(struct obj_cgroup **objcgp,
> +		unsigned int nr_pages, gfp_t gfp)
> +{
> +	struct obj_cgroup *objcg;
> +	int err;
> +
> +	if (mem_cgroup_disabled() || !(gfp & __GFP_ACCOUNT))
> +		return 0;
> +
> +	objcg = current_obj_cgroup();
> +	if (!objcg)
> +		return 0;
> +
> +	err = obj_cgroup_charge_pages(objcg, gfp, nr_pages);
> +	if (err)
> +		return err;
> +	obj_cgroup_get(objcg);
> +	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_VMALLOC, nr_pages);

obj_cgroup_memcg() needs to be called within an RCU read-side critical
section. See MEMCG_PERCPU_B for an example.

> +	*objcgp = objcg;
> +
> +	return 0;
> +}
> +
> +/**
> + * obj_cgroup_uncharge_vmalloc - Uncharge vmalloc memory
> + * @objcg: The object cgroup
> + * @nr_pages: Number of pages
> + */
> +void obj_cgroup_uncharge_vmalloc(struct obj_cgroup *objcg,
> +		unsigned int nr_pages)
> +{
> +	if (!objcg)
> +		return;
> +	mod_memcg_state(objcg->memcg, MEMCG_VMALLOC, 0L - nr_pages);

Please use obj_cgroup_memcg() here instead of objcg->memcg, and call it
within the RCU read lock as well.

Overall the patch looks good.
On Wed, Dec 11, 2024 at 04:32:50AM +0000, Matthew Wilcox (Oracle) wrote:
> Today we account each page individually to the memcg, which works well
> enough, if a little inefficiently (N atomic operations per page instead
> of N per allocation). Unfortunately, the stats can get out of sync when
> i915 calls vmap() with VM_MAP_PUT_PAGES. The pages being passed were not
> allocated by vmalloc, so the MEMCG_VMALLOC counter was never incremented.
> But it is decremented when the pages are freed with vfree().
>
> Solve all of this by tracking the memcg at the vm_struct level.
> This logic has to live in the memcontrol file as it calls several
> functions which are currently static.
>
> Fixes: b944afc9d64d ("mm: add a VM_MAP_PUT_PAGES flag for vmap")
> Cc: stable@vger.kernel.org
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/memcontrol.h |  7 ++++++
>  include/linux/vmalloc.h    |  3 +++
>  mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++
>  mm/vmalloc.c               | 14 ++++++------
>  4 files changed, 63 insertions(+), 7 deletions(-)

This would work, but it seems somewhat complicated. The atomics in
memcg charging and the vmstat updates are batched, and the per-page
overhead is for the most part cheap per-cpu ops. Not an issue per se.

You could do for MEMCG_VMALLOC what you did for nr_vmalloc_pages:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 634162271c00..a889bb04405c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3353,7 +3353,11 @@ void vfree(const void *addr)
 		struct page *page = vm->pages[i];
 
 		BUG_ON(!page);
-		mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
+
+		/* Pages were allocated elsewhere */
+		if (!(vm->flags & VM_MAP_PUT_PAGES))
+			mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
+
 		/*
 		 * High-order allocs for huge vmallocs are split, so
 		 * can be freed as an array of order-0 allocations
On Wed, Dec 11, 2024 at 11:09:56AM -0500, Johannes Weiner wrote:
> This would work, but it seems somewhat complicated. The atomics in
> memcg charging and the vmstat updates are batched, and the per-page
> overhead is for the most part cheap per-cpu ops. Not an issue per se.

OK, fair enough, I hadn't realised it was a percpu-refcount. Still,
we might consume several batches (batch size of 64) when we could do it
all in one shot.

Perhaps you'd be more persuaded by:

(a) If we clear __GFP_ACCOUNT then alloc_pages_bulk() will work, and
that's a pretty significant performance win over calling alloc_pages()
in a loop.

(b) Once we get to memdescs, calling alloc_pages() with __GFP_ACCOUNT
set is going to require allocating a memdesc to store the obj_cgroup
in, so in the future we'll save an allocation.

Your proposed alternative will work and is way less churn. But it's
not preparing us for memdescs ;-)
On Wed, Dec 11, 2024 at 04:50:39PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 11, 2024 at 11:09:56AM -0500, Johannes Weiner wrote:
> > This would work, but it seems somewhat complicated. The atomics in
> > memcg charging and the vmstat updates are batched, and the per-page
> > overhead is for the most part cheap per-cpu ops. Not an issue per se.
>
> OK, fair enough, I hadn't realised it was a percpu-refcount. Still,
> we might consume several batches (batch size of 64) when we could do it
> all in one shot.
>
> Perhaps you'd be more persuaded by:
>
> (a) If we clear __GFP_ACCOUNT then alloc_pages_bulk() will work, and
> that's a pretty significant performance win over calling alloc_pages()
> in a loop.
>
> (b) Once we get to memdescs, calling alloc_pages() with __GFP_ACCOUNT
> set is going to require allocating a memdesc to store the obj_cgroup
> in, so in the future we'll save an allocation.
>
> Your proposed alternative will work and is way less churn. But it's
> not preparing us for memdescs ;-)

We can make alloc_pages_bulk() work with __GFP_ACCOUNT, but your second
argument is more compelling.

I am trying to think of what we will miss if we remove this per-page
memcg metadata. One thing I can think of is debugging a live system or
a kdump, where I need to track where a given page came from. I think
memory profiling will still be useful in combination with going through
all vmalloc regions where this page is mapped (is there an easy way to
tell if a page is from a vmalloc region?). So, for now, I think we will
have an alternative way to extract the useful information.

I think we can go with Johannes' solution for stable and discuss the
future direction more separately.
On Wed, Dec 11, 2024 at 11:32:13AM -0800, Shakeel Butt wrote:
> On Wed, Dec 11, 2024 at 04:50:39PM +0000, Matthew Wilcox wrote:
> > Perhaps you'd be more persuaded by:
> >
> > (a) If we clear __GFP_ACCOUNT then alloc_pages_bulk() will work, and
> > that's a pretty significant performance win over calling alloc_pages()
> > in a loop.
> >
> > (b) Once we get to memdescs, calling alloc_pages() with __GFP_ACCOUNT
> > set is going to require allocating a memdesc to store the obj_cgroup
> > in, so in the future we'll save an allocation.
> >
> > Your proposed alternative will work and is way less churn. But it's
> > not preparing us for memdescs ;-)
>
> We can make alloc_pages_bulk() work with __GFP_ACCOUNT, but your second
> argument is more compelling.
>
> I am trying to think of what we will miss if we remove this per-page
> memcg metadata. One thing I can think of is debugging a live system or
> a kdump, where I need to track where a given page came from. I think

Umm, I don't think you know which vmalloc allocation a page came from
today? I've sent patches to add that information before, but they were
rejected. In fact, I don't think we know even _that_ a page belongs to
vmalloc today, do we? Yes, we know that the page is accounted, and
which memcg it belongs to ... but nothing more.

I actually want to improve this, without adding additional overhead.
What I'm working on right now (before I got waylaid by this bug) is:

+struct choir {
+	struct kref refcount;
+	unsigned int nr;
+	struct page *pages[] __counted_by(nr);
+};

and rewriting vmalloc to be based on choirs instead of its own pages.
One thing I've come to realise today is that the obj_cgroup pointer
needs to be in the choir and not in the vm_struct so that we uncharge the
allocation when the choir refcount drops to 0, not when the allocation
is unmapped.

A regular choir allocation will (today) mark the pages in it as being
allocated to a choir (and thus not having their own refcount / mapcount),
but I'll give vmalloc a way to mark the pages as specifically being
from vmalloc. There's a lot of moving parts to this ... it's proving
quite tricky!

> I think we can go with Johannes' solution for stable and discuss the
> future direction more separately.

OK, I'll send a patch to do that.
On Wed, Dec 11, 2024 at 08:20:36PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 11, 2024 at 11:32:13AM -0800, Shakeel Butt wrote:
> > On Wed, Dec 11, 2024 at 04:50:39PM +0000, Matthew Wilcox wrote:
> > > Perhaps you'd be more persuaded by:
> > >
> > > (a) If we clear __GFP_ACCOUNT then alloc_pages_bulk() will work, and
> > > that's a pretty significant performance win over calling alloc_pages()
> > > in a loop.
> > >
> > > (b) Once we get to memdescs, calling alloc_pages() with __GFP_ACCOUNT
> > > set is going to require allocating a memdesc to store the obj_cgroup
> > > in, so in the future we'll save an allocation.
> > >
> > > Your proposed alternative will work and is way less churn. But it's
> > > not preparing us for memdescs ;-)
> >
> > We can make alloc_pages_bulk() work with __GFP_ACCOUNT, but your second
> > argument is more compelling.
> >
> > I am trying to think of what we will miss if we remove this per-page
> > memcg metadata. One thing I can think of is debugging a live system or
> > a kdump, where I need to track where a given page came from. I think
>
> Umm, I don't think you know which vmalloc allocation a page came from
> today? I've sent patches to add that information before, but they were
> rejected.

Do you have a link handy for that discussion?

> In fact, I don't think we know even _that_ a page belongs to
> vmalloc today, do we? Yes, we know that the page is accounted, and
> which memcg it belongs to ... but nothing more.

Yes, you are correct. At the moment it is guesswork and an exhaustive
search across multiple sources.

> I actually want to improve this, without adding additional overhead.
> What I'm working on right now (before I got waylaid by this bug) is:
>
> +struct choir {
> +	struct kref refcount;
> +	unsigned int nr;
> +	struct page *pages[] __counted_by(nr);
> +};
>
> and rewriting vmalloc to be based on choirs instead of its own pages.
> One thing I've come to realise today is that the obj_cgroup pointer
> needs to be in the choir and not in the vm_struct so that we uncharge the
> allocation when the choir refcount drops to 0, not when the allocation
> is unmapped.

What/who else can take a reference on a choir?

> A regular choir allocation will (today) mark the pages in it as being
> allocated to a choir (and thus not having their own refcount / mapcount),
> but I'll give vmalloc a way to mark the pages as specifically being
> from vmalloc.

This sounds good. Thanks for the awesome work.
On Wed, Dec 11, 2024 at 12:58:36PM -0800, Shakeel Butt wrote:
> On Wed, Dec 11, 2024 at 08:20:36PM +0000, Matthew Wilcox wrote:
> > Umm, I don't think you know which vmalloc allocation a page came from
> > today? I've sent patches to add that information before, but they were
> > rejected.
>
> Do you have a link handy for that discussion?

It's not really relevant any more ...

https://lore.kernel.org/linux-mm/20180518194519.3820-18-willy@infradead.org/

and subsequent discussion:

https://lore.kernel.org/linux-mm/20180611121129.GB12912@bombadil.infradead.org/

It all predates memdesc.

> Yes, you are correct. At the moment it is guesswork and an exhaustive
> search across multiple sources.

At least I should be able to narrow it down somewhat if we have a
PGTY_vmalloc.

> > I actually want to improve this, without adding additional overhead.
> > What I'm working on right now (before I got waylaid by this bug) is:
> >
> > +struct choir {
> > +	struct kref refcount;
> > +	unsigned int nr;
> > +	struct page *pages[] __counted_by(nr);
> > +};
> >
> > and rewriting vmalloc to be based on choirs instead of its own pages.
> > One thing I've come to realise today is that the obj_cgroup pointer
> > needs to be in the choir and not in the vm_struct so that we uncharge the
> > allocation when the choir refcount drops to 0, not when the allocation
> > is unmapped.
>
> What/who else can take a reference on a choir?

The first user is remap_vmalloc_range(), which today takes a
mapcount+refcount on the page underlying the vmalloc inside
vm_insert_page(). But I see choirs being useful more widely; for
example in the XFS buffer cache (which wouldn't be mappable to
userspace). They might even be the right approach for zswap, replacing
zpdesc. Haven't looked into that in enough detail yet.
Hi Matthew,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.13-rc2 next-20241211]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox-Oracle/vmalloc-Account-memcg-per-vmalloc/20241211-123433
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241211043252.3295947-2-willy%40infradead.org
patch subject: [PATCH 2/2] vmalloc: Account memcg per vmalloc
config: s390-randconfig-001-20241212 (https://download.01.org/0day-ci/archive/20241212/202412120605.mggOamQB-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241212/202412120605.mggOamQB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412120605.mggOamQB-lkp@intel.com/

All errors (new ones prefixed by >>):

   s390x-linux-ld: mm/vmalloc.o: in function `vfree':
>> vmalloc.c:(.text+0xb040): undefined reference to `obj_cgroup_uncharge_vmalloc'
   s390x-linux-ld: mm/vmalloc.o: in function `__vmalloc_node_range_noprof':
>> vmalloc.c:(.text+0xc5ec): undefined reference to `obj_cgroup_charge_vmalloc'
Hi Matthew,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.13-rc2 next-20241211]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox-Oracle/vmalloc-Account-memcg-per-vmalloc/20241211-123433
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241211043252.3295947-2-willy%40infradead.org
patch subject: [PATCH 2/2] vmalloc: Account memcg per vmalloc
config: sparc64-randconfig-001-20241212 (https://download.01.org/0day-ci/archive/20241212/202412120722.YWlWwpqk-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241212/202412120722.YWlWwpqk-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412120722.YWlWwpqk-lkp@intel.com/

All errors (new ones prefixed by >>):

   sparc64-linux-ld: mm/vmalloc.o: in function `vfree':
>> mm/vmalloc.c:3386:(.text+0x6bbc): undefined reference to `obj_cgroup_uncharge_vmalloc'
   sparc64-linux-ld: mm/vmalloc.o: in function `__vmalloc_area_node':
>> mm/vmalloc.c:3676:(.text+0x6f94): undefined reference to `obj_cgroup_charge_vmalloc'

vim +3386 mm/vmalloc.c

  3329	
  3330	/**
  3331	 * vfree - Release memory allocated by vmalloc()
  3332	 * @addr:  Memory base address
  3333	 *
  3334	 * Free the virtually continuous memory area starting at @addr, as obtained
  3335	 * from one of the vmalloc() family of APIs. This will usually also free the
  3336	 * physical memory underlying the virtual allocation, but that memory is
  3337	 * reference counted, so it will not be freed until the last user goes away.
  3338	 *
  3339	 * If @addr is NULL, no operation is performed.
  3340	 *
  3341	 * Context:
  3342	 * May sleep if called *not* from interrupt context.
  3343	 * Must not be called in NMI context (strictly speaking, it could be
  3344	 * if we have CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, but making the calling
  3345	 * conventions for vfree() arch-dependent would be a really bad idea).
  3346	 */
  3347	void vfree(const void *addr)
  3348	{
  3349		struct vm_struct *vm;
  3350		int i;
  3351	
  3352		if (unlikely(in_interrupt())) {
  3353			vfree_atomic(addr);
  3354			return;
  3355		}
  3356	
  3357		BUG_ON(in_nmi());
  3358		kmemleak_free(addr);
  3359		might_sleep();
  3360	
  3361		if (!addr)
  3362			return;
  3363	
  3364		vm = remove_vm_area(addr);
  3365		if (unlikely(!vm)) {
  3366			WARN(1, KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n",
  3367					addr);
  3368			return;
  3369		}
  3370	
  3371		if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
  3372			vm_reset_perms(vm);
  3373		for (i = 0; i < vm->nr_pages; i++) {
  3374			struct page *page = vm->pages[i];
  3375	
  3376			BUG_ON(!page);
  3377			/*
  3378			 * High-order allocs for huge vmallocs are split, so
  3379			 * can be freed as an array of order-0 allocations
  3380			 */
  3381			__free_page(page);
  3382			cond_resched();
  3383		}
  3384		if (!(vm->flags & VM_MAP_PUT_PAGES))
  3385			atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
> 3386		obj_cgroup_uncharge_vmalloc(vm->objcg, vm->nr_pages);
  3387		kvfree(vm->pages);
  3388		kfree(vm);
  3389	}
  3390	EXPORT_SYMBOL(vfree);
  3391	
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5502aa8e138e..83ebcadebba6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1676,6 +1676,10 @@ static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+int obj_cgroup_charge_vmalloc(struct obj_cgroup **objcgp,
+		unsigned int nr_pages, gfp_t gfp);
+void obj_cgroup_uncharge_vmalloc(struct obj_cgroup *objcgp,
+		unsigned int nr_pages);
 
 extern struct static_key_false memcg_bpf_enabled_key;
 static inline bool memcg_bpf_enabled(void)
@@ -1756,6 +1760,9 @@ static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 }
 
+/* Must be macros to avoid dereferencing objcg in vm_struct */
+#define obj_cgroup_charge_vmalloc(objcgp, nr_pages, gfp) 0
+#define obj_cgroup_uncharge_vmalloc(objcg, nr_pages) do { } while (0)
 static inline struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
 	return NULL;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e3..ec7c2d607382 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -60,6 +60,9 @@ struct vm_struct {
 #endif
 	unsigned int		nr_pages;
 	phys_addr_t		phys_addr;
+#ifdef CONFIG_MEMCG
+	struct obj_cgroup	*objcg;
+#endif
 	const void		*caller;
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..629bffc3e26d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5472,4 +5472,50 @@ static int __init mem_cgroup_swap_init(void)
 }
 subsys_initcall(mem_cgroup_swap_init);
 
+/**
+ * obj_cgroup_charge_vmalloc - Charge vmalloc memory
+ * @objcgp: Pointer to an object cgroup
+ * @nr_pages: Number of pages
+ * @gfp: Memory allocation flags
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int obj_cgroup_charge_vmalloc(struct obj_cgroup **objcgp,
+		unsigned int nr_pages, gfp_t gfp)
+{
+	struct obj_cgroup *objcg;
+	int err;
+
+	if (mem_cgroup_disabled() || !(gfp & __GFP_ACCOUNT))
+		return 0;
+
+	objcg = current_obj_cgroup();
+	if (!objcg)
+		return 0;
+
+	err = obj_cgroup_charge_pages(objcg, gfp, nr_pages);
+	if (err)
+		return err;
+	obj_cgroup_get(objcg);
+	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_VMALLOC, nr_pages);
+	*objcgp = objcg;
+
+	return 0;
+}
+
+/**
+ * obj_cgroup_uncharge_vmalloc - Uncharge vmalloc memory
+ * @objcg: The object cgroup
+ * @nr_pages: Number of pages
+ */
+void obj_cgroup_uncharge_vmalloc(struct obj_cgroup *objcg,
+		unsigned int nr_pages)
+{
+	if (!objcg)
+		return;
+	mod_memcg_state(objcg->memcg, MEMCG_VMALLOC, 0L - nr_pages);
+	obj_cgroup_uncharge_pages(objcg, nr_pages);
+	obj_cgroup_put(objcg);
+}
+
 #endif /* CONFIG_SWAP */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index bc9c91f3b373..d5e9068d9091 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3374,7 +3374,6 @@ void vfree(const void *addr)
 		struct page *page = vm->pages[i];
 
 		BUG_ON(!page);
-		mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
 		/*
 		 * High-order allocs for huge vmallocs are split, so
 		 * can be freed as an array of order-0 allocations
@@ -3384,6 +3383,7 @@ void vfree(const void *addr)
 	}
 	if (!(vm->flags & VM_MAP_PUT_PAGES))
 		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
+	obj_cgroup_uncharge_vmalloc(vm->objcg, vm->nr_pages);
 	kvfree(vm->pages);
 	kfree(vm);
 }
@@ -3537,6 +3537,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	struct page *page;
 	int i;
 
+	/* Accounting handled in caller */
+	gfp &= ~__GFP_ACCOUNT;
+
 	/*
 	 * For order-0 pages we make use of bulk allocator, if
 	 * the page array is partly or not at all populated due
@@ -3670,12 +3673,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			node, page_order, nr_small_pages, area->pages);
 
 	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
-	if (gfp_mask & __GFP_ACCOUNT) {
-		int i;
-
-		for (i = 0; i < area->nr_pages; i++)
-			mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);
-	}
+	ret = obj_cgroup_charge_vmalloc(&area->objcg, gfp_mask, area->nr_pages);
+	if (ret)
+		goto fail;
 
 	/*
 	 * If not enough pages were obtained to accomplish an
Today we account each page individually to the memcg, which works well
enough, if a little inefficiently (N atomic operations per page instead
of N per allocation). Unfortunately, the stats can get out of sync when
i915 calls vmap() with VM_MAP_PUT_PAGES. The pages being passed were not
allocated by vmalloc, so the MEMCG_VMALLOC counter was never incremented.
But it is decremented when the pages are freed with vfree().

Solve all of this by tracking the memcg at the vm_struct level.
This logic has to live in the memcontrol file as it calls several
functions which are currently static.

Fixes: b944afc9d64d ("mm: add a VM_MAP_PUT_PAGES flag for vmap")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/memcontrol.h |  7 ++++++
 include/linux/vmalloc.h    |  3 +++
 mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++
 mm/vmalloc.c               | 14 ++++++------
 4 files changed, 63 insertions(+), 7 deletions(-)