| Message ID | 20211108211959.1750915-2-almasrymina@google.com (mailing list archive) |
|---|---|
| State | New |
| Series | [v1,1/5] mm/shmem: support deterministic charging of tmpfs |
On Mon, Nov 08, 2021 at 01:19:55PM -0800, Mina Almasry wrote:
> Add memcg= option to shmem mount.
>
> Users can specify this option at mount time and all data page charges
> will be charged to the memcg supplied.
.....
> +/*
> + * Returns the memcg to charge for inode pages. If non-NULL is returned, caller
> + * must drop reference with css_put(). NULL indicates that the inode does not
> + * have a memcg to charge, so the default process based policy should be used.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_mapping_get_charge_target(struct address_space *mapping)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (!mapping)
> +		return NULL;
> +
> +	rcu_read_lock();
> +	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);

Anything doing pointer chasing to obtain static, unchanging
superblock state is poorly implemented. The s_memcg_to_charge value
never changes, so this code should associate the memcg to charge
directly on the mapping when the mapping is first initialised by the
filesystem. We already do this with things like attaching address
space ops and mapping-specific gfp masks (i.e.
mapping_set_gfp_mask()), so this association should be set up that
way, too (e.g. mapping_set_memcg_to_charge()).

And because this memcg pointer is static and unchanging for the
entire life of the superblock, the superblock *must* pin the memcg
in memory, and that means we can elide the RCU locking altogether in
the fast path for all filesystems that don't support this
functionality. I.e. we can simply do:

	if (!mapping || !mapping->memcg_to_charge)
		return NULL;

And then, only if there is a memcg to charge, we do the slow path
locking and lookup stuff...

Cheers,

Dave.
On Tue, Nov 09, 2021 at 09:10:47AM +1100, Dave Chinner wrote:
> > +	rcu_read_lock();
> > +	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
>
> Anything doing pointer chasing to obtain static, unchanging
> superblock state is poorly implemented. The s_memcg_to_charge value
> never changes, so this code should associate the memcg to charge
> directly on the mapping when the mapping is first initialised by the
> filesystem. We already do this with things like attaching address
> space ops and mapping specific gfp masks (i.e
> mapping_set_gfp_mask()), so this association should be set up that
> way, too (e.g. mapping_set_memcg_to_charge()).

I'm not a fan of enlarging struct address_space with another pointer
unless it's going to be used by all/most filesystems. If this is
destined to be a shmem-only feature, then it should be in the
shmem_inode instead of the mapping.

If we are to have this for all filesystems, then let's do that
properly and make it generic functionality from its introduction.
On Mon, Nov 08, 2021 at 01:19:55PM -0800, Mina Almasry wrote: > Add memcg= option to shmem mount. > > Users can specify this option at mount time and all data page charges > will be charged to the memcg supplied. > > Signed-off-by: Mina Almasry <almasrymina@google.com> Hello, Mina! Overall I think it's a useful option and I do see it being used outside of tmpfs usecase. In certain cases a user might want to use memory cgroups to limit the amount of pagecache and something like what you've suggested might be useful. I agree with Michal that it opens some hard questions about certain corner cases, but I don't think there any show stoppers (at least as now). > > Cc: Michal Hocko <mhocko@suse.com> > Cc: Theodore Ts'o <tytso@mit.edu> > Cc: Greg Thelen <gthelen@google.com> > Cc: Shakeel Butt <shakeelb@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Hugh Dickins <hughd@google.com> > Cc: Roman Gushchin <songmuchun@bytedance.com> This is clearly not my email. > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Hugh Dickins <hughd@google.com> > Cc: Tejun Heo <tj@kernel.org> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com> > Cc: Muchun Song <songmuchun@bytedance.com> > Cc: riel@surriel.com > Cc: linux-mm@kvack.org > Cc: linux-fsdevel@vger.kernel.org > Cc: cgroups@vger.kernel.org > > --- > fs/super.c | 3 ++ > include/linux/fs.h | 5 ++ > include/linux/memcontrol.h | 46 ++++++++++++++-- > mm/filemap.c | 2 +- > mm/memcontrol.c | 104 ++++++++++++++++++++++++++++++++++++- > mm/shmem.c | 50 +++++++++++++++++- > 6 files changed, 201 insertions(+), 9 deletions(-) > > diff --git a/fs/super.c b/fs/super.c > index 3bfc0f8fbd5bc..8aafe5e4e6200 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -24,6 +24,7 @@ > #include <linux/export.h> > #include <linux/slab.h> > #include <linux/blkdev.h> > +#include <linux/memcontrol.h> > #include <linux/mount.h> > #include <linux/security.h> > #include <linux/writeback.h> /* for the emergency remount stuff */ > @@ -180,6 +181,7 @@ static 
void destroy_unused_super(struct super_block *s) > up_write(&s->s_umount); > list_lru_destroy(&s->s_dentry_lru); > list_lru_destroy(&s->s_inode_lru); > + mem_cgroup_set_charge_target(&s->s_memcg_to_charge, NULL); > security_sb_free(s); > put_user_ns(s->s_user_ns); > kfree(s->s_subtype); > @@ -292,6 +294,7 @@ static void __put_super(struct super_block *s) > WARN_ON(s->s_dentry_lru.node); > WARN_ON(s->s_inode_lru.node); > WARN_ON(!list_empty(&s->s_mounts)); > + mem_cgroup_set_charge_target(&s->s_memcg_to_charge, NULL); > security_sb_free(s); > fscrypt_sb_free(s); > put_user_ns(s->s_user_ns); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3afca821df32e..59407b3e7aee3 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1567,6 +1567,11 @@ struct super_block { > struct workqueue_struct *s_dio_done_wq; > struct hlist_head s_pins; > > +#ifdef CONFIG_MEMCG > + /* memcg to charge for pages allocated to this filesystem */ > + struct mem_cgroup *s_memcg_to_charge; > +#endif > + > /* > * Owning user namespace and default context in which to > * interpret filesystem uids, gids, quotas, device nodes, > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 0c5c403f4be6b..e9a64c1b8295b 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -27,6 +27,7 @@ struct obj_cgroup; > struct page; > struct mm_struct; > struct kmem_cache; > +struct super_block; > > /* Cgroup-specific page state, on top of universal node page state */ > enum memcg_stat_item { > @@ -689,7 +690,8 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg) > page_counter_read(&memcg->memory); > } > > -int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp); > +int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp, > + struct address_space *mapping); > > /** > * mem_cgroup_charge - Charge a newly allocated folio to a cgroup. 
> @@ -710,7 +712,7 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > { > if (mem_cgroup_disabled()) > return 0; > - return __mem_cgroup_charge(folio, mm, gfp); > + return __mem_cgroup_charge(folio, mm, gfp, NULL); > } > > int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, > @@ -923,6 +925,16 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) > return !!(memcg->css.flags & CSS_ONLINE); > } > > +bool is_remote_oom(struct mem_cgroup *memcg_under_oom); > +void mem_cgroup_set_charge_target(struct mem_cgroup **target, > + struct mem_cgroup *memcg); > +struct mem_cgroup *mem_cgroup_get_from_path(const char *path); > +/** > + * User is responsible for providing a buffer @buf of length @len and freeing > + * it. > + */ > +int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, size_t len); > + > void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, > int zid, int nr_pages); > > @@ -1217,8 +1229,15 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg) > return false; > } > > -static inline int mem_cgroup_charge(struct folio *folio, > - struct mm_struct *mm, gfp_t gfp) > +static inline int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > + gfp_t gfp_mask, > + struct address_space *mapping) > +{ > + return 0; > +} > + > +static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > + gfp_t gfp_mask) > { > return 0; > } > @@ -1326,6 +1345,25 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root, > { > } > > +static inline bool is_remote_oom(struct mem_cgroup *memcg_under_oom) > +{ > + return false; > +} > + > +static inline void mem_cgroup_set_charge_target(struct mem_cgroup **target, > + struct mem_cgroup *memcg) > +{ > +} > + > +static inline int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, > + size_t len) > +{ > + if (len < 1) > + return -EINVAL; > + buf[0] = '\0'; > + return 0; > +} > + > 
static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, > int (*fn)(struct task_struct *, void *), void *arg) > { > diff --git a/mm/filemap.c b/mm/filemap.c > index 6844c9816a864..75e81dfd2c580 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -903,7 +903,7 @@ noinline int __filemap_add_folio(struct address_space *mapping, > folio->index = index; > > if (!huge) { > - error = mem_cgroup_charge(folio, NULL, gfp); > + error = __mem_cgroup_charge(folio, NULL, gfp, mapping); > VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); > if (error) > goto error; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 781605e920153..389d2f2be9674 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2580,6 +2580,103 @@ void mem_cgroup_handle_over_high(void) > css_put(&memcg->css); > } > > +/* > + * Non error return value must eventually be released with css_put(). > + */ > +struct mem_cgroup *mem_cgroup_get_from_path(const char *path) > +{ > + struct file *file; > + struct cgroup_subsys_state *css; > + struct mem_cgroup *memcg; > + > + file = filp_open(path, O_DIRECTORY | O_RDONLY, 0); > + if (IS_ERR(file)) > + return (struct mem_cgroup *)file; > + > + css = css_tryget_online_from_dir(file->f_path.dentry, > + &memory_cgrp_subsys); > + if (IS_ERR(css)) > + memcg = (struct mem_cgroup *)css; > + else > + memcg = container_of(css, struct mem_cgroup, css); > + > + fput(file); > + return memcg; > +} > + > +/* > + * Get the name of the optional charge target memcg associated with @sb. This > + * is the cgroup name, not the cgroup path. 
> + */ > +int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, size_t len) > +{ > + struct mem_cgroup *memcg; > + int ret = 0; > + > + buf[0] = '\0'; > + > + rcu_read_lock(); > + memcg = rcu_dereference(sb->s_memcg_to_charge); > + if (memcg && !css_tryget_online(&memcg->css)) > + memcg = NULL; > + rcu_read_unlock(); > + > + if (!memcg) > + return 0; > + > + ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2); > + if (ret >= len / 2) > + strcpy(buf, "?"); > + else { > + char *p = mangle_path(buf, buf + len / 2, " \t\n\\"); > + > + if (p) > + *p = '\0'; > + else > + strcpy(buf, "?"); > + } > + > + css_put(&memcg->css); > + return ret < 0 ? ret : 0; > +} > + > +/* > + * Set or clear (if @memcg is NULL) charge association from file system to > + * memcg. If @memcg != NULL, then a css reference must be held by the caller to > + * ensure that the cgroup is not deleted during this operation. > + */ > +void mem_cgroup_set_charge_target(struct mem_cgroup **target, > + struct mem_cgroup *memcg) > +{ > + if (memcg) > + css_get(&memcg->css); > + memcg = xchg(target, memcg); > + if (memcg) > + css_put(&memcg->css); > +} > + > +/* > + * Returns the memcg to charge for inode pages. If non-NULL is returned, caller > + * must drop reference with css_put(). NULL indicates that the inode does not > + * have a memcg to charge, so the default process based policy should be used. 
> + */ > +static struct mem_cgroup * > +mem_cgroup_mapping_get_charge_target(struct address_space *mapping) > +{ > + struct mem_cgroup *memcg; > + > + if (!mapping) > + return NULL; > + > + rcu_read_lock(); > + memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge); > + if (memcg && !css_tryget_online(&memcg->css)) > + memcg = NULL; > + rcu_read_unlock(); > + > + return memcg; > +} > + > static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > unsigned int nr_pages) > { > @@ -6678,12 +6775,15 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, > return ret; > } > > -int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) > +int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp, > + struct address_space *mapping) > { > struct mem_cgroup *memcg; > int ret; > > - memcg = get_mem_cgroup_from_mm(mm); > + memcg = mem_cgroup_mapping_get_charge_target(mapping); > + if (!memcg) > + memcg = get_mem_cgroup_from_mm(mm); > ret = charge_memcg(folio, memcg, gfp); > css_put(&memcg->css); This function becomes very non-obvious as it's taking a folio (ex-page), mm and now mapping as arguments. I'd just add a new wrapper around charge_memcg() instead. I'd also merge the next patch into this one. Thanks!
On Mon, Nov 08, 2021 at 11:41:51PM +0000, Matthew Wilcox wrote:
> On Tue, Nov 09, 2021 at 09:10:47AM +1100, Dave Chinner wrote:
> > > +	rcu_read_lock();
> > > +	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
> >
> > Anything doing pointer chasing to obtain static, unchanging
> > superblock state is poorly implemented. The s_memcg_to_charge value
> > never changes, so this code should associate the memcg to charge
> > directly on the mapping when the mapping is first initialised by the
> > filesystem. We already do this with things like attaching address
> > space ops and mapping specific gfp masks (i.e
> > mapping_set_gfp_mask()), so this association should be set up that
> > way, too (e.g. mapping_set_memcg_to_charge()).
>
> I'm not a fan of enlarging struct address_space with another pointer
> unless it's going to be used by all/most filesystems. If this is
> destined to be a shmem-only feature, then it should be in the
> shmem_inode instead of the mapping.

Neither am I, but I'm also not a fan of the filemap code still
having to drill through the mapping to the host inode just to check
whether it needs to do special stuff for shmem inodes on every call
that adds a page to the page cache. This is just as messy and
intrusive, and the memcg code really has no business digging about
in the filesystem-specific details of the inode behind the mapping.

Hmmm. The mem_cgroup_charge() call in filemap_add_folio() passes a
null mm context, so deep in the guts it ends up getting the memcg
from active_memcg() in get_mem_cgroup_from_mm(). That ends up using
current->active_memcg, so maybe a better approach here is to have
shmem override current->active_memcg via set_active_memcg() before
it enters the generic fs paths and restore it on return...

current_fsmemcg()?

> If we are to have this for all filesystems, then let's do that properly
> and make it generic functionality from its introduction.

Fully agree.

Cheers,

Dave.
On Mon, Nov 8, 2021 at 5:18 PM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Nov 08, 2021 at 11:41:51PM +0000, Matthew Wilcox wrote: > > On Tue, Nov 09, 2021 at 09:10:47AM +1100, Dave Chinner wrote: > > > > + rcu_read_lock(); > > > > + memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge); > > > > > > Anything doing pointer chasing to obtain static, unchanging > > > superblock state is poorly implemented. The s_memcg_to_charge value never > > > changes, so this code should associate the memcg to charge directly > > > on the mapping when the mapping is first initialised by the > > > filesystem. We already do this with things like attaching address > > > space ops and mapping specific gfp masks (i.e > > > mapping_set_gfp_mask()), so this association should be set up that > > > way, too (e.g. mapping_set_memcg_to_charge()). > > > > I'm not a fan of enlarging struct address_space with another pointer > > unless it's going to be used by all/most filesystems. If this is > > destined to be a shmem-only feature, then it should be in the > > shmem_inode instead of the mapping. > > Neither am I, but I'm also not a fan of the filemap code still > having to drill through the mapping to the host inode just to check > if it needs to do special stuff for shmem inodes on every call that > adds a page to the page cache. This is just as messy and intrusive > and the memcg code really has no business digging about in the > filesystem specific details of the inode behind the mapping. > > Hmmm. The mem_cgroup_charge() call in filemap_add_folio() passes a > null mm context, so deep in the guts it ends getting the memcg from > active_memcg() in get_mem_cgroup_from_mm(). That ends up using > current->active_memcg, so maybe a better approach here is to have > shmem override current->active_memcg via set_active_memcg() before > it enters the generic fs paths and restore it on return... > > current_fsmemcg()? > Thank you for providing a detailed alternative. 
To be honest it seems a bit brittle to me, as in folks can easily
add calls to generic fs paths, forget to override the active_memcg,
and have memory charged incorrectly; but if there is no other option
and we want to make this a shmem-only feature, I can do this anyway.

> > If we are to have this for all filesystems, then let's do that properly
> > and make it generic functionality from its introduction.
>
> Fully agree.

So the tmpfs feature addresses the first 2 use cases I mention in
the cover letter. For the 3rd use case I will likely need to extend
this support to 1 disk-based filesystem, and I'm not sure which at
this point. It also looks like Roman has in mind 1 or more use cases
and may extend it to other filesystems as a result. I'm hoping that
I can provide the generic implementation and the tmpfs support, and
in follow-up patches folks can extend this to other file systems by
providing the fs-specific changes needed for that filesystem.

AFAICT with this patch the work to extend it to another file system
is to parse the memcg= option in that filesystem, set the
s_memcg_to_charge on the superblock (or mapping) of that filesystem,
and to charge s_memcg_to_charge in fs-specific code paths, so all
are fs-specific changes.

Based on this, it seems to me the suggestion is to hang the
memcg_to_charge off the mapping? I'm not sure if *most/all*
filesystems will eventually support it, but likely more than just
tmpfs.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Tue, Nov 9, 2021 at 3:56 PM Mina Almasry <almasrymina@google.com> wrote: > > On Mon, Nov 8, 2021 at 5:18 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, Nov 08, 2021 at 11:41:51PM +0000, Matthew Wilcox wrote: > > > On Tue, Nov 09, 2021 at 09:10:47AM +1100, Dave Chinner wrote: > > > > > + rcu_read_lock(); > > > > > + memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge); > > > > > > > > Anything doing pointer chasing to obtain static, unchanging > > > > superblock state is poorly implemented. The s_memcg_to_charge value never > > > > changes, so this code should associate the memcg to charge directly > > > > on the mapping when the mapping is first initialised by the > > > > filesystem. We already do this with things like attaching address > > > > space ops and mapping specific gfp masks (i.e > > > > mapping_set_gfp_mask()), so this association should be set up that > > > > way, too (e.g. mapping_set_memcg_to_charge()). > > > > > > I'm not a fan of enlarging struct address_space with another pointer > > > unless it's going to be used by all/most filesystems. If this is > > > destined to be a shmem-only feature, then it should be in the > > > shmem_inode instead of the mapping. > > > > Neither am I, but I'm also not a fan of the filemap code still > > having to drill through the mapping to the host inode just to check > > if it needs to do special stuff for shmem inodes on every call that > > adds a page to the page cache. This is just as messy and intrusive > > and the memcg code really has no business digging about in the > > filesystem specific details of the inode behind the mapping. > > > > Hmmm. The mem_cgroup_charge() call in filemap_add_folio() passes a > > null mm context, so deep in the guts it ends getting the memcg from > > active_memcg() in get_mem_cgroup_from_mm(). 
That ends up using > > current->active_memcg, so maybe a better approach here is to have > > shmem override current->active_memcg via set_active_memcg() before > > it enters the generic fs paths and restore it on return... > > > > current_fsmemcg()? > > > > Thank you for providing a detailed alternative. To be honest it seems > a bit brittle to me, as in folks can easily add calls to generic fs > paths forgetting to override the active_memcg and having memory > charged incorrectly, but if there is no other option and we want to > make this a shmem-only feature, I can do this anyway. > > > > If we are to have this for all filesystems, then let's do that properly > > > and make it generic functionality from its introduction. > > > > Fully agree. > > So the tmpfs feature addresses the first 2 usecases I mention in the > cover letter. For the 3rd usecase I will likely need to extend this > support to 1 disk-based filesystem, and I'm not sure which at this > point. It also looks like Roman has in mind 1 or more use cases and > may extend it to other filesystems as a result. I'm hoping that I can > provide the generic implementation and the tmpfs support and in follow > up patches folks can extend this to other file systems by providing > the fs-specific changes needed for that filesystem. > > AFAICT with this patch the work to extend to another file system is to > parse the memcg= option in that filesystem, set the s_memcg_to_charge > on the superblock (or mapping) of that filesystem, and to charge > s_memcg_to_charge in fs specific code paths, so all are fs-specific > changes. > > Based on this, it seems to me the suggestion is to hang the > memcg_to_charge off the mapping? I'm not sure if *most/all* > filesystems will eventually support it, but likely more than just > tmpfs. 
Greg actually points out to me off list that the patches, as
written, support remounting the tmpfs with a different memcg=
option, where future charges will be directed to the new memcg
provided by the option, so s_memcg_to_charge is more of a property
of the super_block. We could explicitly disable remounting with a
different memcg=, but I'm hoping to preserve that support if
possible.

The only way to preserve it, I think, while avoiding the pointer
chasing in generic fs paths, is for shmem to set_active_memcg()
before calling into the generic fs code; but all other filesystems
that adopt this feature will need to do the same. I'm not sure
that's the better option. Let me know what you think please. Thanks!

> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
On Mon, Nov 8, 2021 at 5:18 PM Dave Chinner <david@fromorbit.com> wrote:
[...]
> > If we are to have this for all filesystems, then let's do that properly
> > and make it generic functionality from its introduction.
>
> Fully agree.

Mina, I think supporting all filesystems might be a much cleaner
solution than adding fs-specific code. We need to:

1) Add memcg option handling in vfs_parse_fs_param() before the
   fs-specific param handling.
2) Add a new page cache memcg charging interface (similar to swap).

With (1), there is no need to change any fs-specific code. With (2),
fs codepaths will be free of memcg-specific handling. This new
interface will be used in __filemap_add_folio(),
shmem_add_to_page_cache() and collapse_file().

thanks,
Shakeel
diff --git a/fs/super.c b/fs/super.c index 3bfc0f8fbd5bc..8aafe5e4e6200 100644 --- a/fs/super.c +++ b/fs/super.c @@ -24,6 +24,7 @@ #include <linux/export.h> #include <linux/slab.h> #include <linux/blkdev.h> +#include <linux/memcontrol.h> #include <linux/mount.h> #include <linux/security.h> #include <linux/writeback.h> /* for the emergency remount stuff */ @@ -180,6 +181,7 @@ static void destroy_unused_super(struct super_block *s) up_write(&s->s_umount); list_lru_destroy(&s->s_dentry_lru); list_lru_destroy(&s->s_inode_lru); + mem_cgroup_set_charge_target(&s->s_memcg_to_charge, NULL); security_sb_free(s); put_user_ns(s->s_user_ns); kfree(s->s_subtype); @@ -292,6 +294,7 @@ static void __put_super(struct super_block *s) WARN_ON(s->s_dentry_lru.node); WARN_ON(s->s_inode_lru.node); WARN_ON(!list_empty(&s->s_mounts)); + mem_cgroup_set_charge_target(&s->s_memcg_to_charge, NULL); security_sb_free(s); fscrypt_sb_free(s); put_user_ns(s->s_user_ns); diff --git a/include/linux/fs.h b/include/linux/fs.h index 3afca821df32e..59407b3e7aee3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1567,6 +1567,11 @@ struct super_block { struct workqueue_struct *s_dio_done_wq; struct hlist_head s_pins; +#ifdef CONFIG_MEMCG + /* memcg to charge for pages allocated to this filesystem */ + struct mem_cgroup *s_memcg_to_charge; +#endif + /* * Owning user namespace and default context in which to * interpret filesystem uids, gids, quotas, device nodes, diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0c5c403f4be6b..e9a64c1b8295b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -27,6 +27,7 @@ struct obj_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct super_block; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -689,7 +690,8 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg) page_counter_read(&memcg->memory); } -int __mem_cgroup_charge(struct folio 
*folio, struct mm_struct *mm, gfp_t gfp); +int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp, + struct address_space *mapping); /** * mem_cgroup_charge - Charge a newly allocated folio to a cgroup. @@ -710,7 +712,7 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, { if (mem_cgroup_disabled()) return 0; - return __mem_cgroup_charge(folio, mm, gfp); + return __mem_cgroup_charge(folio, mm, gfp, NULL); } int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, @@ -923,6 +925,16 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) return !!(memcg->css.flags & CSS_ONLINE); } +bool is_remote_oom(struct mem_cgroup *memcg_under_oom); +void mem_cgroup_set_charge_target(struct mem_cgroup **target, + struct mem_cgroup *memcg); +struct mem_cgroup *mem_cgroup_get_from_path(const char *path); +/** + * User is responsible for providing a buffer @buf of length @len and freeing + * it. + */ +int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, size_t len); + void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, int zid, int nr_pages); @@ -1217,8 +1229,15 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg) return false; } -static inline int mem_cgroup_charge(struct folio *folio, - struct mm_struct *mm, gfp_t gfp) +static inline int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, + gfp_t gfp_mask, + struct address_space *mapping) +{ + return 0; +} + +static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, + gfp_t gfp_mask) { return 0; } @@ -1326,6 +1345,25 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root, { } +static inline bool is_remote_oom(struct mem_cgroup *memcg_under_oom) +{ + return false; +} + +static inline void mem_cgroup_set_charge_target(struct mem_cgroup **target, + struct mem_cgroup *memcg) +{ +} + +static inline int mem_cgroup_get_name_from_sb(struct super_block *sb, char 
*buf, + size_t len) +{ + if (len < 1) + return -EINVAL; + buf[0] = '\0'; + return 0; +} + static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, int (*fn)(struct task_struct *, void *), void *arg) { diff --git a/mm/filemap.c b/mm/filemap.c index 6844c9816a864..75e81dfd2c580 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -903,7 +903,7 @@ noinline int __filemap_add_folio(struct address_space *mapping, folio->index = index; if (!huge) { - error = mem_cgroup_charge(folio, NULL, gfp); + error = __mem_cgroup_charge(folio, NULL, gfp, mapping); VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); if (error) goto error; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 781605e920153..389d2f2be9674 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2580,6 +2580,103 @@ void mem_cgroup_handle_over_high(void) css_put(&memcg->css); } +/* + * Non error return value must eventually be released with css_put(). + */ +struct mem_cgroup *mem_cgroup_get_from_path(const char *path) +{ + struct file *file; + struct cgroup_subsys_state *css; + struct mem_cgroup *memcg; + + file = filp_open(path, O_DIRECTORY | O_RDONLY, 0); + if (IS_ERR(file)) + return (struct mem_cgroup *)file; + + css = css_tryget_online_from_dir(file->f_path.dentry, + &memory_cgrp_subsys); + if (IS_ERR(css)) + memcg = (struct mem_cgroup *)css; + else + memcg = container_of(css, struct mem_cgroup, css); + + fput(file); + return memcg; +} + +/* + * Get the name of the optional charge target memcg associated with @sb. This + * is the cgroup name, not the cgroup path. 
+ */ +int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, size_t len) +{ + struct mem_cgroup *memcg; + int ret = 0; + + buf[0] = '\0'; + + rcu_read_lock(); + memcg = rcu_dereference(sb->s_memcg_to_charge); + if (memcg && !css_tryget_online(&memcg->css)) + memcg = NULL; + rcu_read_unlock(); + + if (!memcg) + return 0; + + ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2); + if (ret >= len / 2) + strcpy(buf, "?"); + else { + char *p = mangle_path(buf, buf + len / 2, " \t\n\\"); + + if (p) + *p = '\0'; + else + strcpy(buf, "?"); + } + + css_put(&memcg->css); + return ret < 0 ? ret : 0; +} + +/* + * Set or clear (if @memcg is NULL) charge association from file system to + * memcg. If @memcg != NULL, then a css reference must be held by the caller to + * ensure that the cgroup is not deleted during this operation. + */ +void mem_cgroup_set_charge_target(struct mem_cgroup **target, + struct mem_cgroup *memcg) +{ + if (memcg) + css_get(&memcg->css); + memcg = xchg(target, memcg); + if (memcg) + css_put(&memcg->css); +} + +/* + * Returns the memcg to charge for inode pages. If non-NULL is returned, caller + * must drop reference with css_put(). NULL indicates that the inode does not + * have a memcg to charge, so the default process based policy should be used. 
+ */
+static struct mem_cgroup *
+mem_cgroup_mapping_get_charge_target(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg;
+
+	if (!mapping)
+		return NULL;
+
+	rcu_read_lock();
+	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
+	if (memcg && !css_tryget_online(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	return memcg;
+}
+
 static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			    unsigned int nr_pages)
 {
@@ -6678,12 +6775,15 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	return ret;
 }
 
-int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
+int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp,
+			struct address_space *mapping)
 {
 	struct mem_cgroup *memcg;
 	int ret;
 
-	memcg = get_mem_cgroup_from_mm(mm);
+	memcg = mem_cgroup_mapping_get_charge_target(mapping);
+	if (!memcg)
+		memcg = get_mem_cgroup_from_mm(mm);
 
 	ret = charge_memcg(folio, memcg, gfp);
 	css_put(&memcg->css);
diff --git a/mm/shmem.c b/mm/shmem.c
index 23c91a8beb781..01510fa8ab725 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -115,10 +115,14 @@ struct shmem_options {
 	bool full_inums;
 	int huge;
 	int seen;
+#ifdef CONFIG_MEMCG
+	struct mem_cgroup *memcg;
+#endif
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
 #define SHMEM_SEEN_INUMS 8
+#define SHMEM_SEEN_MEMCG 16
 };
 
 #ifdef CONFIG_TMPFS
@@ -709,7 +713,8 @@ static int shmem_add_to_page_cache(struct page *page,
 	page->index = index;
 
 	if (!PageSwapCache(page)) {
-		error = mem_cgroup_charge(page_folio(page), charge_mm, gfp);
+		error = __mem_cgroup_charge(page_folio(page), charge_mm, gfp,
+					    mapping);
 		if (error) {
 			if (PageTransHuge(page)) {
 				count_vm_event(THP_FILE_FALLBACK);
@@ -3342,6 +3347,7 @@ static const struct export_operations shmem_export_ops = {
 enum shmem_param {
 	Opt_gid,
 	Opt_huge,
+	Opt_memcg,
 	Opt_mode,
 	Opt_mpol,
 	Opt_nr_blocks,
@@ -3363,6 +3369,7 @@ static const struct constant_table shmem_param_enums_huge[] = {
 const struct fs_parameter_spec shmem_fs_parameters[] = {
 	fsparam_u32   ("gid",		Opt_gid),
 	fsparam_enum  ("huge",		Opt_huge,  shmem_param_enums_huge),
+	fsparam_string("memcg",		Opt_memcg),
 	fsparam_u32oct("mode",		Opt_mode),
 	fsparam_string("mpol",		Opt_mpol),
 	fsparam_string("nr_blocks",	Opt_nr_blocks),
@@ -3379,6 +3386,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 	struct shmem_options *ctx = fc->fs_private;
 	struct fs_parse_result result;
 	unsigned long long size;
+	struct mem_cgroup *memcg;
 	char *rest;
 	int opt;
 
@@ -3412,6 +3420,17 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 			goto bad_value;
 		ctx->seen |= SHMEM_SEEN_INODES;
 		break;
+#ifdef CONFIG_MEMCG
+	case Opt_memcg:
+		if (ctx->memcg)
+			css_put(&ctx->memcg->css);
+		memcg = mem_cgroup_get_from_path(param->string);
+		if (IS_ERR(memcg))
+			goto bad_value;
+		ctx->memcg = memcg;
+		ctx->seen |= SHMEM_SEEN_MEMCG;
+		break;
+#endif
 	case Opt_mode:
 		ctx->mode = result.uint_32 & 07777;
 		break;
@@ -3573,6 +3592,14 @@ static int shmem_reconfigure(struct fs_context *fc)
 	}
 	raw_spin_unlock(&sbinfo->stat_lock);
 	mpol_put(mpol);
+#ifdef CONFIG_MEMCG
+	if (ctx->seen & SHMEM_SEEN_MEMCG && ctx->memcg) {
+		mem_cgroup_set_charge_target(&fc->root->d_sb->s_memcg_to_charge,
+					     ctx->memcg);
+		css_put(&ctx->memcg->css);
+		ctx->memcg = NULL;
+	}
+#endif
 	return 0;
 out:
 	raw_spin_unlock(&sbinfo->stat_lock);
@@ -3582,6 +3609,11 @@ static int shmem_reconfigure(struct fs_context *fc)
 static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
+	int err;
+	char *buf = __getname();
+
+	if (!buf)
+		return -ENOMEM;
 
 	if (sbinfo->max_blocks != shmem_default_max_blocks())
 		seq_printf(seq, ",size=%luk",
@@ -3625,7 +3657,13 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
 #endif
 	shmem_show_mpol(seq, sbinfo->mpol);
-	return 0;
+	/* Memory cgroup binding: memcg=cgroup_name */
+	err = mem_cgroup_get_name_from_sb(root->d_sb, buf, PATH_MAX);
+	if (!err && buf[0] != '\0')
+		seq_printf(seq, ",memcg=%s", buf);
+
+	__putname(buf);
+	return err;
 }
 #endif /* CONFIG_TMPFS */
 
@@ -3710,6 +3748,14 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 		sb->s_flags |= SB_POSIXACL;
 #endif
 	uuid_gen(&sb->s_uuid);
+#ifdef CONFIG_MEMCG
+	if (ctx->memcg) {
+		mem_cgroup_set_charge_target(&sb->s_memcg_to_charge,
+					     ctx->memcg);
+		css_put(&ctx->memcg->css);
+		ctx->memcg = NULL;
+	}
+#endif
 
 	inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0,
 				VM_NORESERVE);
 	if (!inode)
Add memcg= option to shmem mount.

Users can specify this option at mount time and all data page charges
will be charged to the memcg supplied.

Signed-off-by: Mina Almasry <almasrymina@google.com>

Cc: Michal Hocko <mhocko@suse.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Greg Thelen <gthelen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: riel@surriel.com
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 fs/super.c                 |   3 ++
 include/linux/fs.h         |   5 ++
 include/linux/memcontrol.h |  46 ++++++++++++++--
 mm/filemap.c               |   2 +-
 mm/memcontrol.c            | 104 ++++++++++++++++++++++++++++++++++++-
 mm/shmem.c                 |  50 +++++++++++++++++-
 6 files changed, 201 insertions(+), 9 deletions(-)

--
2.34.0.rc0.344.g81b53c2807-goog
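For reference, the intended usage would look roughly like the sketch below. The cgroup name (tmpfs-memcg), the mountpoint, and the 1GiB limit are illustrative assumptions, not part of the patch; per shmem_parse_one(), the memcg= value is a cgroup path resolved via mem_cgroup_get_from_path(). Running this requires root and a kernel carrying this series:

```shell
# Create a memory cgroup to own the tmpfs charges (cgroup v2 layout assumed).
mkdir /sys/fs/cgroup/tmpfs-memcg
echo $((1 << 30)) > /sys/fs/cgroup/tmpfs-memcg/memory.max

# All data pages of this mount are charged to that memcg, regardless of
# which process (in whichever cgroup) touches the files.
mount -t tmpfs -o memcg=/tmpfs-memcg tmpfs /mnt/tmpfs

# shmem_show_options() reports the binding in the mount options.
grep /mnt/tmpfs /proc/mounts
```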