Message ID | 20231003001828.2554080-3-nphamcs@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | hugetlb memcg accounting | expand |
On Mon, Oct 2, 2023 at 5:18 PM Nhat Pham <nphamcs@gmail.com> wrote: > > Currently, hugetlb memory usage is not acounted for in the memory > controller, which could lead to memory overprotection for cgroups with > hugetlb-backed memory. This has been observed in our production system. > > For instance, here is one of our usecases: suppose there are two 32G > containers. The machine is booted with hugetlb_cma=6G, and each > container may or may not use up to 3 gigantic page, depending on the > workload within it. The rest is anon, cache, slab, etc. We can set the > hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. > But it is very difficult to configure memory.max to keep overall > consumption, including anon, cache, slab etc. fair. > > What we have had to resort to is to constantly poll hugetlb usage and > readjust memory.max. Similar procedure is done to other memory limits > (memory.low for e.g). However, this is rather cumbersome and buggy. > Furthermore, when there is a delay in memory limits correction, (for e.g > when hugetlb usage changes within consecutive runs of the userspace > agent), the system could be in an over/underprotected state. > > This patch rectifies this issue by charging the memcg when the hugetlb > folio is utilized, and uncharging when the folio is freed (analogous to > the hugetlb controller). Note that we do not charge when the folio is > allocated to the hugetlb pool, because at this point it is not owned by > any memcg. > > Some caveats to consider: > * This feature is only available on cgroup v2. > * There is no hugetlb pool management involved in the memory > controller. As stated above, hugetlb folios are only charged towards > the memory controller when it is used. Host overcommit management > has to consider it when configuring hard limits. > * Failure to charge towards the memcg results in SIGBUS. This could > happen even if the hugetlb pool still has pages (but the cgroup > limit is hit and reclaim attempt fails). > * When this feature is enabled, hugetlb pages contribute to memory > reclaim protection. low, min limits tuning must take into account > hugetlb memory. > * Hugetlb pages utilized while this option is not selected will not > be tracked by the memory controller (even if cgroup v2 is remounted > later on). (special thanks to Michal Hocko, who pointed most of these out). > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 29 ++++++++++++++++++++ > include/linux/cgroup-defs.h | 5 ++++ > include/linux/memcontrol.h | 9 +++++++ > kernel/cgroup/cgroup.c | 15 ++++++++++- > mm/hugetlb.c | 35 ++++++++++++++++++++----- > mm/memcontrol.c | 35 +++++++++++++++++++++++++ > 6 files changed, 120 insertions(+), 8 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 622a7f28db1f..606b2e0eac4b 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options. > relying on the original semantics (e.g. specifying bogusly > high 'bypass' protection values at higher tree levels). > > + memory_hugetlb_accounting > + Count HugeTLB memory usage towards the cgroup's overall > + memory usage for the memory controller (for the purpose of > + statistics reporting and memory protetion). This is a new > + behavior that could regress existing setups, so it must be > + explicitly opted in with this mount option. > + > + A few caveats to keep in mind: > + > + * There is no HugeTLB pool management involved in the memory > + controller. The pre-allocated pool does not belong to anyone. > + Specifically, when a new HugeTLB folio is allocated to > + the pool, it is not accounted for from the perspective of the > + memory controller. It is only charged to a cgroup when it is > + actually used (for e.g at page fault time). Host memory > + overcommit management has to consider this when configuring > + hard limits. In general, HugeTLB pool management should be > + done via other mechanisms (such as the HugeTLB controller). > + * Failure to charge a HugeTLB folio to the memory controller > + results in SIGBUS. This could happen even if the HugeTLB pool > + still has pages available (but the cgroup limit is hit and > + reclaim attempt fails). > + * Charging HugeTLB memory towards the memory controller affects > + memory protection and reclaim dynamics. Any userspace tuning > + (of low, min limits for e.g) needs to take this into account. > + * HugeTLB pages utilized while this option is not selected > + will not be tracked by the memory controller (even if cgroup > + v2 is remounted later on). > + > > Organizing Processes and Threads > -------------------------------- > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index f1b3151ac30b..8641f4320c98 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -115,6 +115,11 @@ enum { > * Enable recursive subtree protection > */ > CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18), > + > + /* > + * Enable hugetlb accounting for the memory controller. > + */ > + CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19), > }; > > /* cftype->flags */ > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 42bf7e9b1a2f..a827e2129790 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -679,6 +679,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > return __mem_cgroup_charge(folio, mm, gfp); > } > > +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > + long nr_pages); > + > int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > gfp_t gfp, swp_entry_t entry); > void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); > @@ -1262,6 +1265,12 @@ static inline int mem_cgroup_charge(struct folio *folio, > return 0; > } > > +static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, > + gfp_t gfp, long nr_pages) > +{ > + return 0; > +} > + > static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, > struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) > { > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 1fb7f562289d..f11488b18ceb 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -1902,6 +1902,7 @@ enum cgroup2_param { > Opt_favordynmods, > Opt_memory_localevents, > Opt_memory_recursiveprot, > + Opt_memory_hugetlb_accounting, > nr__cgroup2_params > }; > > @@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = { > fsparam_flag("favordynmods", Opt_favordynmods), > fsparam_flag("memory_localevents", Opt_memory_localevents), > fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot), > + fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting), > {} > }; > > @@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param > case Opt_memory_recursiveprot: > ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; > return 0; > + case Opt_memory_hugetlb_accounting: > + ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; > + return 0; > } > return -EINVAL; > } > @@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) > cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; > else > cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT; > + > + if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) > + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; > + else > + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; > } > } > > @@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root > seq_puts(seq, ",memory_localevents"); > if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) > seq_puts(seq, ",memory_recursiveprot"); > + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) > + seq_puts(seq, ",memory_hugetlb_accounting"); > return 0; > } > > @@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, > "nsdelegate\n" > "favordynmods\n" > "memory_localevents\n" > - "memory_recursiveprot\n"); > + "memory_recursiveprot\n" > + "memory_hugetlb_accounting\n"); > } > static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features); > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index de220e3ff8be..74472e911b0a 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio) > pages_per_huge_page(h), folio); > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > pages_per_huge_page(h), folio); > + mem_cgroup_uncharge(folio); > if (restore_reserve) > h->resv_huge_pages++; > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > struct hugepage_subpool *spool = subpool_vma(vma); > struct hstate *h = hstate_vma(vma); > struct folio *folio; > - long map_chg, map_commit; > + long map_chg, map_commit, nr_pages = pages_per_huge_page(h); > long gbl_chg; > - int ret, idx; > + int memcg_charge_ret, ret, idx; > struct hugetlb_cgroup *h_cg = NULL; > + struct mem_cgroup *memcg; > bool deferred_reserve; > + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL; > + > + memcg = get_mem_cgroup_from_current(); > + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages); > + if (memcg_charge_ret == -ENOMEM) { > + mem_cgroup_put(memcg); > + return ERR_PTR(-ENOMEM); > + } > > idx = hstate_index(h); > /* > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > * code of zero indicates a reservation exists (no change). > */ > map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); > - if (map_chg < 0) > + if (map_chg < 0) { > + if (!memcg_charge_ret) > + mem_cgroup_cancel_charge(memcg, nr_pages); > + mem_cgroup_put(memcg); > return ERR_PTR(-ENOMEM); > + } > > /* > * Processes that did not create the mapping will have no > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > */ > if (map_chg || avoid_reserve) { > gbl_chg = hugepage_subpool_get_pages(spool, 1); > - if (gbl_chg < 0) { > - vma_end_reservation(h, vma, addr); > - return ERR_PTR(-ENOSPC); > - } > + if (gbl_chg < 0) > + goto out_end_reservation; > > /* > * Even though there was no reservation in the region/reserve > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > pages_per_huge_page(h), folio); > } > + > + if (!memcg_charge_ret) > + mem_cgroup_commit_charge(folio, memcg); > + mem_cgroup_put(memcg); > + > return folio; > > out_uncharge_cgroup: > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > out_subpool_put: > if (map_chg || avoid_reserve) > hugepage_subpool_put_pages(spool, 1); > +out_end_reservation: > vma_end_reservation(h, vma, addr); > + if (!memcg_charge_ret) > + mem_cgroup_cancel_charge(memcg, nr_pages); > + mem_cgroup_put(memcg); > return ERR_PTR(-ENOSPC); > } > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0219befeae38..6660684f6f97 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -7085,6 +7085,41 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) > return ret; > } > > +/** > + * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio > + * @memcg: memcg to charge. > + * @gfp: reclaim mode. > + * @nr_pages: number of pages to charge. > + * > + * This function is called when allocating a huge page folio to determine if > + * the memcg has the capacity for it. It does not commit the charge yet, > + * as the hugetlb folio itself has not been obtained from the hugetlb pool. > + * > + * Once we have obtained the hugetlb folio, we can call > + * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the > + * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect > + * of try_charge(). > + * > + * Returns 0 on success. Otherwise, an error code is returned. > + */ > +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, > + long nr_pages) > +{ > + /* > + * If hugetlb memcg charging is not enabled, do not fail hugetlb allocation, > + * but do not attempt to commit charge later (or cancel on error) either. > + */ > + if (mem_cgroup_disabled() || !memcg || > + !cgroup_subsys_on_dfl(memory_cgrp_subsys) || > + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)) > + return -EOPNOTSUPP; > + > + if (try_charge(memcg, gfp, nr_pages)) > + return -ENOMEM; > + > + return 0; > +} > + > /** > * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. > * @folio: folio to charge. > -- > 2.34.1
On Mon, Oct 02, 2023 at 05:18:27PM -0700, Nhat Pham wrote: > Currently, hugetlb memory usage is not acounted for in the memory > controller, which could lead to memory overprotection for cgroups with > hugetlb-backed memory. This has been observed in our production system. > > For instance, here is one of our usecases: suppose there are two 32G > containers. The machine is booted with hugetlb_cma=6G, and each > container may or may not use up to 3 gigantic page, depending on the > workload within it. The rest is anon, cache, slab, etc. We can set the > hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. > But it is very difficult to configure memory.max to keep overall > consumption, including anon, cache, slab etc. fair. > > What we have had to resort to is to constantly poll hugetlb usage and > readjust memory.max. Similar procedure is done to other memory limits > (memory.low for e.g). However, this is rather cumbersome and buggy. > Furthermore, when there is a delay in memory limits correction, (for e.g > when hugetlb usage changes within consecutive runs of the userspace > agent), the system could be in an over/underprotected state. > > This patch rectifies this issue by charging the memcg when the hugetlb > folio is utilized, and uncharging when the folio is freed (analogous to > the hugetlb controller). Note that we do not charge when the folio is > allocated to the hugetlb pool, because at this point it is not owned by > any memcg. > > Some caveats to consider: > * This feature is only available on cgroup v2. > * There is no hugetlb pool management involved in the memory > controller. As stated above, hugetlb folios are only charged towards > the memory controller when it is used. Host overcommit management > has to consider it when configuring hard limits. > * Failure to charge towards the memcg results in SIGBUS. This could > happen even if the hugetlb pool still has pages (but the cgroup > limit is hit and reclaim attempt fails). > * When this feature is enabled, hugetlb pages contribute to memory > reclaim protection. low, min limits tuning must take into account > hugetlb memory. > * Hugetlb pages utilized while this option is not selected will not > be tracked by the memory controller (even if cgroup v2 is remounted > later on). > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Mon 02-10-23 17:18:27, Nhat Pham wrote: > Currently, hugetlb memory usage is not acounted for in the memory > controller, which could lead to memory overprotection for cgroups with > hugetlb-backed memory. This has been observed in our production system. > > For instance, here is one of our usecases: suppose there are two 32G > containers. The machine is booted with hugetlb_cma=6G, and each > container may or may not use up to 3 gigantic page, depending on the > workload within it. The rest is anon, cache, slab, etc. We can set the > hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. > But it is very difficult to configure memory.max to keep overall > consumption, including anon, cache, slab etc. fair. > > What we have had to resort to is to constantly poll hugetlb usage and > readjust memory.max. Similar procedure is done to other memory limits > (memory.low for e.g). However, this is rather cumbersome and buggy. Could you expand some more on how this _helps_ memory.low? The hugetlb memory is not reclaimable so whatever portion of its memcg consumption will be "protected from the reclaim". Consider this parent / \ A B low=50% low=0 current=40% current=60% We have an external memory pressure and the reclaim should prefer B as A is under its low limit, correct? But now consider that the predominant consumption of B is hugetlb which would mean the memory reclaim cannot do much for B and so the A's protection might be breached. As an admin (or a tool) you need to know about hugetlb as a potential contributor to this behavior (sure mlocked memory would behave the same but mlock rarely consumes huge amount of memory in my experience). Without the accounting there might not be any external pressure in the first place. All that being said, I do not see how adding hugetlb into accounting makes low, min limits management any easier.
On Tue, Oct 03, 2023 at 02:58:58PM +0200, Michal Hocko wrote: > On Mon 02-10-23 17:18:27, Nhat Pham wrote: > > Currently, hugetlb memory usage is not acounted for in the memory > > controller, which could lead to memory overprotection for cgroups with > > hugetlb-backed memory. This has been observed in our production system. > > > > For instance, here is one of our usecases: suppose there are two 32G > > containers. The machine is booted with hugetlb_cma=6G, and each > > container may or may not use up to 3 gigantic page, depending on the > > workload within it. The rest is anon, cache, slab, etc. We can set the > > hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. > > But it is very difficult to configure memory.max to keep overall > > consumption, including anon, cache, slab etc. fair. > > > > What we have had to resort to is to constantly poll hugetlb usage and > > readjust memory.max. Similar procedure is done to other memory limits > > (memory.low for e.g). However, this is rather cumbersome and buggy. > > Could you expand some more on how this _helps_ memory.low? The > hugetlb memory is not reclaimable so whatever portion of its memcg > consumption will be "protected from the reclaim". Consider this > parent > / \ > A B > low=50% low=0 > current=40% current=60% > > We have an external memory pressure and the reclaim should prefer B as A > is under its low limit, correct? But now consider that the predominant > consumption of B is hugetlb which would mean the memory reclaim cannot > do much for B and so the A's protection might be breached. > > As an admin (or a tool) you need to know about hugetlb as a potential > contributor to this behavior (sure mlocked memory would behave the same > but mlock rarely consumes huge amount of memory in my experience). > Without the accounting there might not be any external pressure in the > first place. > > All that being said, I do not see how adding hugetlb into accounting > makes low, min limits management any easier. It's important to differentiate the cgroup usecases. One is of course the cloud/virtual server scenario, where you set the hard limits to whatever the customer paid for, and don't know and don't care about the workload running inside. In that case, memory.low and overcommit aren't really safe to begin with due to unknown unreclaimable mem. The other common usecase is the datacenter where you run your own applications. You understand their workingset and requirements, and configure and overcommit the containers in a way where jobs always meet their SLAs. E.g. if multiple containers spike, memory.low is set such that interactive workloads are prioritized over batch jobs, and both have priority over routine system management tasks. This is arguably the only case where it's safe to use memory.low. You have to know what's reclaimable and what isn't, otherwise you cannot know that memory.low will even do anything, and isolation breaks down. So we already have that knowledge: mlocked sections, how much anon is without swap space, and how much memory must not be reclaimed (even if it is reclaimable) for the workload to meet its SLAs. Hugetlb doesn't really complicate this equation - we already have to consider it unreclaimable workingset from an overcommit POV on those hosts. The reason this patch helps in this scenario is that the service teams are usually different from the containers/infra team. The service understands their workload and declares its workingset. But it's the infra team running the containers that currently has to go and find out if they're using hugetlb and tweak the cgroups. Bugs and untimeliness in the tweaking have caused multiple production incidents already. And both teams are regularly confused when there are large parts of the workload that don't show up in memory.current which both sides monitor. Keep in mind that these systems are already pretty complex, with multiple overcommitted containers and system-level activity. The current hugetlb quirk can heavily distort what a given container is doing on the host. With this patch, the service can declare its workingset, the container team can configure the container, and memory.current makes sense to everybody. The workload parameters are pretty reliable, but if the service team gets it wrong and we underprotect the workload, and/or its unreclaimable memory exceeds what was declared, the infra team gets alarms on elevated LOW breaching events and investigates if its an infra problem or a service spec problem that needs escalation. So the case you describe above only happens when mistakes are made, and we detect and rectify them. In the common case, hugetlb is part of the recognized workingset, and we configure memory.low to cut off only known optional and reclaimable memory under pressure.
On 10/02/23 17:18, Nhat Pham wrote: > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index de220e3ff8be..74472e911b0a 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio) > pages_per_huge_page(h), folio); > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > pages_per_huge_page(h), folio); > + mem_cgroup_uncharge(folio); > if (restore_reserve) > h->resv_huge_pages++; > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > struct hugepage_subpool *spool = subpool_vma(vma); > struct hstate *h = hstate_vma(vma); > struct folio *folio; > - long map_chg, map_commit; > + long map_chg, map_commit, nr_pages = pages_per_huge_page(h); > long gbl_chg; > - int ret, idx; > + int memcg_charge_ret, ret, idx; > struct hugetlb_cgroup *h_cg = NULL; > + struct mem_cgroup *memcg; > bool deferred_reserve; > + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL; > + > + memcg = get_mem_cgroup_from_current(); > + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages); > + if (memcg_charge_ret == -ENOMEM) { > + mem_cgroup_put(memcg); > + return ERR_PTR(-ENOMEM); > + } > > idx = hstate_index(h); > /* > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > * code of zero indicates a reservation exists (no change). > */ > map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); > - if (map_chg < 0) > + if (map_chg < 0) { > + if (!memcg_charge_ret) > + mem_cgroup_cancel_charge(memcg, nr_pages); > + mem_cgroup_put(memcg); > return ERR_PTR(-ENOMEM); > + } > > /* > * Processes that did not create the mapping will have no > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > */ > if (map_chg || avoid_reserve) { > gbl_chg = hugepage_subpool_get_pages(spool, 1); > - if (gbl_chg < 0) { > - vma_end_reservation(h, vma, addr); > - return ERR_PTR(-ENOSPC); > - } > + if (gbl_chg < 0) > + goto out_end_reservation; > > /* > * Even though there was no reservation in the region/reserve > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > pages_per_huge_page(h), folio); > } > + > + if (!memcg_charge_ret) > + mem_cgroup_commit_charge(folio, memcg); > + mem_cgroup_put(memcg); > + > return folio; > > out_uncharge_cgroup: > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > out_subpool_put: > if (map_chg || avoid_reserve) > hugepage_subpool_put_pages(spool, 1); > +out_end_reservation: > vma_end_reservation(h, vma, addr); > + if (!memcg_charge_ret) > + mem_cgroup_cancel_charge(memcg, nr_pages); > + mem_cgroup_put(memcg); > return ERR_PTR(-ENOSPC); > } > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in free_huge_folio. During migration, huge pages are allocated via alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no charging for the migration target page and we uncharge the source page. It looks like there will be no charge for the huge page after migration? If my analysis above is correct, then we may need to be careful about this accounting. We may not want both source and target pages to be charged at the same time.
On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 10/02/23 17:18, Nhat Pham wrote: > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index de220e3ff8be..74472e911b0a 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio) > > pages_per_huge_page(h), folio); > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > > pages_per_huge_page(h), folio); > > + mem_cgroup_uncharge(folio); > > if (restore_reserve) > > h->resv_huge_pages++; > > > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > struct hugepage_subpool *spool = subpool_vma(vma); > > struct hstate *h = hstate_vma(vma); > > struct folio *folio; > > - long map_chg, map_commit; > > + long map_chg, map_commit, nr_pages = pages_per_huge_page(h); > > long gbl_chg; > > - int ret, idx; > > + int memcg_charge_ret, ret, idx; > > struct hugetlb_cgroup *h_cg = NULL; > > + struct mem_cgroup *memcg; > > bool deferred_reserve; > > + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL; > > + > > + memcg = get_mem_cgroup_from_current(); > > + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages); > > + if (memcg_charge_ret == -ENOMEM) { > > + mem_cgroup_put(memcg); > > + return ERR_PTR(-ENOMEM); > > + } > > > > idx = hstate_index(h); > > /* > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > * code of zero indicates a reservation exists (no change). > > */ > > map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); > > - if (map_chg < 0) > > + if (map_chg < 0) { > > + if (!memcg_charge_ret) > > + mem_cgroup_cancel_charge(memcg, nr_pages); > > + mem_cgroup_put(memcg); > > return ERR_PTR(-ENOMEM); > > + } > > > > /* > > * Processes that did not create the mapping will have no > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > */ > > if (map_chg || avoid_reserve) { > > gbl_chg = hugepage_subpool_get_pages(spool, 1); > > - if (gbl_chg < 0) { > > - vma_end_reservation(h, vma, addr); > > - return ERR_PTR(-ENOSPC); > > - } > > + if (gbl_chg < 0) > > + goto out_end_reservation; > > > > /* > > * Even though there was no reservation in the region/reserve > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > > pages_per_huge_page(h), folio); > > } > > + > > + if (!memcg_charge_ret) > > + mem_cgroup_commit_charge(folio, memcg); > > + mem_cgroup_put(memcg); > > + > > return folio; > > > > out_uncharge_cgroup: > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > out_subpool_put: > > if (map_chg || avoid_reserve) > > hugepage_subpool_put_pages(spool, 1); > > +out_end_reservation: > > vma_end_reservation(h, vma, addr); > > + if (!memcg_charge_ret) > > + mem_cgroup_cancel_charge(memcg, nr_pages); > > + mem_cgroup_put(memcg); > > return ERR_PTR(-ENOSPC); > > } > > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in > free_huge_folio. During migration, huge pages are allocated via > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no > charging for the migration target page and we uncharge the source page. > It looks like there will be no charge for the huge page after migration? > Ah I see! This is a bit subtle indeed. For the hugetlb controller, it looks like they update the cgroup info inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate() to transfer the hugetlb cgroup info to the destination folio. Perhaps we can do something analogous here. > If my analysis above is correct, then we may need to be careful about > this accounting. We may not want both source and target pages to be > charged at the same time. We can create a variant of mem_cgroup_migrate that does not double charge, but instead just copy the mem_cgroup information to the new folio, and then clear that info from the old folio. That way the memory usage counters are untouched. Somebody with more expertise on migration should fact check me of course :) > -- > Mike Kravetz
On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote: > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 10/02/23 17:18, Nhat Pham wrote: > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index de220e3ff8be..74472e911b0a 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio) > > > pages_per_huge_page(h), folio); > > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > > > pages_per_huge_page(h), folio); > > > + mem_cgroup_uncharge(folio); > > > if (restore_reserve) > > > h->resv_huge_pages++; > > > > > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > struct hugepage_subpool *spool = subpool_vma(vma); > > > struct hstate *h = hstate_vma(vma); > > > struct folio *folio; > > > - long map_chg, map_commit; > > > + long map_chg, map_commit, nr_pages = pages_per_huge_page(h); > > > long gbl_chg; > > > - int ret, idx; > > > + int memcg_charge_ret, ret, idx; > > > struct hugetlb_cgroup *h_cg = NULL; > > > + struct mem_cgroup *memcg; > > > bool deferred_reserve; > > > + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL; > > > + > > > + memcg = get_mem_cgroup_from_current(); > > > + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages); > > > + if (memcg_charge_ret == -ENOMEM) { > > > + mem_cgroup_put(memcg); > > > + return ERR_PTR(-ENOMEM); > > > + } > > > > > > idx = hstate_index(h); > > > /* > > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > * code of zero indicates a reservation exists (no change). > > > */ > > > map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); > > > - if (map_chg < 0) > > > + if (map_chg < 0) { > > > + if (!memcg_charge_ret) > > > + mem_cgroup_cancel_charge(memcg, nr_pages); > > > + mem_cgroup_put(memcg); > > > return ERR_PTR(-ENOMEM); > > > + } > > > > > > /* > > > * Processes that did not create the mapping will have no > > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > */ > > > if (map_chg || avoid_reserve) { > > > gbl_chg = hugepage_subpool_get_pages(spool, 1); > > > - if (gbl_chg < 0) { > > > - vma_end_reservation(h, vma, addr); > > > - return ERR_PTR(-ENOSPC); > > > - } > > > + if (gbl_chg < 0) > > > + goto out_end_reservation; > > > > > > /* > > > * Even though there was no reservation in the region/reserve > > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > > > pages_per_huge_page(h), folio); > > > } > > > + > > > + if (!memcg_charge_ret) > > > + mem_cgroup_commit_charge(folio, memcg); > > > + mem_cgroup_put(memcg); > > > + > > > return folio; > > > > > > out_uncharge_cgroup: > > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > out_subpool_put: > > > if (map_chg || avoid_reserve) > > > hugepage_subpool_put_pages(spool, 1); > > > +out_end_reservation: > > > vma_end_reservation(h, vma, addr); > > > + if (!memcg_charge_ret) > > > + mem_cgroup_cancel_charge(memcg, nr_pages); > > > + mem_cgroup_put(memcg); > > > return ERR_PTR(-ENOSPC); > > > } > > > > > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in > > free_huge_folio. During migration, huge pages are allocated via > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no > > charging for the migration target page and we uncharge the source page. > > It looks like there will be no charge for the huge page after migration? > > > > Ah I see! This is a bit subtle indeed. > > For the hugetlb controller, it looks like they update the cgroup info > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate() > to transfer the hugetlb cgroup info to the destination folio. > > Perhaps we can do something analogous here. > > > If my analysis above is correct, then we may need to be careful about > > this accounting. We may not want both source and target pages to be > > charged at the same time. > > We can create a variant of mem_cgroup_migrate that does not double > charge, but instead just copy the mem_cgroup information to the new > folio, and then clear that info from the old folio. That way the memory > usage counters are untouched. > > Somebody with more expertise on migration should fact check me > of course :) The only reason mem_cgroup_migrate() double charges right now is because it's used by replace_page_cache_folio(). In that context, the isolation of the old page isn't quite as thorough as with migration, so it cannot transfer and uncharge directly. This goes back a long time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40 If you rename the current implementation to mem_cgroup_replace_page() for that one caller, you can add a mem_cgroup_migrate() variant which is charge neutral and clears old->memcg_data. This can be used for regular and hugetlb page migration. Something like this (totally untested): diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a4d3282493b6..17ec45bf3653 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) if (mem_cgroup_disabled()) return; - /* Page cache replacement: new folio already charged? */ - if (folio_memcg(new)) - return; - memcg = folio_memcg(old); VM_WARN_ON_ONCE_FOLIO(!memcg, old); if (!memcg) return; - /* Force-charge the new page. The old one will be freed soon */ - if (!mem_cgroup_is_root(memcg)) { - page_counter_charge(&memcg->memory, nr_pages); - if (do_memsw_account()) - page_counter_charge(&memcg->memsw, nr_pages); - } - - css_get(&memcg->css); + /* Transfer the charge and the css ref */ commit_charge(new, memcg); - - local_irq_save(flags); - mem_cgroup_charge_statistics(memcg, nr_pages); - memcg_check_events(memcg, folio_nid(new)); - local_irq_restore(flags); + old->memcg_data = 0; } DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote: > > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > On 10/02/23 17:18, Nhat Pham wrote: > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > > index de220e3ff8be..74472e911b0a 100644 > > > > --- a/mm/hugetlb.c > > > > +++ b/mm/hugetlb.c > > > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio) > > > > pages_per_huge_page(h), folio); > > > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > > > > pages_per_huge_page(h), folio); > > > > + mem_cgroup_uncharge(folio); > > > > if (restore_reserve) > > > > h->resv_huge_pages++; > > > > > > > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > > struct hugepage_subpool *spool = subpool_vma(vma); > > > > struct hstate *h = hstate_vma(vma); > > > > struct folio *folio; > > > > - long map_chg, map_commit; > > > > + long map_chg, map_commit, nr_pages = pages_per_huge_page(h); > > > > long gbl_chg; > > > > - int ret, idx; > > > > + int memcg_charge_ret, ret, idx; > > > > struct hugetlb_cgroup *h_cg = NULL; > > > > + struct mem_cgroup *memcg; > > > > bool deferred_reserve; > > > > + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL; > > > > + > > > > + memcg = get_mem_cgroup_from_current(); > > > > + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages); > > > > + if (memcg_charge_ret == -ENOMEM) { > > > > + mem_cgroup_put(memcg); > > > > + return ERR_PTR(-ENOMEM); > > > > + } > > > > > > > > idx = hstate_index(h); > > > > /* > > > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > > * code of zero indicates a reservation exists (no change). > > > > */ > > > > map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); > > > > - if (map_chg < 0) > > > > + if (map_chg < 0) { > > > > + if (!memcg_charge_ret) > > > > + mem_cgroup_cancel_charge(memcg, nr_pages); > > > > + mem_cgroup_put(memcg); > > > > return ERR_PTR(-ENOMEM); > > > > + } > > > > > > > > /* > > > > * Processes that did not create the mapping will have no > > > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > > */ > > > > if (map_chg || avoid_reserve) { > > > > gbl_chg = hugepage_subpool_get_pages(spool, 1); > > > > - if (gbl_chg < 0) { > > > > - vma_end_reservation(h, vma, addr); > > > > - return ERR_PTR(-ENOSPC); > > > > - } > > > > + if (gbl_chg < 0) > > > > + goto out_end_reservation; > > > > > > > > /* > > > > * Even though there was no reservation in the region/reserve > > > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), > > > > pages_per_huge_page(h), folio); > > > > } > > > > + > > > > + if (!memcg_charge_ret) > > > > + mem_cgroup_commit_charge(folio, memcg); > > > > + mem_cgroup_put(memcg); > > > > + > > > > return folio; > > > > > > > > out_uncharge_cgroup: > > > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, > > > > out_subpool_put: > > > > if (map_chg || avoid_reserve) > > > > hugepage_subpool_put_pages(spool, 1); > > > > +out_end_reservation: > > > > vma_end_reservation(h, vma, addr); > > > > + if (!memcg_charge_ret) > > > > + mem_cgroup_cancel_charge(memcg, nr_pages); > > > > + mem_cgroup_put(memcg); > > > > return ERR_PTR(-ENOSPC); > > > > } > > > > > > > > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in > > > free_huge_folio. During migration, huge pages are allocated via > > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no > > > charging for the migration target page and we uncharge the source page. > > > It looks like there will be no charge for the huge page after migration? > > > > > > > Ah I see! This is a bit subtle indeed. > > > > For the hugetlb controller, it looks like they update the cgroup info > > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate() > > to transfer the hugetlb cgroup info to the destination folio. > > > > Perhaps we can do something analogous here. > > > > > If my analysis above is correct, then we may need to be careful about > > > this accounting. We may not want both source and target pages to be > > > charged at the same time. > > > > We can create a variant of mem_cgroup_migrate that does not double > > charge, but instead just copy the mem_cgroup information to the new > > folio, and then clear that info from the old folio. That way the memory > > usage counters are untouched. > > > > Somebody with more expertise on migration should fact check me > > of course :) > > The only reason mem_cgroup_migrate() double charges right now is > because it's used by replace_page_cache_folio(). In that context, the > isolation of the old page isn't quite as thorough as with migration, > so it cannot transfer and uncharge directly. This goes back a long > time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40 > > If you rename the current implementation to mem_cgroup_replace_page() > for that one caller, you can add a mem_cgroup_migrate() variant which > is charge neutral and clears old->memcg_data. This can be used for > regular and hugetlb page migration. Something like this (totally > untested): > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index a4d3282493b6..17ec45bf3653 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) > if (mem_cgroup_disabled()) > return; > > - /* Page cache replacement: new folio already charged? */ > - if (folio_memcg(new)) > - return; > - > memcg = folio_memcg(old); > VM_WARN_ON_ONCE_FOLIO(!memcg, old); > if (!memcg) > return; > > - /* Force-charge the new page. The old one will be freed soon */ > - if (!mem_cgroup_is_root(memcg)) { > - page_counter_charge(&memcg->memory, nr_pages); > - if (do_memsw_account()) > - page_counter_charge(&memcg->memsw, nr_pages); > - } > - > - css_get(&memcg->css); > + /* Transfer the charge and the css ref */ > commit_charge(new, memcg); > - > - local_irq_save(flags); > - mem_cgroup_charge_statistics(memcg, nr_pages); > - memcg_check_events(memcg, folio_nid(new)); > - local_irq_restore(flags); > + old->memcg_data = 0; > } > > DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key); > Ah, I like this. Will send a fixlet based on this :) I was scratching my head trying to figure out why we were doing the double charging in the first place. Thanks for the context, Johannes!
On 10/03/23 15:09, Nhat Pham wrote: > On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote: > > > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 10/02/23 17:18, Nhat Pham wrote: > > > > > > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in > > > > free_huge_folio. During migration, huge pages are allocated via > > > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no > > > > charging for the migration target page and we uncharge the source page. > > > > It looks like there will be no charge for the huge page after migration? > > > > > > > > > > Ah I see! This is a bit subtle indeed. > > > > > > For the hugetlb controller, it looks like they update the cgroup info > > > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate() > > > to transfer the hugetlb cgroup info to the destination folio. > > > > > > Perhaps we can do something analogous here. > > > > > > > If my analysis above is correct, then we may need to be careful about > > > > this accounting. We may not want both source and target pages to be > > > > charged at the same time. > > > > > > We can create a variant of mem_cgroup_migrate that does not double > > > charge, but instead just copy the mem_cgroup information to the new > > > folio, and then clear that info from the old folio. That way the memory > > > usage counters are untouched. > > > > > > Somebody with more expertise on migration should fact check me > > > of course :) > > > > The only reason mem_cgroup_migrate() double charges right now is > > because it's used by replace_page_cache_folio(). In that context, the > > isolation of the old page isn't quite as thorough as with migration, > > so it cannot transfer and uncharge directly. This goes back a long > > time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40 > > > > If you rename the current implementation to mem_cgroup_replace_page() > > for that one caller, you can add a mem_cgroup_migrate() variant which > > is charge neutral and clears old->memcg_data. This can be used for > > regular and hugetlb page migration. Something like this (totally > > untested): > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index a4d3282493b6..17ec45bf3653 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) > > if (mem_cgroup_disabled()) > > return; > > > > - /* Page cache replacement: new folio already charged? */ > > - if (folio_memcg(new)) > > - return; > > - > > memcg = folio_memcg(old); > > VM_WARN_ON_ONCE_FOLIO(!memcg, old); > > if (!memcg) > > return; > > > > - /* Force-charge the new page. The old one will be freed soon */ > > - if (!mem_cgroup_is_root(memcg)) { > > - page_counter_charge(&memcg->memory, nr_pages); > > - if (do_memsw_account()) > > - page_counter_charge(&memcg->memsw, nr_pages); > > - } > > - > > - css_get(&memcg->css); > > + /* Transfer the charge and the css ref */ > > commit_charge(new, memcg); > > - > > - local_irq_save(flags); > > - mem_cgroup_charge_statistics(memcg, nr_pages); > > - memcg_check_events(memcg, folio_nid(new)); > > - local_irq_restore(flags); > > + old->memcg_data = 0; > > } > > > > DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key); > > > > Ah, I like this. Will send a fixlet based on this :) > I was scratching my head trying to figure out why we were > doing the double charging in the first place. Thanks for the context, > Johannes! Be sure to check for code similar to this in folio_migrate_flags: void folio_migrate_flags(struct folio *newfolio, struct folio *folio) { ... if (!folio_test_hugetlb(folio)) mem_cgroup_migrate(folio, newfolio); } There are many places where hugetlb is special cased.
On Tue, Oct 3, 2023 at 3:42 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 10/03/23 15:09, Nhat Pham wrote: > > On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote: > > > > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > On 10/02/23 17:18, Nhat Pham wrote: > > > > > > > > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in > > > > > free_huge_folio. During migration, huge pages are allocated via > > > > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no > > > > > charging for the migration target page and we uncharge the source page. > > > > > It looks like there will be no charge for the huge page after migration? > > > > > > > > > > > > > Ah I see! This is a bit subtle indeed. > > > > > > > > For the hugetlb controller, it looks like they update the cgroup info > > > > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate() > > > > to transfer the hugetlb cgroup info to the destination folio. > > > > > > > > Perhaps we can do something analogous here. > > > > > > > > > If my analysis above is correct, then we may need to be careful about > > > > > this accounting. We may not want both source and target pages to be > > > > > charged at the same time. > > > > > > > > We can create a variant of mem_cgroup_migrate that does not double > > > > charge, but instead just copy the mem_cgroup information to the new > > > > folio, and then clear that info from the old folio. That way the memory > > > > usage counters are untouched. > > > > > > > > Somebody with more expertise on migration should fact check me > > > > of course :) > > > > > > The only reason mem_cgroup_migrate() double charges right now is > > > because it's used by replace_page_cache_folio(). In that context, the > > > isolation of the old page isn't quite as thorough as with migration, > > > so it cannot transfer and uncharge directly. This goes back a long > > > time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40 > > > > > > If you rename the current implementation to mem_cgroup_replace_page() > > > for that one caller, you can add a mem_cgroup_migrate() variant which > > > is charge neutral and clears old->memcg_data. This can be used for > > > regular and hugetlb page migration. Something like this (totally > > > untested): > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index a4d3282493b6..17ec45bf3653 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new) > > > if (mem_cgroup_disabled()) > > > return; > > > > > > - /* Page cache replacement: new folio already charged? */ > > > - if (folio_memcg(new)) > > > - return; > > > - > > > memcg = folio_memcg(old); > > > VM_WARN_ON_ONCE_FOLIO(!memcg, old); > > > if (!memcg) > > > return; > > > > > > - /* Force-charge the new page. The old one will be freed soon */ > > > - if (!mem_cgroup_is_root(memcg)) { > > > - page_counter_charge(&memcg->memory, nr_pages); > > > - if (do_memsw_account()) > > > - page_counter_charge(&memcg->memsw, nr_pages); > > > - } > > > - > > > - css_get(&memcg->css); > > > + /* Transfer the charge and the css ref */ > > > commit_charge(new, memcg); > > > - > > > - local_irq_save(flags); > > > - mem_cgroup_charge_statistics(memcg, nr_pages); > > > - memcg_check_events(memcg, folio_nid(new)); > > > - local_irq_restore(flags); > > > + old->memcg_data = 0; > > > } > > > > > > DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key); > > > > > > > Ah, I like this. Will send a fixlet based on this :) > > I was scratching my head trying to figure out why we were > > doing the double charging in the first place. Thanks for the context, > > Johannes! > > Be sure to check for code similar to this in folio_migrate_flags: > > void folio_migrate_flags(struct folio *newfolio, struct folio *folio) > { > ... > if (!folio_test_hugetlb(folio)) > mem_cgroup_migrate(folio, newfolio); > } > > There are many places where hugetlb is special cased. Yeah makes sense. I'm actually gonna take advantage of this, and remove the test hugetlb check here, so that it will also migrate the memcg metadata in this case too. See the new patch I just sent out. > -- > Mike Kravetz
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 622a7f28db1f..606b2e0eac4b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options. relying on the original semantics (e.g. specifying bogusly high 'bypass' protection values at higher tree levels). + memory_hugetlb_accounting + Count HugeTLB memory usage towards the cgroup's overall + memory usage for the memory controller (for the purpose of + statistics reporting and memory protetion). This is a new + behavior that could regress existing setups, so it must be + explicitly opted in with this mount option. + + A few caveats to keep in mind: + + * There is no HugeTLB pool management involved in the memory + controller. The pre-allocated pool does not belong to anyone. + Specifically, when a new HugeTLB folio is allocated to + the pool, it is not accounted for from the perspective of the + memory controller. It is only charged to a cgroup when it is + actually used (for e.g at page fault time). Host memory + overcommit management has to consider this when configuring + hard limits. In general, HugeTLB pool management should be + done via other mechanisms (such as the HugeTLB controller). + * Failure to charge a HugeTLB folio to the memory controller + results in SIGBUS. This could happen even if the HugeTLB pool + still has pages available (but the cgroup limit is hit and + reclaim attempt fails). + * Charging HugeTLB memory towards the memory controller affects + memory protection and reclaim dynamics. Any userspace tuning + (of low, min limits for e.g) needs to take this into account. + * HugeTLB pages utilized while this option is not selected + will not be tracked by the memory controller (even if cgroup + v2 is remounted later on). + Organizing Processes and Threads -------------------------------- diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index f1b3151ac30b..8641f4320c98 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -115,6 +115,11 @@ enum { * Enable recursive subtree protection */ CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18), + + /* + * Enable hugetlb accounting for the memory controller. + */ + CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19), }; /* cftype->flags */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 42bf7e9b1a2f..a827e2129790 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -679,6 +679,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, return __mem_cgroup_charge(folio, mm, gfp); } +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, + long nr_pages); + int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); @@ -1262,6 +1265,12 @@ static inline int mem_cgroup_charge(struct folio *folio, return 0; } +static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, + gfp_t gfp, long nr_pages) +{ + return 0; +} + static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) { diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 1fb7f562289d..f11488b18ceb 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1902,6 +1902,7 @@ enum cgroup2_param { Opt_favordynmods, Opt_memory_localevents, Opt_memory_recursiveprot, + Opt_memory_hugetlb_accounting, nr__cgroup2_params }; @@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = { fsparam_flag("favordynmods", Opt_favordynmods), fsparam_flag("memory_localevents", Opt_memory_localevents), fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot), + fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting), {} }; @@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param case Opt_memory_recursiveprot: ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; return 0; + case Opt_memory_hugetlb_accounting: + ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; + return 0; } return -EINVAL; } @@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; else cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT; + + if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; + else + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; } } @@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root seq_puts(seq, ",memory_localevents"); if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) seq_puts(seq, ",memory_recursiveprot"); + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) + seq_puts(seq, ",memory_hugetlb_accounting"); return 0; } @@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, "nsdelegate\n" "favordynmods\n" "memory_localevents\n" - "memory_recursiveprot\n"); + "memory_recursiveprot\n" + "memory_hugetlb_accounting\n"); } static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index de220e3ff8be..74472e911b0a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio) pages_per_huge_page(h), folio); hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), pages_per_huge_page(h), folio); + mem_cgroup_uncharge(folio); if (restore_reserve) h->resv_huge_pages++; @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, struct hugepage_subpool *spool = subpool_vma(vma); struct hstate *h = hstate_vma(vma); struct folio *folio; - long map_chg, map_commit; + long map_chg, map_commit, nr_pages = pages_per_huge_page(h); long gbl_chg; - int ret, idx; + int memcg_charge_ret, ret, idx; struct hugetlb_cgroup *h_cg = NULL; + struct mem_cgroup *memcg; bool deferred_reserve; + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL; + + memcg = get_mem_cgroup_from_current(); + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages); + if (memcg_charge_ret == -ENOMEM) { + mem_cgroup_put(memcg); + return ERR_PTR(-ENOMEM); + } idx = hstate_index(h); /* @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, * code of zero indicates a reservation exists (no change). */ map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); - if (map_chg < 0) + if (map_chg < 0) { + if (!memcg_charge_ret) + mem_cgroup_cancel_charge(memcg, nr_pages); + mem_cgroup_put(memcg); return ERR_PTR(-ENOMEM); + } /* * Processes that did not create the mapping will have no @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, */ if (map_chg || avoid_reserve) { gbl_chg = hugepage_subpool_get_pages(spool, 1); - if (gbl_chg < 0) { - vma_end_reservation(h, vma, addr); - return ERR_PTR(-ENOSPC); - } + if (gbl_chg < 0) + goto out_end_reservation; /* * Even though there was no reservation in the region/reserve @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h), pages_per_huge_page(h), folio); } + + if (!memcg_charge_ret) + mem_cgroup_commit_charge(folio, memcg); + mem_cgroup_put(memcg); + return folio; out_uncharge_cgroup: @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); +out_end_reservation: vma_end_reservation(h, vma, addr); + if (!memcg_charge_ret) + mem_cgroup_cancel_charge(memcg, nr_pages); + mem_cgroup_put(memcg); return ERR_PTR(-ENOSPC); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0219befeae38..6660684f6f97 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7085,6 +7085,41 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) return ret; } +/** + * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio + * @memcg: memcg to charge. + * @gfp: reclaim mode. + * @nr_pages: number of pages to charge. + * + * This function is called when allocating a huge page folio to determine if + * the memcg has the capacity for it. It does not commit the charge yet, + * as the hugetlb folio itself has not been obtained from the hugetlb pool. + * + * Once we have obtained the hugetlb folio, we can call + * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the + * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect + * of try_charge(). + * + * Returns 0 on success. Otherwise, an error code is returned. + */ +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, + long nr_pages) +{ + /* + * If hugetlb memcg charging is not enabled, do not fail hugetlb allocation, + * but do not attempt to commit charge later (or cancel on error) either. + */ + if (mem_cgroup_disabled() || !memcg || + !cgroup_subsys_on_dfl(memory_cgrp_subsys) || + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)) + return -EOPNOTSUPP; + + if (try_charge(memcg, gfp, nr_pages)) + return -ENOMEM; + + return 0; +} + /** * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. * @folio: folio to charge.
Currently, hugetlb memory usage is not acounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. For instance, here is one of our usecases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic page, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab etc. fair. What we have had to resort to is to constantly poll hugetlb usage and readjust memory.max. Similar procedure is done to other memory limits (memory.low for e.g). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limits correction, (for e.g when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state. This patch rectifies this issue by charging the memcg when the hugetlb folio is utilized, and uncharging when the folio is freed (analogous to the hugetlb controller). Note that we do not charge when the folio is allocated to the hugetlb pool, because at this point it is not owned by any memcg. Some caveats to consider: * This feature is only available on cgroup v2. * There is no hugetlb pool management involved in the memory controller. As stated above, hugetlb folios are only charged towards the memory controller when it is used. Host overcommit management has to consider it when configuring hard limits. * Failure to charge towards the memcg results in SIGBUS. This could happen even if the hugetlb pool still has pages (but the cgroup limit is hit and reclaim attempt fails). * When this feature is enabled, hugetlb pages contribute to memory reclaim protection. low, min limits tuning must take into account hugetlb memory. * Hugetlb pages utilized while this option is not selected will not be tracked by the memory controller (even if cgroup v2 is remounted later on). Signed-off-by: Nhat Pham <nphamcs@gmail.com> --- Documentation/admin-guide/cgroup-v2.rst | 29 ++++++++++++++++++++ include/linux/cgroup-defs.h | 5 ++++ include/linux/memcontrol.h | 9 +++++++ kernel/cgroup/cgroup.c | 15 ++++++++++- mm/hugetlb.c | 35 ++++++++++++++++++++----- mm/memcontrol.c | 35 +++++++++++++++++++++++++ 6 files changed, 120 insertions(+), 8 deletions(-)