Message ID: 20230926194949.2637078-1-nphamcs@gmail.com (mailing list archive)
Series: hugetlb memcg accounting
On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote: > > Currently, hugetlb memory usage is not acounted for in the memory > controller, which could lead to memory overprotection for cgroups with > hugetlb-backed memory. This has been observed in our production system. > > This patch series rectifies this issue by charging the memcg when the > hugetlb folio is allocated, and uncharging when the folio is freed. In > addition, a new selftest is added to demonstrate and verify this new > behavior. > > Nhat Pham (2): > hugetlb: memcg: account hugetlb-backed memory in memory controller > selftests: add a selftest to verify hugetlb usage in memcg > > MAINTAINERS | 2 + > fs/hugetlbfs/inode.c | 2 +- > include/linux/hugetlb.h | 6 +- > include/linux/memcontrol.h | 8 + > mm/hugetlb.c | 23 +- > mm/memcontrol.c | 40 ++++ > tools/testing/selftests/cgroup/.gitignore | 1 + > tools/testing/selftests/cgroup/Makefile | 2 + > .../selftests/cgroup/test_hugetlb_memcg.c | 222 ++++++++++++++++++ > 9 files changed, 297 insertions(+), 9 deletions(-) > create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c > > -- > 2.34.1 > We've had this behavior at Google for a long time, and we're actually getting rid of it. hugetlb pages are a precious resource that should be accounted for separately. They are not just any memory, they are physically contiguous memory, charging them the same as any other region of the same size ended up not making sense, especially not for larger hugetlb page sizes. Additionally, if this behavior is changed just like that, there will be quite a few workloads that will break badly because they'll hit their limits immediately - imagine a container that uses 1G hugetlb pages to back something large (a database, a VM), and 'plain' memory for control processes. What do your workloads do? Is it not possible for you to account for hugetlb pages separately? Sure, it can be annoying to have to deal with 2 separate totals that you need to take into account, but again, hugetlb pages are a resource that is best dealt with separately. - Frank
Hi Frank, On Tue, Sep 26, 2023 at 01:50:10PM -0700, Frank van der Linden wrote: > On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > Currently, hugetlb memory usage is not acounted for in the memory > > controller, which could lead to memory overprotection for cgroups with > > hugetlb-backed memory. This has been observed in our production system. > > > > This patch series rectifies this issue by charging the memcg when the > > hugetlb folio is allocated, and uncharging when the folio is freed. In > > addition, a new selftest is added to demonstrate and verify this new > > behavior. > > > > Nhat Pham (2): > > hugetlb: memcg: account hugetlb-backed memory in memory controller > > selftests: add a selftest to verify hugetlb usage in memcg > > > > MAINTAINERS | 2 + > > fs/hugetlbfs/inode.c | 2 +- > > include/linux/hugetlb.h | 6 +- > > include/linux/memcontrol.h | 8 + > > mm/hugetlb.c | 23 +- > > mm/memcontrol.c | 40 ++++ > > tools/testing/selftests/cgroup/.gitignore | 1 + > > tools/testing/selftests/cgroup/Makefile | 2 + > > .../selftests/cgroup/test_hugetlb_memcg.c | 222 ++++++++++++++++++ > > 9 files changed, 297 insertions(+), 9 deletions(-) > > create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c > > > > -- > > 2.34.1 > > > > We've had this behavior at Google for a long time, and we're actually > getting rid of it. hugetlb pages are a precious resource that should > be accounted for separately. They are not just any memory, they are > physically contiguous memory, charging them the same as any other > region of the same size ended up not making sense, especially not for > larger hugetlb page sizes. I agree that on one hand they're a limited resource, and some form of access control makes sense. There is the hugetlb cgroup controller that allows for tracking and apportioning them per-cgroups. But on the other hand they're also still just host memory that a cgroup can consume, which is the domain of memcg. Those two aren't mutually exclusive. It makes sense to set a limit on a cgroup's access to hugetlb. It also makes sense that the huge pages a cgroup IS using count toward its memory limit, where they displace file cache and anonymous pages under pressure. Or that they're considered when determining degree of protection from global pressure. This isn't unlike e.g. kernel memory being special in that it consumes lowmem and isn't reclaimable. This shows up in total memory, while it was also tracked and limited separately. (Separate control disappeared for lack of a good enforcement mechanism - but hugetlb has that.) The fact that memory consumed by hugetlb is currently not considered inside memcg (host memory accounting and control) is inconsistent. It has been quite confusing to our service owners and complicating things for our containers team. For example, jobs need to describe their overall memory size in order to match them to machines and co-locate them. Based on that parameter the container limits as well as protection (memory.low) from global pressure is set. Currently, there are ugly hacks in place to subtract any hugetlb quota from the container config - otherwise the limits and protection settings would be way too big if a large part of the host memory consumption isn't a part of it. This has been quite cumbersome and error prone. 
> Additionally, if this behavior is changed just like that, there will > be quite a few workloads that will break badly because they'll hit > their limits immediately - imagine a container that uses 1G hugetlb > pages to back something large (a database, a VM), and 'plain' memory > for control processes. I agree with you there. This could break existing setups. We've added new consumers to memcg in the past without thinking too hard about it, but hugetlb often makes up a huge portion of a group's overall memory footprint. And we *do* have those subtraction hacks in place that would then fail in the other direction. A cgroup mountflag makes sense for this to ease the transition.
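A minimal, illustrative sketch (not from the thread) of the userspace workaround described above: because hugetlb usage is invisible to memcg today, an agent derives the container's memory.max by subtracting its hugetlb quota from the job's declared size. The cgroup path, job size, and error handling are hypothetical; hugetlb.1GB.max and memory.max are the standard cgroup v2 interface files.

/*
 * Sketch of the "subtraction hack": derive the memcg limit from the job
 * size minus the hugetlb quota, because hugetlb is not charged to memcg.
 * Cgroup path and job size are made up for illustration.
 */
#include <stdio.h>

#define JOB_SIZE_BYTES	(32ULL << 30)	/* job declares "I need 32G" */

static unsigned long long read_ull(const char *path)
{
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		/* note: a literal "max" in the file is ignored here */
		fscanf(f, "%llu", &val);
		fclose(f);
	}
	return val;
}

int main(void)
{
	unsigned long long hugetlb_quota =
		read_ull("/sys/fs/cgroup/job/hugetlb.1GB.max");
	FILE *f = fopen("/sys/fs/cgroup/job/memory.max", "w");

	if (!f)
		return 1;
	/* memcg only "sees" the non-hugetlb part of the job */
	fprintf(f, "%llu\n", JOB_SIZE_BYTES - hugetlb_quota);
	fclose(f);
	return 0;
}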
On Tue, Sep 26, 2023 at 1:50 PM Frank van der Linden <fvdl@google.com> wrote: > > On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > Currently, hugetlb memory usage is not acounted for in the memory > > controller, which could lead to memory overprotection for cgroups with > > hugetlb-backed memory. This has been observed in our production system. > > > > This patch series rectifies this issue by charging the memcg when the > > hugetlb folio is allocated, and uncharging when the folio is freed. In > > addition, a new selftest is added to demonstrate and verify this new > > behavior. > > > > Nhat Pham (2): > > hugetlb: memcg: account hugetlb-backed memory in memory controller > > selftests: add a selftest to verify hugetlb usage in memcg > > > > MAINTAINERS | 2 + > > fs/hugetlbfs/inode.c | 2 +- > > include/linux/hugetlb.h | 6 +- > > include/linux/memcontrol.h | 8 + > > mm/hugetlb.c | 23 +- > > mm/memcontrol.c | 40 ++++ > > tools/testing/selftests/cgroup/.gitignore | 1 + > > tools/testing/selftests/cgroup/Makefile | 2 + > > .../selftests/cgroup/test_hugetlb_memcg.c | 222 ++++++++++++++++++ > > 9 files changed, 297 insertions(+), 9 deletions(-) > > create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c > > > > -- > > 2.34.1 > > > > We've had this behavior at Google for a long time, and we're actually > getting rid of it. hugetlb pages are a precious resource that should > be accounted for separately. They are not just any memory, they are > physically contiguous memory, charging them the same as any other > region of the same size ended up not making sense, especially not for > larger hugetlb page sizes. I agree hugetlb is a special kind of resource. But as Johannes pointed out, it is still a form of memory. Semantically, its usage should be modulated by the memory controller. We do have the HugeTLB controller for hugetlb-specific restriction, and where appropriate we definitely should take advantage of it. But it does not fix the hole we have in memory usage reporting, as well as (over)protection and reclaim dynamics. Hence the need for the userspace hack (as Johannes described): manually adding/subtracting HugeTLB usage where applicable. This is not only inelegant, but also cumbersome and buggy. > > Additionally, if this behavior is changed just like that, there will > be quite a few workloads that will break badly because they'll hit > their limits immediately - imagine a container that uses 1G hugetlb > pages to back something large (a database, a VM), and 'plain' memory > for control processes. > > What do your workloads do? Is it not possible for you to account for > hugetlb pages separately? Sure, it can be annoying to have to deal > with 2 separate totals that you need to take into account, but again, > hugetlb pages are a resource that is best dealt with separately. > Johannes beat me to it - he described our use case, and what we have hacked together to temporarily get around the issue. A knob/flag to turn on/off this behavior sounds good to me. > - Frank Thanks for the comments, Frank!
On Tue 26-09-23 12:49:47, Nhat Pham wrote: > Currently, hugetlb memory usage is not acounted for in the memory > controller, which could lead to memory overprotection for cgroups with > hugetlb-backed memory. This has been observed in our production system. > > This patch series rectifies this issue by charging the memcg when the > hugetlb folio is allocated, and uncharging when the folio is freed. In > addition, a new selftest is added to demonstrate and verify this new > behavior. The primary reason why hugetlb is living outside of memcg (and the core MM as well) is that it doesn't really fit the whole scheme. In several aspects. First and foremost, it is an independently managed resource with its own pool management, use and lifetime. There is no notion of memory reclaim, and this makes a huge difference for a pool that might consume a considerable amount of memory. While this is the case for many kernel allocations as well, they usually do not consume considerable portions of the accounted memory. This makes it really tricky to handle limit enforcement gracefully. Another important aspect comes from the lifetime semantics: proper reservation accounting and management needs to handle mmap time rather than the usual allocation path. While pages are allocated they do not belong to anybody, and only later, at #PF time (or at read time for an fs-backed mapping), is the ownership established. That makes it really hard to manage memory as a whole under memcg anyway, as a large part of that pool sits without an owner yet cannot be used for any other purpose. These and more reasons were behind the earlier decision to have a dedicated hugetlb controller. I will also Nack having hugetlb pages accounted by default. This would break any setups which mix normal and hugetlb memory with memcg limits applied.
On Tue 26-09-23 18:14:14, Johannes Weiner wrote: [...] > The fact that memory consumed by hugetlb is currently not considered > inside memcg (host memory accounting and control) is inconsistent. It > has been quite confusing to our service owners and complicating things > for our containers team. I do understand how that is confusing and inconsistent as well. Hugetlb has been bringing confusion throughout its existence, I am afraid. As noted in my other reply, though, I am not sure the hugetlb pool can be reasonably incorporated with a sane semantic. Neither the regular allocation nor the hugetlb reservation/actual use can fall back to the pool of the other. This makes them two different things, each hitting its own failure cases that require dedicated handling. Just off the top of my head, these are cases I do not see an easy way out of:
- hugetlb charge failure has two failure modes - pool empty or memcg limit reached. The former is not recoverable and should fail without any further intervention; the latter might benefit from reclaiming.
- !hugetlb memory charge failure cannot consider any hugetlb pages - they are implicit memory.min protection, so it is impossible to manage reclaim protection without knowledge of the hugetlb use.
- there is no way to control the hugetlb pool distribution by memcg limits. How do we distinguish reservations from actual use?
- the pre-allocated pool consumes memory without any actual owner until it is actually used, and even that has two stages (reserved and really used). This makes it really hard to manage memory as a whole when there is a considerable amount of hugetlb memory preallocated.
I am pretty sure there are many more interesting cases.
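To make the two-stage lifetime above concrete, here is a minimal sketch (not from the thread, and assuming a 2M default huge page size): the reservation is taken at mmap() time, while the actual hugetlb folio - and, with this series applied, the memcg charge - only appears at first fault.

/*
 * Sketch of the reservation vs. use stages for anonymous hugetlb.
 * Assumes a 2M default huge page size; error handling kept minimal.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>

#define SZ_2M	(2UL << 20)

int main(void)
{
	/* Stage 1: mmap() reserves one huge page; it has no owner yet. */
	char *p = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Stage 2: the first touch faults in an actual hugetlb folio via
	 * alloc_hugetlb_folio(); this is where the proposed memcg charge
	 * happens, and where ownership is established.
	 */
	memset(p, 0, SZ_2M);

	munmap(p, SZ_2M);
	return 0;
}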
On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote: > On Tue 26-09-23 18:14:14, Johannes Weiner wrote: > [...] > > The fact that memory consumed by hugetlb is currently not considered > > inside memcg (host memory accounting and control) is inconsistent. It > > has been quite confusing to our service owners and complicating things > > for our containers team. > > I do understand how that is confusing and inconsistent as well. Hugetlb > is bringing throughout its existence I am afraid. > > As noted in other reply though I am not sure hugeltb pool can be > reasonably incorporated with a sane semantic. Neither of the regular > allocation nor the hugetlb reservation/actual use can fallback to the > pool of the other. This makes them 2 different things each hitting their > own failure cases that require a dedicated handling. > > Just from top of my head these are cases I do not see easy way out from: > - hugetlb charge failure has two failure modes - pool empty > or memcg limit reached. The former is not recoverable and > should fail without any further intervention the latter might > benefit from reclaiming. > - !hugetlb memory charge failure cannot consider any hugetlb > pages - they are implicit memory.min protection so it is > impossible to manage reclaim protection without having a > knowledge of the hugetlb use. > - there is no way to control the hugetlb pool distribution by > memcg limits. How do we distinguish reservations from actual > use? > - pre-allocated pool is consuming memory without any actual > owner until it is actually used and even that has two stages > (reserved and really used). This makes it really hard to > manage memory as whole when there is a considerable amount of > hugetlb memore preallocated. It's important to distinguish hugetlb access policy from memory use policy. This patch isn't about hugetlb access, it's about general memory use. Hugetlb access policy is a separate domain with separate answers. Preallocating is a privileged operation, for access control there is the hugetlb cgroup controller etc. What's missing is that once you get past the access restrictions and legitimately get your hands on huge pages, that memory use gets reflected in memory.current and exerts pressure on *other* memory inside the group, such as anon or optimistic cache allocations. Note that hugetlb *can* be allocated on demand. It's unexpected that when an application optimistically allocates a couple of 2M hugetlb pages those aren't reflected in its memory.current. The same is true for hugetlb_cma. If the gigantic pages aren't currently allocated to a cgroup, that CMA memory can be used for movable memory elsewhere. The points you and Frank raise are reasons and scenarios where additional hugetlb access control is necessary - preallocation, limited availability of 1G pages etc. But they're not reasons against charging faulted in hugetlb to the memcg *as well*. My point is we need both. One to manage competition over hugetlb, because it has unique limitations. The other to manage competition over host memory which hugetlb is a part of. Here is a usecase from our fleet. Imagine a configuration with two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic page, depending on the workload within it. The rest is anon, cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But how do you configure memory.max to keep *overall* consumption, including anon, cache, slab etc. 
fair? If used hugetlb is charged, you can just set memory.max=32G regardless of the workload inside. Without it, you'd have to constantly poll hugetlb usage and readjust memory.max!
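A sketch of that configuration (the cgroup path is illustrative, the values are the ones from the example above): the hugetlb controller caps access to the gigantic page pool, while memory.max caps overall consumption once used hugetlb is charged to memcg.

/*
 * Sketch of the two-controller setup from the example above.
 * hugetlb.1GB.max and memory.max are cgroup v2 interface files;
 * the cgroup path is illustrative.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* at most 3 gigantic pages out of the shared hugetlb_cma=6G pool */
	write_str("/sys/fs/cgroup/container1/hugetlb.1GB.max", "3221225472");
	/* overall budget: 32G, no polling or readjustment needed */
	write_str("/sys/fs/cgroup/container1/memory.max", "34359738368");
	return 0;
}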
On Wed, Sep 27, 2023 at 9:44 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote: > > On Tue 26-09-23 18:14:14, Johannes Weiner wrote: > > [...] > > > The fact that memory consumed by hugetlb is currently not considered > > > inside memcg (host memory accounting and control) is inconsistent. It > > > has been quite confusing to our service owners and complicating things > > > for our containers team. > > > > I do understand how that is confusing and inconsistent as well. Hugetlb > > is bringing throughout its existence I am afraid. > > > > As noted in other reply though I am not sure hugeltb pool can be > > reasonably incorporated with a sane semantic. Neither of the regular > > allocation nor the hugetlb reservation/actual use can fallback to the > > pool of the other. This makes them 2 different things each hitting their > > own failure cases that require a dedicated handling. > > > > Just from top of my head these are cases I do not see easy way out from: > > - hugetlb charge failure has two failure modes - pool empty > > or memcg limit reached. The former is not recoverable and > > should fail without any further intervention the latter might > > benefit from reclaiming. > > - !hugetlb memory charge failure cannot consider any hugetlb > > pages - they are implicit memory.min protection so it is > > impossible to manage reclaim protection without having a > > knowledge of the hugetlb use. > > - there is no way to control the hugetlb pool distribution by > > memcg limits. How do we distinguish reservations from actual > > use? > > - pre-allocated pool is consuming memory without any actual > > owner until it is actually used and even that has two stages > > (reserved and really used). This makes it really hard to > > manage memory as whole when there is a considerable amount of > > hugetlb memore preallocated. > > It's important to distinguish hugetlb access policy from memory use > policy. This patch isn't about hugetlb access, it's about general > memory use. > > Hugetlb access policy is a separate domain with separate > answers. Preallocating is a privileged operation, for access control > there is the hugetlb cgroup controller etc. > > What's missing is that once you get past the access restrictions and > legitimately get your hands on huge pages, that memory use gets > reflected in memory.current and exerts pressure on *other* memory > inside the group, such as anon or optimistic cache allocations. > > Note that hugetlb *can* be allocated on demand. It's unexpected that > when an application optimistically allocates a couple of 2M hugetlb > pages those aren't reflected in its memory.current. The same is true > for hugetlb_cma. If the gigantic pages aren't currently allocated to a > cgroup, that CMA memory can be used for movable memory elsewhere. > > The points you and Frank raise are reasons and scenarios where > additional hugetlb access control is necessary - preallocation, > limited availability of 1G pages etc. But they're not reasons against > charging faulted in hugetlb to the memcg *as well*. > > My point is we need both. One to manage competition over hugetlb, > because it has unique limitations. The other to manage competition > over host memory which hugetlb is a part of. > > Here is a usecase from our fleet. > > Imagine a configuration with two 32G containers. The machine is booted > with hugetlb_cma=6G, and each container may or may not use up to 3 > gigantic page, depending on the workload within it. 
> The rest is anon, cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to > 3G to enforce hugetlb fairness. But how do you configure memory.max to > keep *overall* consumption, including anon, cache, slab etc. fair? > > If used hugetlb is charged, you can just set memory.max=32G regardless > of the workload inside. > > Without it, you'd have to constantly poll hugetlb usage and readjust > memory.max! Yep, and I'd like to add that this could and has caused issues in our production system when there is a delay in correcting the memory limits (low or max). The userspace agent in charge of correcting these only runs periodically, and between consecutive runs the system could be in an over/underprotected state. An instantaneous charge towards the memory controller would close this gap. I think we need both a HugeTLB controller and memory controller accounting.
On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote: > On Tue 26-09-23 12:49:47, Nhat Pham wrote: > > Currently, hugetlb memory usage is not acounted for in the memory > > controller, which could lead to memory overprotection for cgroups with > > hugetlb-backed memory. This has been observed in our production system. > > > > This patch series rectifies this issue by charging the memcg when the > > hugetlb folio is allocated, and uncharging when the folio is freed. In > > addition, a new selftest is added to demonstrate and verify this new > > behavior. > > The primary reason why hugetlb is living outside of memcg (and the core > MM as well) is that it doesn't really fit the whole scheme. In several > aspects. First and the foremost it is an independently managed resource > with its own pool management, use and lifetime. Honestly, the simpler explanation is that few people have used hugetlb in regular, containerized non-HPC workloads. Hugetlb has historically been much more special, and it retains a specialness that warrants e.g. the hugetlb cgroup container. But it has also made strides with hugetlb_cma, migratability, madvise support etc. that allows much more on-demand use. It's no longer the case that you just put a static pool of memory aside during boot and only a few blessed applications are using it. For example, we're using hugetlb_cma very broadly with generic containers. The CMA region is fully usable by movable non-huge stuff until huge pages are allocated in it. With the hugetlb controller you can define a maximum number of hugetlb pages that can be used per container. But what if that container isn't using any? Why shouldn't it be allowed to use its overall memory allowance for anon and cache instead? With hugetlb being more dynamic, it becomes the same problem that we had with dedicated tcp and kmem pools. It didn't make sense to fail a random slab allocation when you still have memory headroom or can reclaim some cache. Nowadays, the same problem applies to hugetlb. > There is no notion of memory reclaim and this makes a huge difference > for the pool that might consume considerable amount of memory. While > this is the case for many kernel allocations as well they usually do not > consume considerable portions of the accounted memory. This makes it > really tricky to handle limit enforcement gracefully. I don't think that's true. For some workloads, network buffers can absolutely dominate. And they work just fine with cgroup limits. It isn't a problem that they aren't reclaimable themselves, it's just important that they put pressure on stuff that is. So that if you use 80% hugetlb, the other memory is forced to stay in the remaining 20%, or it OOMs; and that if you don't use hugetlb, the group is still allowed to use the full 100% of its host memory allowance, without requiring some outside agent continuously monitoring and adjusting the container limits. > Another important aspect comes from the lifetime semantics when a proper > reservations accounting and managing needs to handle mmap time rather > than than usual allocation path. While pages are allocated they do not > belong to anybody and only later at the #PF time (or read for the fs > backed mapping) the ownership is established. That makes it really hard > to manage memory as whole under the memcg anyway as a large part of > that pool sits without an ownership yet it cannot be used for any other > purpose. 
> > These and more reasons where behind the earlier decision o have a > dedicated hugetlb controller. Yeah, there is still a need for an actual hugetlb controller for the static use cases (and even for dynamic access to hugetlb_cma). But you need memcg coverage as well for the more dynamic cases to work as expected. And having that doesn't really interfere with the static usecases. > Also I will also Nack involving hugetlb pages being accounted by > default. This would break any setups which mix normal and hugetlb memory > with memcg limits applied. Yes, no disagreement there. I think we're all on the same page this needs to be opt-in, say with a cgroup mount option.
On Wed, Sep 27, 2023 at 02:47:38PM -0400, Johannes Weiner wrote: > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote: > > On Tue 26-09-23 12:49:47, Nhat Pham wrote: > > > Currently, hugetlb memory usage is not acounted for in the memory > > > controller, which could lead to memory overprotection for cgroups with > > > hugetlb-backed memory. This has been observed in our production system. > > > > > > This patch series rectifies this issue by charging the memcg when the > > > hugetlb folio is allocated, and uncharging when the folio is freed. In > > > addition, a new selftest is added to demonstrate and verify this new > > > behavior. > > > > The primary reason why hugetlb is living outside of memcg (and the core > > MM as well) is that it doesn't really fit the whole scheme. In several > > aspects. First and the foremost it is an independently managed resource > > with its own pool management, use and lifetime. > > Honestly, the simpler explanation is that few people have used hugetlb > in regular, containerized non-HPC workloads. > > Hugetlb has historically been much more special, and it retains a > specialness that warrants e.g. the hugetlb cgroup container. But it > has also made strides with hugetlb_cma, migratability, madvise support > etc. that allows much more on-demand use. It's no longer the case that > you just put a static pool of memory aside during boot and only a few > blessed applications are using it. > > For example, we're using hugetlb_cma very broadly with generic > containers. The CMA region is fully usable by movable non-huge stuff > until huge pages are allocated in it. With the hugetlb controller you > can define a maximum number of hugetlb pages that can be used per > container. But what if that container isn't using any? Why shouldn't > it be allowed to use its overall memory allowance for anon and cache > instead? Cool, I remember proposing hugetlb memcg stats several years ago, and if I remember correctly, at that time you were opposing it based on the idea that huge pages are not a part of the overall memcg flow: they are not subject to memory pressure, can't be evicted, etc. And thp's were seen as a long-term replacement. Even though all of the above is true, hugetlb has its niche, and I don't think thp's will realistically replace it any time soon. So I'm glad to see this effort (and am very supportive of it) to make hugetlb more convenient and transparent for the end user. Thanks!
On Wed, Sep 27, 2023 at 4:21 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 26-09-23 12:49:47, Nhat Pham wrote: > > Currently, hugetlb memory usage is not acounted for in the memory > > controller, which could lead to memory overprotection for cgroups with > > hugetlb-backed memory. This has been observed in our production system. > > > > This patch series rectifies this issue by charging the memcg when the > > hugetlb folio is allocated, and uncharging when the folio is freed. In > > addition, a new selftest is added to demonstrate and verify this new > > behavior. > > The primary reason why hugetlb is living outside of memcg (and the core > MM as well) is that it doesn't really fit the whole scheme. In several > aspects. First and the foremost it is an independently managed resource > with its own pool management, use and lifetime. > > There is no notion of memory reclaim and this makes a huge difference > for the pool that might consume considerable amount of memory. While > this is the case for many kernel allocations as well they usually do not > consume considerable portions of the accounted memory. This makes it > really tricky to handle limit enforcement gracefully. > > Another important aspect comes from the lifetime semantics when a proper > reservations accounting and managing needs to handle mmap time rather > than than usual allocation path. While pages are allocated they do not > belong to anybody and only later at the #PF time (or read for the fs > backed mapping) the ownership is established. That makes it really hard > to manage memory as whole under the memcg anyway as a large part of > that pool sits without an ownership yet it cannot be used for any other > purpose. > > These and more reasons where behind the earlier decision o have a > dedicated hugetlb controller. While I believe all of these are true, I think they are not reasons not to have memcg accounting. As everyone has pointed out, memcg accounting by itself cannot handle all situations - it is not a fix-all. Other mechanisms, such as the HugeTLB controller, could be the better solution in these cases, and hugetlb memcg accounting is definitely not an attempt to infringe upon these control domains. However, memcg accounting is still necessary for certain memory limits enforcement to work cleanly and properly - such as the use cases we have (as Johannes has beautifully described). It will certainly help administrators simplify their control workflow a lot (assuming we do not surprise them with this change - a new mount option to opt-in should help with the transition). > > Also I will also Nack involving hugetlb pages being accounted by > default. This would break any setups which mix normal and hugetlb memory > with memcg limits applied. Got it! I'll introduce some opt-in mechanisms in the next version. This is my oversight. > -- > Michal Hocko > SUSE Labs
On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote: > > Currently, hugetlb memory usage is not acounted for in the memory > controller, which could lead to memory overprotection for cgroups with > hugetlb-backed memory. This has been observed in our production system. > > This patch series rectifies this issue by charging the memcg when the > hugetlb folio is allocated, and uncharging when the folio is freed. In > addition, a new selftest is added to demonstrate and verify this new > behavior. > > Nhat Pham (2): > hugetlb: memcg: account hugetlb-backed memory in memory controller > selftests: add a selftest to verify hugetlb usage in memcg > > MAINTAINERS | 2 + > fs/hugetlbfs/inode.c | 2 +- > include/linux/hugetlb.h | 6 +- > include/linux/memcontrol.h | 8 + > mm/hugetlb.c | 23 +- > mm/memcontrol.c | 40 ++++ > tools/testing/selftests/cgroup/.gitignore | 1 + > tools/testing/selftests/cgroup/Makefile | 2 + > .../selftests/cgroup/test_hugetlb_memcg.c | 222 ++++++++++++++++++ > 9 files changed, 297 insertions(+), 9 deletions(-) > create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c > > -- > 2.34.1 Thanks for all the comments and suggestions everyone! FYI, I have sent out a second version of the patch series with the new mount flag: https://lore.kernel.org/lkml/20230928005723.1709119-1-nphamcs@gmail.com/T/#t Feel free to check it out and discuss it over there too!
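For illustration only, opting in at mount time could look like the sketch below. The option name memory_hugetlb_accounting is taken from the later, upstream version of this work and is an assumption here; it is not defined anywhere in this v1 thread.

/*
 * Sketch of enabling the opt-in behavior when mounting cgroup2.
 * The "memory_hugetlb_accounting" option name is an assumption based
 * on the follow-up series; an already-mounted hierarchy would need a
 * remount instead.
 */
#include <sys/mount.h>
#include <stdio.h>

int main(void)
{
	if (mount("none", "/sys/fs/cgroup", "cgroup2", 0,
		  "memory_hugetlb_accounting")) {
		perror("mount");
		return 1;
	}
	return 0;
}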
On Wed, Sep 27, 2023 at 02:37:47PM -0700, Roman Gushchin wrote: > On Wed, Sep 27, 2023 at 02:47:38PM -0400, Johannes Weiner wrote: > > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote: > > > On Tue 26-09-23 12:49:47, Nhat Pham wrote: > > > > Currently, hugetlb memory usage is not acounted for in the memory > > > > controller, which could lead to memory overprotection for cgroups with > > > > hugetlb-backed memory. This has been observed in our production system. > > > > > > > > This patch series rectifies this issue by charging the memcg when the > > > > hugetlb folio is allocated, and uncharging when the folio is freed. In > > > > addition, a new selftest is added to demonstrate and verify this new > > > > behavior. > > > > > > The primary reason why hugetlb is living outside of memcg (and the core > > > MM as well) is that it doesn't really fit the whole scheme. In several > > > aspects. First and the foremost it is an independently managed resource > > > with its own pool management, use and lifetime. > > > > Honestly, the simpler explanation is that few people have used hugetlb > > in regular, containerized non-HPC workloads. > > > > Hugetlb has historically been much more special, and it retains a > > specialness that warrants e.g. the hugetlb cgroup container. But it > > has also made strides with hugetlb_cma, migratability, madvise support > > etc. that allows much more on-demand use. It's no longer the case that > > you just put a static pool of memory aside during boot and only a few > > blessed applications are using it. > > > > For example, we're using hugetlb_cma very broadly with generic > > containers. The CMA region is fully usable by movable non-huge stuff > > until huge pages are allocated in it. With the hugetlb controller you > > can define a maximum number of hugetlb pages that can be used per > > container. But what if that container isn't using any? Why shouldn't > > it be allowed to use its overall memory allowance for anon and cache > > instead? > > Cool, I remember proposing hugetlb memcg stats several years ago and if > I remember correctly at that time you was opposing it based on the idea > that huge pages are not a part of the overall memcg flow: they are not > a subject for memory pressure, can't be evicted, etc. And thp's were seen > as a long-term replacement. Even though all above it's true, hugetlb has > it's niche and I don't think thp's will realistically replace it any time > soon. Yeah, Michal's arguments very much reminded me of my stance then. I stand corrected. I'm still hopeful that we can make 2M work transparently. I'd expect 1G to remain in the hugetlb domain for some time to come, but even those are mostly dynamic now with your hugetlb_cma feature! > So I'm glad to see this effort (and very supportive) on making hugetlb > more convenient and transparent for an end user. Thanks!
On 09/27/23 14:47, Johannes Weiner wrote: > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote: > > On Tue 26-09-23 12:49:47, Nhat Pham wrote: > > So that if you use 80% hugetlb, the other memory is forced to stay in > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the > group is still allowed to use the full 100% of its host memory > allowance, without requiring some outside agent continuously > monitoring and adjusting the container limits. Jumping in late here as I was traveling last week. In addition, I want to state my limited cgroup knowledge up front. I was thinking of your scenario above a little differently. Suppose a group is up and running at almost 100% memory usage. However, the majority of that memory is reclaimable. Now, someone wants to allocate a 2M hugetlb page. There is not 2MB free, but we could easily reclaim 2MB to make room for the hugetlb page. I may be missing something, but I do not see how that is going to happen. It seems like we would really want that behavior.
On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote: > On 09/27/23 14:47, Johannes Weiner wrote: > > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote: > > > On Tue 26-09-23 12:49:47, Nhat Pham wrote: > > > > So that if you use 80% hugetlb, the other memory is forced to stay in > > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the > > group is still allowed to use the full 100% of its host memory > > allowance, without requiring some outside agent continuously > > monitoring and adjusting the container limits. > > Jumping in late here as I was traveling last week. In addition, I want > to state my limited cgroup knowledge up front. > > I was thinking of your scenario above a little differently. Suppose a > group is up and running at almost 100% memory usage. However, the majority > of that memory is reclaimable. Now, someone wants to allocate a 2M hugetlb > page. There is not 2MB free, but we could easily reclaim 2MB to make room > for the hugetlb page. I may be missing something, but I do not see how that > is going to happen. It seems like we would really want that behavior.

But that is actually what it does, no?

  alloc_hugetlb_folio
    mem_cgroup_hugetlb_charge_folio
      charge_memcg
        try_charge
          !page_counter_try_charge ?
          !try_to_free_mem_cgroup_pages ?
          mem_cgroup_oom

So it does reclaim when the hugetlb hits the cgroup limit. And if that fails to make room, it OOMs the cgroup.

Or maybe I'm missing something?
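A sketch in the spirit of test_hugetlb_memcg.c from this series (cgroup path is illustrative, a 2M default huge page size is assumed, and the series must be applied): it shows the charge landing in memory.current when the hugetlb page is faulted in, which is exactly the point where the reclaim in the call chain above can kick in.

/*
 * Sketch: observe the memcg charge for a faulted hugetlb page by
 * reading memory.current of the cgroup this process runs in.
 * Path and sizes are illustrative.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define SZ_2M	(2UL << 20)

static long long memory_current(void)
{
	long long val = -1;
	FILE *f = fopen("/sys/fs/cgroup/test/memory.current", "r");

	if (f) {
		fscanf(f, "%lld", &val);
		fclose(f);
	}
	return val;
}

int main(void)
{
	long long before, after;
	char *p = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	before = memory_current();
	p[0] = 1;	/* fault: allocate and charge the hugetlb folio */
	after = memory_current();

	/* if the charge pushed the group to memory.max, reclaim ran first */
	printf("memory.current grew by %lld bytes\n", after - before);

	munmap(p, SZ_2M);
	return 0;
}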
On Mon 02-10-23 10:42:50, Johannes Weiner wrote: > On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote: > > On 09/27/23 14:47, Johannes Weiner wrote: > > > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote: > > > > On Tue 26-09-23 12:49:47, Nhat Pham wrote: > > > > > > So that if you use 80% hugetlb, the other memory is forced to stay in > > > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the > > > group is still allowed to use the full 100% of its host memory > > > allowance, without requiring some outside agent continuously > > > monitoring and adjusting the container limits. > > > > Jumping in late here as I was traveling last week. In addition, I want > > to state my limited cgroup knowledge up front. > > > > I was thinking of your scenario above a little differently. Suppose a > > group is up and running at almost 100% memory usage. However, the majority > > of that memory is reclaimable. Now, someone wants to allocate a 2M hugetlb > > page. There is not 2MB free, but we could easily reclaim 2MB to make room > > for the hugetlb page. I may be missing something, but I do not see how that > > is going to happen. It seems like we would really want that behavior. > > But that is actually what it does, no? > > alloc_hugetlb_folio > mem_cgroup_hugetlb_charge_folio > charge_memcg > try_charge > !page_counter_try_charge ? > !try_to_free_mem_cgroup_pages ? > mem_cgroup_oom > > So it does reclaim when the hugetlb hits the cgroup limit. And if that > fails to make room, it OOMs the cgroup. > > Or maybe I'm missing something? I believe that Mike alludes to what I have pointed out in another email: http://lkml.kernel.org/r/ZRrI90KcRBwVZn/r@dhcp22.suse.cz - namely a situation where the hugetlb request results in an actual hugetlb allocation rather than consumption from the pre-allocated pool. In that case memcg is not involved, because the charge happens only after the allocation happens. That, btw., means that this request could disrupt a different memcg even if the current one is at its limit or could be reclaimed instead. Also, there is no OOM, as hugetlb pages are costly requests and we do not invoke the oom killer.
On Mon, Oct 02, 2023 at 04:58:21PM +0200, Michal Hocko wrote: > Also, there is no OOM, as hugetlb pages are costly requests and we do not > invoke the oom killer. Ah, good point. That seems like a policy choice we could make. However, since hugetlb users are already set up for and have come to expect SIGBUS for physical allocation failures as well as for hugetlb_cgroup limits, we should have memcg follow the established precedent and leave the OOM killer out. Agree that a sentence in the changelog about this makes sense, though.