Message ID | alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm, memcg: provide a stat to describe reclaimable memory |
On Tue, 14 Jul 2020, David Rientjes wrote:

> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1314,6 +1314,18 @@ PAGE_SIZE multiple when read back.
>  		Part of "slab" that cannot be reclaimed on memory
>  		pressure.
>
> +	  avail
> +		An estimate of how much memory can be made available for
> +		starting new applications, similar to MemAvailable from
> +		/proc/meminfo (Documentation/filesystems/proc.rst).
> +
> +		This is derived by assuming that half of page cache and
> +		reclaimable slab can be uncharged without significantly
> +		impacting the workload, similar to MemAvailable.  It also
> +		factors in the amount of lazy freeable memory (MADV_FREE) and
> +		compound pages that can be split and uncharged under memory
> +		pressure.
> +
>  	  pgfault
>  		Total number of page faults incurred
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1350,6 +1350,35 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
>  	return false;
>  }
>
> +/*
> + * Returns an estimate of the amount of available memory that can be reclaimed
> + * for a memcg, in pages.
> + */
> +static unsigned long mem_cgroup_avail(struct mem_cgroup *memcg)
> +{
> +	long deferred, lazyfree;
> +
> +	/*
> +	 * Deferred pages are charged anonymous pages that are on the LRU but
> +	 * are unmapped.  These compound pages are split under memory pressure.
> +	 */
> +	deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
> +			       memcg_page_state(memcg, NR_INACTIVE_ANON) -
> +			       memcg_page_state(memcg, NR_ANON_MAPPED), 0);
> +	/*
> +	 * Lazyfree pages are charged clean anonymous pages that are on the file
> +	 * LRU and can be reclaimed under memory pressure.
> +	 */
> +	lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
> +			       memcg_page_state(memcg, NR_INACTIVE_FILE) -
> +			       memcg_page_state(memcg, NR_FILE_PAGES), 0);
> +
> +	/* Using same heuristic as si_mem_available() */
> +	return (unsigned long)deferred + (unsigned long)lazyfree +
> +	       (memcg_page_state(memcg, NR_FILE_PAGES) +
> +		memcg_page_state(memcg, NR_SLAB_RECLAIMABLE)) / 2;
> +}
> +
>  static char *memory_stat_format(struct mem_cgroup *memcg)
>  {
>  	struct seq_buf s;
> @@ -1417,6 +1446,12 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
>  		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
>  		       PAGE_SIZE);
> +	/*
> +	 * All values in this buffer are read individually, no implied
> +	 * consistency amongst them.
> +	 */
> +	seq_buf_printf(&s, "avail %llu\n",
> +		       (u64)mem_cgroup_avail(memcg) * PAGE_SIZE);
>
>  	/* Accumulated memory events */
>

An alternative to this would also be to change from an "available" metric to an "anon_reclaimable" metric since both the deferred split queues and lazy freeable memory would pertain to anon. This would no longer attempt to mimic MemAvailable and leave any such calculation to userspace (anon_reclaimable + (file + slab_reclaimable) / 2).

With this route, care would need to be taken to clearly indicate that anon_reclaimable is not necessarily a subset of the "anon" metric since reclaimable memory from compound pages on deferred split queues is not mapped, so it doesn't show up in NR_ANON_MAPPED.

I'm indifferent to either approach and would be happy to switch to anon_reclaimable if others agree and don't foresee any extensibility issues.
Hi David,

I'm somewhat against adding more metrics which try to approximate availability of memory when we already know it not to generally manifest very well in practice, especially since this *is* calculable by userspace (albeit with some knowledge of mm internals). Users and applications often vastly overestimate the reliability of these metrics, especially since they heavily depend on transient page states and whatever reclaim efficacy happens to be achieved at the time there is demand.

What do you intend to do with these metrics and how do you envisage other users should use them? Is it not possible to rework the strategy to use pressure information and/or workingset pressurisation instead?

Thanks,

Chris
Hi David,

David Rientjes writes:
>With the proposed anon_reclaimable, do you have any reliability concerns?
>This would be the amount of lazy freeable memory and memory that can be
>uncharged if compound pages from the deferred split queue are split under
>memory pressure. It seems to be a very precise value (as slab_reclaimable
>already in memory.stat is), so I'm not sure why there is a reliability
>concern. Maybe you can elaborate?

Ability to reclaim a page is largely about context at the time of reclaim. For example, if you are running at the edge of swap, a metric that truly describes "reclaimable memory" will contain vastly different numbers from one second to the next as cluster and page availability increases and decreases. We may also have to do things like look for youngness at reclaim time, so I'm not convinced metrics like this make sense in the general case.

>Today, this information is indeed possible to calculate from userspace.
>The idea is to present this information that will be backwards compatible,
>however, as the kernel implementation changes. When lazy freeable memory
>was added, for instance, userspace likely would not have preemptively been
>doing an "active_file + inactive_file - file" calculation to factor that
>in as reclaimable anon :)

I agree it's hard to calculate from userspace without assistance, but I also generally think exposing a highly nuanced and situational value to userspace is a recipe for confusion. The user either knows mm internals and can understand it, or doesn't and will probably only misunderstand it. There is a non-zero cognitive cost to adding more metrics like this, which is why I'm interested in knowing more about the userspace usage semantics intended :-)

>The example I gave earlier in the thread showed how dramatically different
>memory.current is before and after the introduction of deferred split
>queues. Userspace sees ballooning memcg usage and alerts on it (suspects
>a memory leak, for example) when in reality this is purely reclaimable
>memory under pressure and is the result of a kernel implementation detail.

Again, I'm curious why this can't be solved by artificial workingset pressurisation and monitoring. Generally, the most reliable reclaim metrics come from operating reclaim itself.
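
As an illustration of the "pressure information" alternative raised above, here is a minimal sketch of parsing one record from a cgroup v2 memory.pressure file, which reports lines of the form "some avg10=... avg60=... avg300=... total=...". The struct and function names (psi_line, parse_psi_line) are hypothetical and chosen only for this example; how a monitor would act on the values is left open.

```c
#include <stdio.h>
#include <string.h>

/* One parsed record of a pressure file (hypothetical helper type). */
struct psi_line {
	double avg10, avg60, avg300;	/* percent of time stalled */
	unsigned long long total_us;	/* cumulative stall time in microseconds */
};

/*
 * Parse a single "<kind> avg10=... avg60=... avg300=... total=..." record,
 * where kind is "some" or "full".
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_psi_line(const char *line, const char *kind,
			  struct psi_line *out)
{
	size_t n = strlen(kind);

	/* The record must start with the requested kind followed by a space. */
	if (strncmp(line, kind, n) != 0 || line[n] != ' ')
		return -1;
	if (sscanf(line + n, " avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us) != 4)
		return -1;
	return 0;
}
```

A caller would read memory.pressure from the memcg directory line by line and pass each line to parse_psi_line() with "some" or "full".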
On Fri, 17 Jul 2020, Chris Down wrote:

> > With the proposed anon_reclaimable, do you have any reliability concerns?
> > This would be the amount of lazy freeable memory and memory that can be
> > uncharged if compound pages from the deferred split queue are split under
> > memory pressure. It seems to be a very precise value (as slab_reclaimable
> > already in memory.stat is), so I'm not sure why there is a reliability
> > concern. Maybe you can elaborate?
>
> Ability to reclaim a page is largely about context at the time of reclaim. For
> example, if you are running at the edge of swap, a metric that truly
> describes "reclaimable memory" will contain vastly different numbers from one
> second to the next as cluster and page availability increases and decreases.
> We may also have to do things like look for youngness at reclaim time, so I'm
> not convinced metrics like this make sense in the general case.

...

> Again, I'm curious why this can't be solved by artificial workingset
> pressurisation and monitoring. Generally, the most reliable reclaim metrics
> come from operating reclaim itself.
>

Perhaps this is best discussed in the context I gave in the earlier thread: imagine a thp-backed heap of 64MB and then a malloc implementation doing MADV_DONTNEED over all but one page in every one of these pageblocks.

On a 4.3 kernel, for example, memory.current for the heap segment is now (64MB / 2MB) * 4KB = 128KB because we have synchronous splitting and uncharging of the underlying hugepage. On a 4.15 kernel, for example, memory.current is still 64MB because the underlying hugepages are still charged to the memcg due to deferred split queues.

For any application that monitors this, pressurization is not going to help: the memory will be reclaimed under memcg pressure but we aren't facing that pressure yet. Userspace could identify this as a memory leak unless we describe what anon memory is actually reclaimable in this context (including on systems without swap).

For any entity that uses this information to infer if new work can be scheduled in this memcg (the reason MemAvailable exists in /proc/meminfo at the system level), this is now dramatically skewed. At worst, on a swapless system, this memory is seen from userspace as unreclaimable because it's charged anon.

Do you have other suggestions for how userspace can understand what anon is reclaimable in this context before encountering memory pressure? If so, it may be a great alternative to this: I haven't been able to think of such a way other than an anon_reclaimable stat.
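
The memory.current arithmetic in the example above can be checked with a small sketch. It assumes a heap that is fully backed by huge pages and aligned to the huge page size; the function names (charged_after_sync_split, charged_with_deferred_split) are hypothetical and only model the two behaviours described, not any kernel interface.

```c
#define KB 1024ULL
#define MB (1024ULL * KB)

/*
 * Bytes still charged when every huge page is split synchronously and only
 * pages_kept base pages per huge page remain in use (the 4.3-style case).
 */
static unsigned long long charged_after_sync_split(unsigned long long heap,
						   unsigned long long huge,
						   unsigned long long base,
						   unsigned long long pages_kept)
{
	return (heap / huge) * pages_kept * base;
}

/*
 * With deferred split queues the whole huge page stays charged until it is
 * split under memory pressure, so the charge is the full heap size
 * (the 4.15-style case).
 */
static unsigned long long charged_with_deferred_split(unsigned long long heap)
{
	return heap;
}
```

For a 64MB heap, 2MB huge pages, 4KB base pages and one page kept per huge page, the first function yields 128KB and the second 64MB, matching the figures in the message.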
On Fri 17-07-20 12:37:57, David Rientjes wrote:
[...]
> On a 4.3 kernel, for example, memory.current for the heap segment is now
> (64MB / 2MB) * 4KB = 128KB because we have synchronous splitting and
> uncharging of the underlying hugepage. On a 4.15 kernel, for example,
> memory.current is still 64MB because the underlying hugepages are still
> charged to the memcg due to deferred split queues.

Deferred THP split should be a kernel internal implementation optimization and a detail that userspace shouldn't really be worrying about. If there are user visible effects that are standing in the way then we should reconsider how much the optimization is worth. I do not really remember any actual numbers that would strongly justify its existence while I do remember several problems that this has introduced.

So I am really wondering whether exporting subtle metrics to the userspace which can lead to confusion is the right approach to the problem you have at hand.

Also could you be more specific about the numbers we are talking about here? E.g. what is the overall percentage of the "mis-accounted" split THPs wrt. the high/max limit? Is the userspace relying on very precise numbers?
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1314,6 +1314,18 @@ PAGE_SIZE multiple when read back.
 		Part of "slab" that cannot be reclaimed on memory
 		pressure.
 
+	  avail
+		An estimate of how much memory can be made available for
+		starting new applications, similar to MemAvailable from
+		/proc/meminfo (Documentation/filesystems/proc.rst).
+
+		This is derived by assuming that half of page cache and
+		reclaimable slab can be uncharged without significantly
+		impacting the workload, similar to MemAvailable.  It also
+		factors in the amount of lazy freeable memory (MADV_FREE) and
+		compound pages that can be split and uncharged under memory
+		pressure.
+
 	  pgfault
 		Total number of page faults incurred
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1350,6 +1350,35 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
 	return false;
 }
 
+/*
+ * Returns an estimate of the amount of available memory that can be reclaimed
+ * for a memcg, in pages.
+ */
+static unsigned long mem_cgroup_avail(struct mem_cgroup *memcg)
+{
+	long deferred, lazyfree;
+
+	/*
+	 * Deferred pages are charged anonymous pages that are on the LRU but
+	 * are unmapped.  These compound pages are split under memory pressure.
+	 */
+	deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
+			       memcg_page_state(memcg, NR_INACTIVE_ANON) -
+			       memcg_page_state(memcg, NR_ANON_MAPPED), 0);
+	/*
+	 * Lazyfree pages are charged clean anonymous pages that are on the file
+	 * LRU and can be reclaimed under memory pressure.
+	 */
+	lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
+			       memcg_page_state(memcg, NR_INACTIVE_FILE) -
+			       memcg_page_state(memcg, NR_FILE_PAGES), 0);
+
+	/* Using same heuristic as si_mem_available() */
+	return (unsigned long)deferred + (unsigned long)lazyfree +
+	       (memcg_page_state(memcg, NR_FILE_PAGES) +
+		memcg_page_state(memcg, NR_SLAB_RECLAIMABLE)) / 2;
+}
+
 static char *memory_stat_format(struct mem_cgroup *memcg)
 {
 	struct seq_buf s;
@@ -1417,6 +1446,12 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
 		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
 		       PAGE_SIZE);
+	/*
+	 * All values in this buffer are read individually, no implied
+	 * consistency amongst them.
+	 */
+	seq_buf_printf(&s, "avail %llu\n",
+		       (u64)mem_cgroup_avail(memcg) * PAGE_SIZE);
 
 	/* Accumulated memory events */
 
MemAvailable in /proc/meminfo provides some guidance on the amount of memory that can be made available for starting new applications (see Documentation/filesystems/proc.rst).

Userspace can lack insight into the amount of memory that can be reclaimed from a memcg based on values from memory.stat, however. Two specific examples:

 - Lazy freeable memory (MADV_FREE), i.e. clean anonymous pages on the inactive file LRU that can be quickly reclaimed under memory pressure but otherwise show up as mapped anon in memory.stat, and

 - Memory on deferred split queues (thp), i.e. compound pages that can be split and uncharged from the memcg under memory pressure, but otherwise show up as charged anon LRU memory in memory.stat.

Userspace can currently derive this information and use the same heuristic as MemAvailable by doing this:

	deferred = (active_anon + inactive_anon) - anon
	lazyfree = (active_file + inactive_file) - file

	avail = deferred + lazyfree + (file + slab_reclaimable) / 2

But this depends on implementation details for how this memory is handled in the kernel for the purposes of reclaim (anon on inactive file LRU or unmapped anon on the LRU).

For the purposes of writing portable userspace code that does not need to have insight into the kernel implementation for reclaimable memory, this exports a metric that can provide an estimate of the amount of memory that can be reclaimed and uncharged from the memcg to start new applications.

As the kernel implementation evolves for memory that can be reclaimed under memory pressure, this metric can be kept consistent.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 12 +++++++++
 mm/memcontrol.c                         | 35 +++++++++++++++++++++++++
 2 files changed, 47 insertions(+)
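
The userspace calculation described in the changelog could be sketched as follows. This is only an illustration under stated assumptions: the input is the text of a cgroup v2 memory.stat file with "name value" lines in bytes, and the helper names (memcg_stat, parse_memory_stat, memcg_avail_estimate) are hypothetical, not part of the patch. The subtraction is clamped at zero to mirror the max_t(long, ..., 0) in the kernel-side version.

```c
#include <stdio.h>
#include <string.h>

/* Fields of memory.stat used by the heuristic, in bytes (hypothetical type). */
struct memcg_stat {
	unsigned long long anon, file;
	unsigned long long active_anon, inactive_anon;
	unsigned long long active_file, inactive_file;
	unsigned long long slab_reclaimable;
};

/* Parse "name value" pairs from memory.stat text; unknown names are ignored. */
static void parse_memory_stat(const char *text, struct memcg_stat *st)
{
	char key[64];
	unsigned long long val;
	int used;

	memset(st, 0, sizeof(*st));
	while (sscanf(text, "%63s %llu%n", key, &val, &used) == 2) {
		if (!strcmp(key, "anon"))
			st->anon = val;
		else if (!strcmp(key, "file"))
			st->file = val;
		else if (!strcmp(key, "active_anon"))
			st->active_anon = val;
		else if (!strcmp(key, "inactive_anon"))
			st->inactive_anon = val;
		else if (!strcmp(key, "active_file"))
			st->active_file = val;
		else if (!strcmp(key, "inactive_file"))
			st->inactive_file = val;
		else if (!strcmp(key, "slab_reclaimable"))
			st->slab_reclaimable = val;
		text += used;	/* advance past the pair just consumed */
	}
}

/* max(a - b, 0) without unsigned wraparound. */
static unsigned long long sub_floor0(unsigned long long a, unsigned long long b)
{
	return a > b ? a - b : 0;
}

/* avail = deferred + lazyfree + (file + slab_reclaimable) / 2 */
static unsigned long long memcg_avail_estimate(const struct memcg_stat *st)
{
	unsigned long long deferred =
		sub_floor0(st->active_anon + st->inactive_anon, st->anon);
	unsigned long long lazyfree =
		sub_floor0(st->active_file + st->inactive_file, st->file);

	return deferred + lazyfree + (st->file + st->slab_reclaimable) / 2;
}
```

A monitor would read the memcg's memory.stat into a buffer, call parse_memory_stat() on it and then memcg_avail_estimate(). As the changelog notes, this depends on how the kernel currently accounts lazy freeable and deferred split memory, which is the portability concern the proposed stat is meant to address.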