| Message ID | 1648713656-24254-1-git-send-email-zhaoyang.huang@unisoc.com (mailing list archive) |
| --- | --- |
| State | New |
| Series | [RFC] cgroup: introduce dynamic protection for memcg |
On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> For some kinds of memcgs, usage varies greatly across scenarios. For
> example, a multimedia app's usage can range from 50MB to 500MB, generated
> by loading a special algorithm into its virtual address space, which makes
> it hard to protect the expanded usage without userspace interaction.

Do I get it correctly that the concern you have is that you do not know
how much memory your workload will need because that depends on some
parameters?

> Furthermore, a fixed memory.low is a little bit against its role of soft
> protection, as it responds to any system memory pressure in the same way.

Could you be more specific about this as well?

> Taking all of the above into consideration, this patch introduces a kind
> of dynamic protection based on the group's watermark and the system's
> memory pressure. Our aims are:
> 1. dynamic protection with no fixed setting
> 2. a proper protection value on memory.current

How do you define the "proper protection"?

> 3. time-based decay of the protection
> 4. memory-pressure-related protection
>
> The basic concept can be described as below, where we take
> group->watermark as a representative of usage:
>
>     group->memory.low = decayed_watermark * decay_factor
>     decayed_watermark = group->watermark * func_wm_decay(time)
>     decay_factor = psi_system[PSI_MEM][time]
>
> func_wm_decay can be seen as a linear decay function that decays by 1/2 in
> 68s (36 bits). If we take 2048 as "1", it can be described as:
>
>     decayed_watermark = time >> (group->wm_dec_factor - 10)
>     decayed_watermark = new_usage (if new_usage > decayed_watermark)
>
> decay_factor is a simple table lookup that composes the final value from a
> weighted sum of some and full:
>
>     some = psi_system.avg[PSI_MEM * 2][time]
>     full = psi_system.avg[PSI_MEM * 2 + 1][time]
>     decay_factor = some * 70% + full * 30%

One very important thing that I am missing here is the overall objective of
this tuning. From the above it seems that you want to (ab)use memory->low to
protect some portion of the charged memory and that the protection shrinks
over time depending on the global PSI metrics and time. But why is this a
good thing?

Also you seem to base your back-off on the global PSI numbers. How does that
cope with memcg memory pressure?

> We simply tested the above change on a v5.4 based system with the topology
> below and observed the behaviors we expected:
>
>       A
>      / \
>     B   C
>
> 1. With regard to the protection, elow stays in a proper range as a
>    proportion of the watermark.
> 2. Elapsed time has a positive impact on elow via decayed_watermark.
> 3. Memory pressure has a negative impact on elow, which keeps more usage
>    protected when the system is under less pressure.

I am sorry but this doesn't really help much. As pointed out above, I
completely fail to see what the expected behavior is and why it makes sense.

> PS: It should be configured as a sub-type of memcg and chosen by the user
> when creating the group.

I do not understand, but I suspect that a non-RFC proposal would use a
dedicated user interface. Your current implementation is surely not
acceptable as it changes the semantics of a limit without any way to opt out
of that behavior. memory.low has a well-established semantic, so this cannot
really be changed without an explicit opt-in.
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  include/linux/memcontrol.h   | 50 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/page_counter.h |  4 ++++
>  include/linux/psi.h          |  2 ++
>  kernel/sched/psi.c           | 18 ++++++++++++++++
>  mm/memcontrol.c              |  4 ++++
>  mm/page_counter.c            |  4 ++++
>  6 files changed, 82 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c5c403..a510057 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,9 @@
>  #include <linux/vmstat.h>
>  #include <linux/writeback.h>
>  #include <linux/page-flags.h>
> +#include <linux/sched/loadavg.h>
> +#include <linux/sched/clock.h>
> +#include <linux/psi.h>
>
>  struct mem_cgroup;
>  struct obj_cgroup;
> @@ -28,6 +31,8 @@
>  struct mm_struct;
>  struct kmem_cache;
>
> +#define MEMCG_INTERVAL (2*HZ+1) /* 2 sec intervals */
> +
>  /* Cgroup-specific page state, on top of universal node page state */
>  enum memcg_stat_item {
>          MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
> @@ -340,6 +345,10 @@ struct mem_cgroup {
>          struct deferred_split deferred_split_queue;
>  #endif
>
> +        u64 wm_dec_fact;
> +        u64 avg_next_update;
> +        u64 avg_last_update;
> +
>          struct mem_cgroup_per_node *nodeinfo[];
>  };
>
> @@ -608,6 +617,47 @@ static inline bool mem_cgroup_disabled(void)
>          return !cgroup_subsys_enabled(memory_cgrp_subsys);
>  }
>
> +/*
> + * calculate memory.low based on the historic watermark and memory pressure
> + */
> +static inline void calc_protected_low(struct mem_cgroup *group)
> +{
> +        u64 now, decay_factor;
> +        u64 decayed_watermark;
> +        u64 delta_time;
> +
> +        now = sched_clock();
> +
> +        if (!group->avg_next_update) {
> +                group->avg_next_update = now + jiffies_to_nsecs(5*HZ);
> +                return;
> +        }
> +
> +        if (time_before((unsigned long)now, (unsigned long)group->avg_next_update))
> +                return;
> +
> +        delta_time = group->avg_last_update ? now - group->avg_last_update : 0;
> +        /*
> +         * we take 2048 as "1" and 68s decay 1/2(36bit) by default
> +         * decay_factor = 1024 * delta_time / 68s(0x1000000000)
> +         * 0.5(1024)/68s = decay_factor/delta_time ==> decay_factor = delta_time >> 26
> +         */
> +        decay_factor = (2048 - min(2048ULL, delta_time >> (group->wm_dec_fact - 10)));
> +        decayed_watermark = group->memory.decayed_watermark * decay_factor / 2048;
> +        /* decay_factor: based on average memory pressure over elapsed time */
> +        decay_factor = psi_mem_get(delta_time);
> +        group->memory.low = decayed_watermark * (100 - decay_factor) / 100;
> +
> +        /*
> +         * avg_next_update: expected expire time according to current status
> +         */
> +        group->memory.decayed_watermark = decayed_watermark;
> +        group->avg_last_update = now;
> +        group->avg_next_update = now + jiffies_to_nsecs(2*HZ);
> +
> +        return;
> +}
> +
>  static inline void mem_cgroup_protection(struct mem_cgroup *root,
>                                           struct mem_cgroup *memcg,
>                                           unsigned long *min,
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 6795913..2720eb9f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -25,8 +25,12 @@ struct page_counter {
>
>          /* legacy */
>          unsigned long watermark;
> +        unsigned long decayed_watermark;
>          unsigned long failcnt;
>
> +        /* proportional protection */
> +        unsigned long min_prop;
> +        unsigned long low_prop;
>          /*
>           * 'parent' is placed here to be far from 'usage' to reduce
>           * cache false sharing, as 'usage' is written mostly while
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index 65eb147..6c76993 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -25,6 +25,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>
>  int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
>
> +unsigned long psi_mem_get(unsigned long time);
> +
>  #ifdef CONFIG_CGROUPS
>  int psi_cgroup_alloc(struct cgroup *cgrp);
>  void psi_cgroup_free(struct cgroup *cgrp);
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index dd80bd2..8d315e0 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -291,6 +291,24 @@ static void get_recent_times(struct psi_group *group, int cpu,
>          }
>  }
>
> +unsigned long psi_mem_get(unsigned long time_ns)
> +{
> +        unsigned long time_sec = time_ns / (1000 * 1000 * 1000);
> +        unsigned long some, full;
> +        if (time_sec < 10) {
> +                some = LOAD_INT(psi_system.avg[PSI_MEM * 2][0]);
> +                full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][0]);
> +        } else if (time_sec < 60) {
> +                some = LOAD_INT(psi_system.avg[PSI_MEM * 2][1]);
> +                full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][1]);
> +        } else {
> +                some = LOAD_INT(psi_system.avg[PSI_MEM * 2][2]);
> +                full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][2]);
> +        }
> +
> +        return (some * 768 + full * 256) / 1024;
> +}
> +
>  static void calc_avgs(unsigned long avg[3], int missed_periods,
>                        u64 time, u64 period)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 508bcea..6b579a4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5188,6 +5188,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>          page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
>          memcg->soft_limit = PAGE_COUNTER_MAX;
>          page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
> +        memcg->wm_dec_fact = 36;
>          if (parent) {
>                  memcg->swappiness = mem_cgroup_swappiness(parent);
>                  memcg->oom_kill_disable = parent->oom_kill_disable;
> @@ -6616,6 +6617,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>  {
>          unsigned long usage, parent_usage;
>          struct mem_cgroup *parent;
> +        unsigned long memcg_emin, memcg_elow, parent_emin, parent_elow;
> +        unsigned long watermark;
>
>          if (mem_cgroup_disabled())
>                  return;
> @@ -6642,6 +6645,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>          if (!parent)
>                  return;
>
> +        calc_protected_low(memcg);
>          if (parent == root) {
>                  memcg->memory.emin = READ_ONCE(memcg->memory.min);
>                  memcg->memory.elow = READ_ONCE(memcg->memory.low);
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index 7d83641..18abfdd 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -83,6 +83,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>                   */
>                  if (new > READ_ONCE(c->watermark))
>                          WRITE_ONCE(c->watermark, new);
> +                if (new > READ_ONCE(c->decayed_watermark))
> +                        WRITE_ONCE(c->decayed_watermark, new);
>          }
>  }
>
> @@ -137,6 +139,8 @@ bool page_counter_try_charge(struct page_counter *counter,
>                   */
>                  if (new > READ_ONCE(c->watermark))
>                          WRITE_ONCE(c->watermark, new);
> +                if (new > READ_ONCE(c->decayed_watermark))
> +                        WRITE_ONCE(c->decayed_watermark, new);
>          }
>          return true;
>
> --
> 1.9.1
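As a reading aid, here is a standalone sketch of the fixed-point decay
arithmetic the commit message describes, assuming the patch's defaults (2048
as "1", wm_dec_fact = 36 so the watermark halves after roughly 68s). The
function names and the example numbers are illustrative, not taken from the
patch:

```c
/*
 * Illustration only, not part of the patch. 2048 represents "1.0";
 * wm_dec_fact = 36 means the watermark decays by half after 2^36 ns (~68s).
 */
#include <stdint.h>
#include <stdio.h>

#define FP_ONE 2048ULL

static uint64_t decay_watermark(uint64_t watermark, uint64_t delta_ns,
                                unsigned int wm_dec_fact)
{
        /* delta_ns >> (36 - 10) reaches 1024 (i.e. 0.5) after ~68s */
        uint64_t dec = delta_ns >> (wm_dec_fact - 10);

        if (dec > FP_ONE)
                dec = FP_ONE;
        return watermark * (FP_ONE - dec) / FP_ONE;
}

static uint64_t protected_low(uint64_t decayed_wm, unsigned int psi_pct)
{
        /* psi_pct: weighted some/full memory pressure, in percent */
        return decayed_wm * (100 - psi_pct) / 100;
}

int main(void)
{
        uint64_t wm = 500ULL << 20;     /* 500MB peak usage */
        uint64_t dt = 1ULL << 35;       /* ~34s elapsed */
        uint64_t dwm = decay_watermark(wm, dt, 36);

        /* ~34s decays the watermark to 75%; 10% PSI trims memory.low further */
        printf("low = %llu MB\n",
               (unsigned long long)(protected_low(dwm, 10) >> 20));
        return 0;
}
```

With delta_time of about 34s the watermark decays to 75% of its peak (375MB
in this example), and a 10% pressure factor brings the computed memory.low
down to roughly 337MB.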
On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-03-22 16:00:56, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > For some kinds of memcgs, usage varies greatly across scenarios. For
> > example, a multimedia app's usage can range from 50MB to 500MB, generated
> > by loading a special algorithm into its virtual address space, which
> > makes it hard to protect the expanded usage without userspace
> > interaction.
>
> Do I get it correctly that the concern you have is that you do not know
> how much memory your workload will need because that depends on some
> parameters?

Right. For example, a camera APP will expand its usage from 50MB to 500MB
because it launches a special function (face beauty etc. that needs a
special algorithm).

> > Furthermore, a fixed memory.low is a little bit against its role of soft
> > protection, as it responds to any system memory pressure in the same way.
>
> Could you be more specific about this as well?

As in the camera case above, if we set memory.low to 200MB to keep the APP
running smoothly, the system will experience high memory pressure when
another high-load APP is launched simultaneously. I would like to have the
camera be reclaimed under this scenario.

> > Taking all of the above into consideration, this patch introduces a kind
> > of dynamic protection based on the group's watermark and the system's
> > memory pressure. Our aims are:
> > 1. dynamic protection with no fixed setting
> > 2. a proper protection value on memory.current
>
> How do you define the "proper protection"?

I would like to share more test results in the next version.

> > [...]
>
> One very important thing that I am missing here is the overall objective
> of this tuning. From the above it seems that you want to (ab)use
> memory->low to protect some portion of the charged memory and that the
> protection shrinks over time depending on the global PSI metrics and time.
> But why is this a good thing?

'Good' means it meets my original goal of keeping the usage during a period
of time while responding to the system's memory pressure. For an
Android-like system, memory is almost always in a tight status no matter how
much RAM it has. What we need from memcg is more than control and grouping;
we need it to be more responsive to the system's load and to sacrifice its
usage under certain criteria.

> Also you seem to base your back-off on the global PSI numbers. How does
> that cope with memcg memory pressure?

For simplicity, only the global PSI is taken into consideration for now.
> > We simply tested the above change on a v5.4 based system with the
> > topology below and observed the behaviors we expected:
> > [...]
>
> I am sorry but this doesn't really help much. As pointed out above, I
> completely fail to see what the expected behavior is and why it makes
> sense.

I will test it under real scenarios and provide more intuitive results.

> > PS: It should be configured as a sub-type of memcg and chosen by the
> > user when creating the group.
>
> I do not understand, but I suspect that a non-RFC proposal would use a
> dedicated user interface. Your current implementation is surely not
> acceptable as it changes the semantics of a limit without any way to opt
> out of that behavior. memory.low has a well-established semantic, so this
> cannot really be changed without an explicit opt-in.

Opt-in changes will be added in the next version and would NOT affect the
current design.

> > [...]
On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> On Thu, Mar 31, 2022 at 5:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > Could you be more specific about this as well?
>
> As in the camera case above, if we set memory.low to 200MB to keep the APP
> running smoothly, the system will experience high memory pressure when
> another high-load APP is launched simultaneously. I would like to have the
> camera be reclaimed under this scenario.

OK, so you effectively want to keep the memory protection when there is
"normal" memory pressure but want to relax the protection in other high
memory utilization situations?

How do you exactly tell the difference between steady memory pressure (say
stream IO on the page cache) and a "high load APP launched"? Should you
reduce the protection in the stream IO situation as well?

[...]

> > One very important thing that I am missing here is the overall objective
> > of this tuning. From the above it seems that you want to (ab)use
> > memory->low to protect some portion of the charged memory and that the
> > protection shrinks over time depending on the global PSI metrics and
> > time. But why is this a good thing?
>
> 'Good' means it meets my original goal of keeping the usage during a
> period of time while responding to the system's memory pressure. For an
> Android-like system, memory is almost always in a tight status no matter
> how much RAM it has. What we need from memcg is more than control and
> grouping; we need it to be more responsive to the system's load and to
> sacrifice its usage under certain criteria.

Why are the existing tools/APIs insufficient for that? You can watch for
both global and memcg memory pressure including PSI metrics and update
limits dynamically. Why is it necessary to put such logic into the kernel?
On Thu, Mar 31, 2022 at 4:35 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-03-22 19:18:58, Zhaoyang Huang wrote:
> > [...]
>
> OK, so you effectively want to keep the memory protection when there is
> "normal" memory pressure but want to relax the protection in other high
> memory utilization situations?
>
> How do you exactly tell the difference between steady memory pressure (say
> stream IO on the page cache) and a "high load APP launched"? Should you
> reduce the protection in the stream IO situation as well?

IIUC what you are implementing here is a "memory allowance boost" feature,
and it seems you are implementing it entirely inside the kernel, while only
userspace knows when to apply this boost (say at app launch time). This does
not make sense to me.

> [...]
>
> Why are the existing tools/APIs insufficient for that? You can watch for
> both global and memcg memory pressure including PSI metrics and update
> limits dynamically. Why is it necessary to put such logic into the kernel?

I had exactly the same thought while reading through this.
In Android you would probably need to implement a userspace service which
would temporarily relax the memcg limits when required, monitor PSI levels
and adjust the limits accordingly.
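For concreteness, the userspace service sketched here could be as simple as
the loop below. This is a hedged illustration only: it assumes cgroup v2
mounted at /sys/fs/cgroup, and the group path and thresholds are made-up
examples, not any real Android policy:

```c
/*
 * Minimal sketch of the userspace alternative being discussed: watch the
 * global memory PSI averages and relax/tighten a memcg's memory.low.
 */
#include <stdio.h>
#include <unistd.h>

#define LOW_RELAXED (500UL << 20)   /* 500MB while pressure is low */
#define LOW_TIGHT   (50UL << 20)    /* 50MB once the system is loaded */

static double read_mem_some_avg10(void)
{
        double avg = 0.0;
        FILE *f = fopen("/proc/pressure/memory", "r");

        if (f) {
                /* first line: "some avg10=X.XX avg60=... avg300=... total=..." */
                fscanf(f, "some avg10=%lf", &avg);
                fclose(f);
        }
        return avg;
}

static void set_low(const char *cg, unsigned long bytes)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/memory.low", cg);
        f = fopen(path, "w");
        if (f) {
                fprintf(f, "%lu\n", bytes);
                fclose(f);
        }
}

int main(void)
{
        const char *cg = "/sys/fs/cgroup/app";  /* hypothetical group */

        for (;;) {
                set_low(cg, read_mem_some_avg10() > 10.0 ? LOW_TIGHT
                                                         : LOW_RELAXED);
                sleep(2);  /* the polling latency Zhaoyang objects to below */
        }
}
```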
On Thu, Mar 31, 2022 at 7:35 PM Michal Hocko <mhocko@suse.com> wrote:
>
> [...]
>
> How do you exactly tell the difference between steady memory pressure (say
> stream IO on the page cache) and a "high load APP launched"? Should you
> reduce the protection in the stream IO situation as well?

We can take either the system's io_wait or PSI_IO into consideration for
these.

> [...]
>
> Why are the existing tools/APIs insufficient for that? You can watch for
> both global and memcg memory pressure including PSI metrics and update
> limits dynamically. Why is it necessary to put such logic into the kernel?

A poll-then-react method in userspace requires a polling interval and a
response time. Take PSI as an example: it polls ten times during the polling
window while reporting only once, which introduces some latency.
On Fri, Apr 1, 2022 at 3:26 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Mar 31, 2022 at 4:35 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> >
> > How do you exactly tell the difference between steady memory pressure
> > (say stream IO on the page cache) and a "high load APP launched"? Should
> > you reduce the protection in the stream IO situation as well?
>
> IIUC what you are implementing here is a "memory allowance boost" feature,
> and it seems you are implementing it entirely inside the kernel, while
> only userspace knows when to apply this boost (say at app launch time).
> This does not make sense to me.

I am wondering if it could be more helpful to apply this patch to background
services (system_server etc.) rather than to APPs, since those services are
persistent in the system.

> > [...]
> >
> > Why are the existing tools/APIs insufficient for that? You can watch for
> > both global and memcg memory pressure including PSI metrics and update
> > limits dynamically. Why is it necessary to put such logic into the
> > kernel?
>
> I had exactly the same thought while reading through this.
> In Android you would probably need to implement a userspace service which
> would temporarily relax the memcg limits when required, monitor PSI levels
> and adjust the limits accordingly.

As in my response to Michal's comment, userspace monitors introduce latency.
Take LMKD as an example: it is actually driven by PSI_POLL_PERIOD_XXX_MS
after the first wakeup, which means PSI_WINDOW_SIZE_MS could be too big to
rely on. IMHO, with regard to response time, LMKD is less efficient than the
lmk driver but stronger on the strategy side. I would like to test this
patch under a real Android workload and give feedback in the next version.
On Thu, Mar 31, 2022 at 6:51 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Fri, Apr 1, 2022 at 3:26 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > [...]
> >
> > IIUC what you are implementing here is a "memory allowance boost"
> > feature, and it seems you are implementing it entirely inside the
> > kernel, while only userspace knows when to apply this boost (say at app
> > launch time). This does not make sense to me.
>
> I am wondering if it could be more helpful to apply this patch to
> background services (system_server etc.) rather than to APPs, since those
> services are persistent in the system.
>
> > [...]
> >
> > I had exactly the same thought while reading through this.
> > In Android you would probably need to implement a userspace service
> > which would temporarily relax the memcg limits when required, monitor
> > PSI levels and adjust the limits accordingly.
>
> As in my response to Michal's comment, userspace monitors introduce
> latency. Take LMKD as an example: it is actually driven by
> PSI_POLL_PERIOD_XXX_MS after the first wakeup, which means
> PSI_WINDOW_SIZE_MS could be too big to rely on. IMHO, with regard to
> response time, LMKD is less efficient than the lmk driver but stronger on
> the strategy side. I would like to test this patch under a real Android
> workload and give feedback in the next version.

LMKD is a reactive mechanism which does not know when memory pressure might
rise, therefore its response latency matters.
The usecases you mentioned seemed to imply that userspace was aware of the
increased memory demands of the process (app startup time, maybe the moment
the app becomes foreground, etc.). Therefore userspace could relax the
memory allowance before that memory is requested. Was my understanding
incorrect?
On Fri 01-04-22 09:34:02, Zhaoyang Huang wrote:
> On Thu, Mar 31, 2022 at 7:35 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> >
> > How do you exactly tell the difference between steady memory pressure
> > (say stream IO on the page cache) and a "high load APP launched"? Should
> > you reduce the protection in the stream IO situation as well?
>
> We can take either the system's io_wait or PSI_IO into consideration for
> these.

I do not follow. Let's say you have a stream IO workload which is mostly RO.
Reclaiming those pages effectively means dropping them from the cache, so
there is no IO involved during the reclaim. This will generate a constant
flow of reclaim that shouldn't normally affect other workloads (as long as
kswapd keeps up with the IO pace). How does your scheme cope with this
scenario? My understanding is that it will simply relax the protection.

> > [...]
> >
> > Why are the existing tools/APIs insufficient for that? You can watch for
> > both global and memcg memory pressure including PSI metrics and update
> > limits dynamically. Why is it necessary to put such logic into the
> > kernel?
>
> A poll-then-react method in userspace requires a polling interval and a
> response time. Take PSI as an example: it polls ten times during the
> polling window while reporting only once, which introduces some latency.

Do workload transitions happen so often in your situation that the interval
really matters? As Suren already pointed out, starting a new application is
usually an explicit event which can pro-actively update limits.
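As a side note on the latency question: PSI also offers event-driven
triggers (see Documentation/accounting/psi.rst) that avoid fixed polling
intervals, waking userspace via poll() when stall time exceeds a threshold
within a window. A minimal sketch, with arbitrary threshold/window values:

```c
/*
 * PSI trigger sketch: get woken by the kernel when "some" memory stall
 * exceeds 150ms within a 1s window, instead of polling averages.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        struct pollfd fds;
        const char *trig = "some 150000 1000000";

        fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        if (fds.fd < 0)
                return 1;
        /* register the trigger; the string includes the terminating NUL */
        if (write(fds.fd, trig, strlen(trig) + 1) < 0)
                return 1;

        fds.events = POLLPRI;
        while (poll(&fds, 1, -1) > 0) {
                if (fds.revents & POLLPRI)
                        printf("memory pressure event; adjust memory.low here\n");
        }
        return 0;
}
```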
On Fri, Apr 1, 2022 at 12:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Mar 31, 2022 at 6:51 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> > [...]
> > As in my response to Michal's comment, userspace monitors introduce
> > latency. [...]
>
> LMKD is a reactive mechanism which does not know when memory pressure
> might rise, therefore its response latency matters.
> The usecases you mentioned seemed to imply that userspace was aware of the
> increased memory demands of the process (app startup time, maybe the
> moment the app becomes foreground, etc.). Therefore userspace could relax
> the memory allowance before that memory is requested. Was my understanding
> incorrect?

In terms of Android-specific scenarios, I think it is not the best choice to
have AMS relax the memory allowance directly when an activity arises, as:
1. AMS also has to face latency issues like LMKD does when polling the
   system's PSI values.
2. AMS needs to issue syscalls to act: a read of either the memcg
   usage/watermark and a write to memcg.min/low.
3. Processes in the CACHED adj are not capable of shrinking themselves
   except by waiting to be killed by LMKD.
IMO, this patch is like a pressure-aware per-process reclaim which does NOT
launch real reclaiming.
On Fri, Apr 1, 2022 at 7:34 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 01-04-22 09:34:02, Zhaoyang Huang wrote:
> > [...]
> > We can take either the system's io_wait or PSI_IO into consideration for
> > these.
>
> I do not follow. Let's say you have a stream IO workload which is mostly
> RO. Reclaiming those pages effectively means dropping them from the cache,
> so there is no IO involved during the reclaim. This will generate a
> constant flow of reclaim that shouldn't normally affect other workloads
> (as long as kswapd keeps up with the IO pace). How does your scheme cope
> with this scenario? My understanding is that it will simply relax the
> protection.

You are right. This scheme treats the system's memory pressure equally, no
matter whether it comes from high-order in-kernel page allocations or from
cache dropped by IO-like things. The decay_factor is composed of PSI_SOME
and PSI_FULL, which represent that the system is tight on memory; every
entity has the obligation to donate towards solving that.

> > [...]
> > A poll-then-react method in userspace requires a polling interval and a
> > response time. Take PSI as an example: it polls ten times during the
> > polling window while reporting only once, which introduces some latency.
>
> Do workload transitions happen so often in your situation that the
> interval really matters? As Suren already pointed out, starting a new
> application is usually an explicit event which can pro-actively update
> limits.

Yes. As in my reply to Suren's comment, even a proactive monitor service
that is aware of an activity starting (APP launching etc.) at the very first
moment still has to 1. read PSI and memcg->watermark/usage, 2. make a
decision, and 3. write memcg->memory.low to adjust the memory allowance.
Furthermore, monitors cannot supervise the APP for its whole lifetime, while
reclaim can arise at any time.
On Fri, Apr 1, 2022 at 10:18 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Fri, Apr 1, 2022 at 7:34 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > How does your scheme cope with this scenario? My understanding is that
> > it will simply relax the protection.
>
> You are right. This scheme treats the system's memory pressure equally, no
> matter whether it comes from high-order in-kernel page allocations or from
> cache dropped by IO-like things. The decay_factor is composed of PSI_SOME
> and PSI_FULL, which represent that the system is tight on memory; every
> entity has the obligation to donate towards solving that.
>
> > [...]
> > Do workload transitions happen so often in your situation that the
> > interval really matters? As Suren already pointed out, starting a new
> > application is usually an explicit event which can pro-actively update
> > limits.
>
> Yes. As in my reply to Suren's comment, even a proactive monitor service
> that is aware of an activity starting (APP launching etc.) at the very
> first moment still has to 1. read PSI and memcg->watermark/usage, 2. make
> a decision, and 3. write memcg->memory.low to adjust the memory allowance.
> Furthermore, monitors cannot supervise the APP for its whole lifetime,
> while reclaim can arise at any time.

Ok, sounds like you want this dynamic limit to be active all the time, not
only at specific points in the process's life cycle.
One thing that I don't understand in this approach is why memory.low should
depend on the system's memory pressure. It seems you want to allow a process
to allocate more when memory pressure is high. That is very
counter-intuitive to me. Could you please explain the underlying logic of
why this is the right thing to do, without going into technical details?
On Sun, Apr 3, 2022 at 11:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Fri, Apr 1, 2022 at 10:18 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
[...]
> Ok, sounds like you want this dynamic limit to be active all the time,
> not only at specific points in the process's life cycle.
Not sure yet. I think it would be better to cooperate with AMS-like
services, which can distinguish scenarios: for example, place
foreground/background processes into different memcgs with different
decay configurations, OR dynamically set up the parameters of a specific
memcg.
> One thing that I don't understand in this approach is: why should
> memory.low depend on the system's memory pressure? It seems you want to
> allow a process to allocate more when memory pressure is high. That is
> very counter-intuitive to me. Could you please explain the underlying
> logic of why this is the right thing to do, without going into
> technical details?
What I want to achieve is to make memory.low positively correlated with
time and negatively correlated with memory pressure, which means the
protected memcg should lower its protection (via a lower memory.low) to
help relieve the system's memory pressure when that pressure is high.
The concept behind this is that faulting back the memcg's dropped memory
is less important than the system's latency under high memory pressure.
Please refer to my new version's test data for more detail.
> > --
> > Michal Hocko
> > SUSE Labs
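The relation Zhaoyang describes, protection that decays with elapsed
time and is scaled down by pressure, condenses to the following
userspace restatement of the patch's calc_protected_low() (see the diff
at the end of the thread). Only the 2048 scale and the wm_dec_fact = 36
default mirror the posted code; everything else here is illustrative.

/*
 * Condensed userspace restatement of the posted decay logic: the
 * watermark decays by 1/2 per ~68s when wm_dec_fact = 36 (2048 stands
 * for "1.0"), then the PSI percentage scales the result down.
 */
#include <stdio.h>

static unsigned long long dynamic_low(unsigned long long watermark,
				      unsigned long long delta_ns,
				      unsigned int wm_dec_fact,
				      unsigned int psi_percent)
{
	unsigned long long decay = delta_ns >> (wm_dec_fact - 10);

	if (decay > 2048)
		decay = 2048;
	/* time decay of the historic watermark */
	watermark = watermark * (2048 - decay) / 2048;
	/* higher pressure means lower protection */
	return watermark * (100 - psi_percent) / 100;
}

int main(void)
{
	/*
	 * 500MB watermark, 68s elapsed (~2^36 ns), 20% memory pressure:
	 * time decay halves it to 250MB, pressure scales that to 200MB.
	 */
	unsigned long long low = dynamic_low(500ULL << 20, 1ULL << 36, 36, 20);

	printf("effective low = %lluMB\n", low >> 20);
	return 0;
}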
On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
[...]
> What I want to achieve is to make memory.low positively correlated with
> time and negatively correlated with memory pressure, which means the
> protected memcg should lower its protection (via a lower memory.low) to
> help relieve the system's memory pressure when that pressure is high.

I have to say this is still very confusing to me. The low limit is a
protection against external (e.g. global) memory pressure. Decreasing
the protection based on the external pressure sounds like it goes right
against the purpose of the knob. I can see reasons to update the
protection based on refaults or other metrics from userspace, but I
still do not see how this is a good auto-magic tuning done by the
kernel.

> The concept behind this is that faulting back the memcg's dropped
> memory is less important than the system's latency under high memory
> pressure.

Can you give some specific examples?

> Please refer to my new version's test data for more detail.

Please note that sending new RFCs will just make the discussion spread
over several email threads which will get increasingly hard to follow.
So do not post another version until it is really clear what actual
semantic you are proposing.
On Mon, Apr 4, 2022 at 4:51 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 04-04-22 10:33:58, Zhaoyang Huang wrote:
> [...]
> I have to say this is still very confusing to me. The low limit is a
> protection against external (e.g. global) memory pressure. Decreasing
> the protection based on the external pressure sounds like it goes right
> against the purpose of the knob. I can see reasons to update the
> protection based on refaults or other metrics from userspace, but I
> still do not see how this is a good auto-magic tuning done by the
> kernel.
>
> > The concept behind this is that faulting back the memcg's dropped
> > memory is less important than the system's latency under high memory
> > pressure.
>
> Can you give some specific examples?
For both of the above two comments, please refer to the latest test
results in the patch v2 I have sent. I prefer to describe my change as a
"focus transfer under pressure": the protected memcg is the focus while
the system's memory pressure is low and reclaim comes from root, which
is not against the current design. However, when global memory pressure
is high, the focus has to shift to the whole system, because it doesn't
make sense to keep the protected memcg apart from everybody else; it
cannot do anything anyway when the system is trapped in the kernel doing
reclaim work.
>
> > Please refer to my new version's test data for more detail.
>
> Please note that sending new RFCs will just make the discussion spread
> over several email threads which will get increasingly hard to follow.
> So do not post another version until it is really clear what actual
> semantic you are proposing.
ok, I will hold off until all the questions are settled.
> --
> Michal Hocko
> SUSE Labs
On Mon, Apr 4, 2022 at 5:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
[...]
> However, when global memory pressure is high, the focus has to shift to
> the whole system, because it doesn't make sense to keep the protected
> memcg apart from everybody else; it cannot do anything anyway when the
> system is trapped in the kernel doing reclaim work.
Does it make more sense if I describe the change as: the memcg will be
protected as long as the system pressure is under the threshold
(partially coherent with the current design), and the memcg will be
sacrificed if the pressure goes over the threshold (the added change)?
[...]
> > --
> > Michal Hocko
> > SUSE Labs
On Mon 04-04-22 17:23:43, Zhaoyang Huang wrote:
[...]
> Does it make more sense if I describe the change as: the memcg will be
> protected as long as the system pressure is under the threshold
> (partially coherent with the current design), and the memcg will be
> sacrificed if the pressure goes over the threshold (the added change)?

No, not really. For one, it is still really unclear why there should be
any difference in semantics between global and external memory pressure
in general. The low limit is always a protection from external pressure.
And what should the actual threshold be? The amount of reclaim performed,
the effectiveness of the reclaim, or what?
On Mon 04-04-22 11:32:28, Michal Hocko wrote:
[...]
> No, not really. For one, it is still really unclear why there should be
> any difference in semantics between global and external memory pressure
> in general. The low limit is always a protection from external pressure.
> And what should the actual threshold be? The amount of reclaim
> performed, the effectiveness of the reclaim, or what?

Btw. you might want to have a look at
http://lkml.kernel.org/r/20220331084151.2600229-1-yosryahmed@google.com
where a new interface to allow pro-active memory reclaim is discussed.
I think that this might turn out to be a better fit than an automagic
kernel manipulation of the low limit. It will require a user agent to
drive the reclaim though.
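Assuming the interface from the linked thread (a per-memcg
memory.reclaim file that accepts an amount to reclaim, which is what
that proposal describes), the user agent Michal mentions reduces to
something like the sketch below; the path and the amount are
illustrative assumptions.

/*
 * Sketch of a user agent driving the proposed proactive reclaim
 * interface: write a reclaim target into the memcg's memory.reclaim.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/example.slice/memory.reclaim", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "100M\n");	/* ask the kernel to reclaim ~100MB */
	fclose(f);
	return 0;
}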
On Mon, Apr 4, 2022 at 5:32 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 04-04-22 17:23:43, Zhaoyang Huang wrote:
[...]
> No, not really. For one, it is still really unclear why there should be
> any difference in semantics between global and external memory pressure
> in general. The low limit is always a protection from external pressure.
> And what should the actual threshold be? The amount of reclaim
> performed, the effectiveness of the reclaim, or what?
Please find the test results below, which show that the current design
keeps a stronger protection when system memory pressure is high. It
could be argued that the protected memcg loses its protection because
its usage drops too much; I would say that this is exactly the goal of
the change. Is it reasonable to let the whole system be trapped in
memory pressure while the memcg holds on to its memory? With regard to
the threshold, it is a dynamically decayed watermark value which
represents both the historic usage (the watermark) and the present usage
(it is updated to the new usage if the group expands again).

Actually, I have updated the code to make this opt-in, that is, an
opt-in type of memcg. The patch stays coherent with the original design
if the user wants to keep setting a fixed value, and additionally
provides a new kind of dynamically protected memcg that needs no
external monitor and no interaction.

We tested the change by comparing it with the current design on a
v5.4-based system with 3GB of RAM, using the steps below, from which we
can see that a fixed memory.low makes the system experience high memory
pressure while the group holds on to too much memory.

1. Set up the topology as in [1].
2. Place a memory-consuming process into B and have it consume 1GB of
   memory from userspace.
3. Generate global memory pressure via mlocking 1GB of memory (see the
   sketch after the results).
4. Watch B's memory.current and PSI_MEM.
5. Repeat steps 3 and 4 twice.

[1] Topology, with fixed low=500MB or low=600MB and
    wm_decay_factor=36 (decays 1/2 in 68s):

      A (low=500MB)
     /
    B (low=500MB)

What we observed:

           (Mlock 1GB)             (Mlock 2GB)         (stable)
           PSI_MEM, usage          PSI_MEM, usage      PSI_MEM, usage
low=600MB  s=23 f=17 u=720/600MB   s=91 f=48 u=202MB   s=68 f=32 u=106MB
low=500MB  s=22 f=13 u=660/530MB   s=88 f=50 u=156MB   s=30 f=20 u=120MB
patch      s=23 f=12 u=692/470MB   s=40 f=23 u=67MB    s=21 f=18 u=45MB

> --
> Michal Hocko
> SUSE Labs
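The pressure generator of step 3 above can be as small as the following
sketch; it is illustrative, not the actual test harness, and pinning 1GB
needs CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK.

/*
 * Sketch of step 3: pin 1GB of anonymous memory with mlock() to apply
 * unreclaimable global memory pressure.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 30;	/* 1GB */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 1, len);	/* fault the pages in */
	if (mlock(p, len)) {	/* pin them so reclaim cannot touch them */
		perror("mlock");
		return 1;
	}
	pause();		/* hold the pressure until killed */
	return 0;
}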
On Mon, Apr 4, 2022 at 5:36 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 04-04-22 11:32:28, Michal Hocko wrote:
[...]
> Btw. you might want to have a look at
> http://lkml.kernel.org/r/20220331084151.2600229-1-yosryahmed@google.com
> where a new interface to allow pro-active memory reclaim is discussed.
> I think that this might turn out to be a better fit than an automagic
> kernel manipulation of the low limit. It will require a user agent to
> drive the reclaim though.
Ok. But AFAIK, there are methods of this kind working as out-of-tree
code now, such as PPR on Android. As I replied to Suren, there is always
a latency issue with this scheme, since the agent has to poll for the
event, read the current status, and then write to launch the action.
This patch aims to solve part of these issues.
> --
> Michal Hocko
> SUSE Labs
On Mon 04-04-22 19:23:03, Zhaoyang Huang wrote:
[...]
> Please find the test results below, which show that the current design
> keeps a stronger protection when system memory pressure is high. It
> could be argued that the protected memcg loses its protection because
> its usage drops too much.

Yes, this is exactly how I see it. The memory low/min is a clear
decision of the administrator to protect the memory consumption of the
said memcg (or hierarchy) from external memory pressure.

> I would say that this is exactly the goal of the change. Is it
> reasonable to let the whole system be trapped in memory pressure while
> the memcg holds on to its memory?

I would argue that this is expected and reasonable indeed. You cannot
provide a protection without pushing the pressure onto others. Memory is
a finite resource.

> With regard to the threshold, it is a dynamically decayed watermark
> value which represents both the historic usage (the watermark) and the
> present usage (it is updated to the new usage if the group expands
> again). Actually, I have updated the code to make this opt-in, that is,
> an opt-in type of memcg. The patch stays coherent with the original
> design if the user wants to keep setting a fixed value, and
> additionally provides a new kind of dynamically protected memcg that
> needs no external monitor and no interaction.

The more I read here, the more I am convinced that hooking into the
low/min limits is simply wrong. If you want to achieve a more "clever"
way to balance memory reclaim among existing memcgs then fine, but
trying to achieve that by dynamically reinterpreting the low limits is
just an abuse of an existing interface IMO. What has led you to (ab)use
the low limit in the first place?

> We tested the change by comparing it with the current design on a
> v5.4-based system with 3GB of RAM, using the steps below, from which we
> can see that a fixed memory.low makes the system experience high memory
> pressure while the group holds on to too much memory.
>
> 1. Set up the topology as in [1].
> 2. Place a memory-consuming process into B and have it consume 1GB of
>    memory from userspace.
> 3. Generate global memory pressure via mlocking 1GB of memory.
> 4. Watch B's memory.current and PSI_MEM.
> 5. Repeat steps 3 and 4 twice.

This is effectively an OOM test, isn't it? Memory low protection will be
enforced as long as there is something reclaimable, but your memory
pressure is unreclaimable due to mlock, so it comes with a stronger
guarantee than the low limit and the protected memcg is therefore going
to be reclaimed.

Maybe I am just not following, but this makes less and less sense as I
read through it. So either I am missing something really significant or
we are just not on the same page.
On Mon, Apr 4, 2022 at 8:30 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 04-04-22 19:23:03, Zhaoyang Huang wrote:
[...]
> Yes, this is exactly how I see it. The memory low/min is a clear
> decision of the administrator to protect the memory consumption of the
> said memcg (or hierarchy) from external memory pressure.
Please note that this patch DOES protect the memcg when the external
pressure is 1GB, just as a fixed low does. Besides, how does the admin
decide the exact value of low/min if the usage expands from small to
several GB in a quickly changing scenario?
>
> I would argue that this is expected and reasonable indeed. You cannot
> provide a protection without pushing the pressure onto others. Memory
> is a finite resource.
Imagine you are playing a game on a 3GB mobile phone while a video call
is connected. You get either a frozen view of the call, because the game
possesses an improperly large low value, or the game killed by OOM.
IMHO, we could let the user choose to configure the memcg with dynamic
protection.
>
> The more I read here, the more I am convinced that hooking into the
> low/min limits is simply wrong. If you want to achieve a more "clever"
> way to balance memory reclaim among existing memcgs then fine, but
> trying to achieve that by dynamically reinterpreting the low limits is
> just an abuse of an existing interface IMO. What has led you to (ab)use
> the low limit in the first place?
>
> This is effectively an OOM test, isn't it? Memory low protection will
> be enforced as long as there is something reclaimable, but your memory
> pressure is unreclaimable due to mlock, so it comes with a stronger
> guarantee than the low limit and the protected memcg is therefore going
> to be reclaimed.
Mlock was used just to make the test easy. High pressure could equally
be introduced by file cache thrashing (mapped .so files have a very
short refault distance) or by kernel driver allocations (multimedia or
AI scenarios on Android). I will retest without mlock then.
>
> Maybe I am just not following, but this makes less and less sense as I
> read through it. So either I am missing something really significant or
> we are just not on the same page.
As I said, I would like this patch to provide an opt-in choice for users
to establish the kind of memcg they desire, without changing the current
design for anyone else.
> --
> Michal Hocko
> SUSE Labs
On Mon 04-04-22 21:14:40, Zhaoyang Huang wrote:
[...]
> Please note that this patch DOES protect the memcg when the external
> pressure is 1GB, just as a fixed low does.

This is getting more and more confusing (at least to me). Could you
describe the behavior of the reclaim for the following setups/situations?

a) mostly reclaiming clean page cache - via kswapd
b) same as above, but direct reclaim is necessary, though very
   lightweight
c) direct reclaim makes forward progress but not enough to satisfy the
   allocation request (so the reclaim has to be retried)
d) direct reclaim is not making progress and the low limit protection is
   ignored

Say we have several memcgs and only some have low memory protection
configured. What is the user-observable state of the protected group,
and when and by how much can the protection be updated?

I think it would also be helpful to describe the high-level semantics of
this feature.

> Besides, how does the admin decide the exact value of low/min if the
> usage expands from small to several GB in a quickly changing scenario?

This is not really related, is it? There are different ways to tune the
protection.

[...]
On Tue, Apr 5, 2022 at 8:08 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 04-04-22 21:14:40, Zhaoyang Huang wrote:
> [...]
> This is getting more and more confusing (at least to me). Could you
> describe the behavior of the reclaim for the following
> setups/situations?
>
> a) mostly reclaiming clean page cache - via kswapd
> b) same as above, but direct reclaim is necessary, though very
>    lightweight
> c) direct reclaim makes forward progress but not enough to satisfy the
>    allocation request (so the reclaim has to be retried)
> d) direct reclaim is not making progress and the low limit protection
>    is ignored
>
> Say we have several memcgs and only some have low memory protection
> configured. What is the user-observable state of the protected group,
> and when and by how much can the protection be updated?
I am not sure I understand you correctly. Do you have doubts about the
test result, thinking either that the protected memcg has no chance to
update its protection, or that global reclaim should have been satisfied
before reaching it (that is, step d is hard to reach)? Let me try to
answer under my understanding; please give me feedback if you need more
information. The protection is updated when
mem_cgroup_calculate_protection is called, during either kswapd or
direct reclaim, on each round of the priority reclaim, and the memcg's
lruvec is then reached in step d.
>
> I think it would also be helpful to describe the high-level semantics
> of this feature.
>
> > Besides, how does the admin decide the exact value of low/min if the
> > usage expands from small to several GB in a quickly changing
> > scenario?
>
> This is not really related, is it? There are different ways to tune the
> protection.
I don't think so. IMO, it is hard to protect a memcg whose usage covers
a wide and unpredictable range that depends on the scenario. Does the
EAS example from the scheduler make more sense? Compared with legacy
CFS, EAS also goes against some of the original design (load balancing,
etc.) by packing small tasks onto one core.
>
> [...]
> --
> Michal Hocko
> SUSE Labs
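As a rough model of "updated on each round of priority reclaim": every
round calls the update hook, but, as in the patch's calc_protected_low(),
a new value is only derived once the roughly two-second avg_next_update
window has expired. The toy below is a userspace simulation with fake
timestamps; all numbers are illustrative and this is not kernel code.

/*
 * Toy model of the rate-limited recalculation: reclaim rounds arrive
 * faster than the 2s update window, so most rounds keep the old low.
 */
#include <stdio.h>

struct group {
	unsigned long long watermark_kb;	/* decayed watermark */
	unsigned long long low_kb;		/* derived protection */
	unsigned long long next_update_ns;
};

static void update(struct group *g, unsigned long long now_ns,
		   unsigned int psi)
{
	if (now_ns < g->next_update_ns)
		return;				/* rate-limited, keep old low */
	g->low_kb = g->watermark_kb * (100 - psi) / 100;
	g->next_update_ns = now_ns + 2000000000ULL;	/* +2s */
}

int main(void)
{
	struct group g = { .watermark_kb = 512000 };	/* ~500MB */
	unsigned long long now = 0;

	for (int round = 0; round < 6; round++) {
		now += 700000000ULL;		/* rounds ~0.7s apart */
		update(&g, now, 10 * round);	/* pressure ramping up */
		printf("t=%.1fs low=%lluMB\n", now / 1e9, g.low_kb >> 10);
	}
	return 0;
}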
On Tue, Apr 5, 2022 at 8:08 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 04-04-22 21:14:40, Zhaoyang Huang wrote:
> [...]
> Say we have several memcgs and only some have low memory protection
> configured. What is the user-observable state of the protected group,
> and when and by how much can the protection be updated?
Ok. I guess you are asking why reclaim from the global LRU, or from
other unprotected memcgs, does not satisfy the request, so that the
protected memcg has to face reclaim at all? In my experience this is
common with a large number of userspace mallocs OR high-order
alloc_pages within the kernel. I have retested the previous case with
mlock removed and the trend of the results is the same: the pages on the
global LRU absorb part of the global memory pressure first, but the
reclaim finally reaches the protected memcg.
>
> I think it would also be helpful to describe the high-level semantics
> of this feature.
[...]
> --
> Michal Hocko
> SUSE Labs
On Wed 06-04-22 10:11:19, Zhaoyang Huang wrote:
[...]
> I am not sure I understand you correctly. Do you have doubts about the
> test result, thinking either that the protected memcg has no chance to
> update its protection, or that global reclaim should have been
> satisfied before reaching it (that is, step d is hard to reach)? Let me
> try to answer under my understanding; please give me feedback if you
> need more information. The protection is updated when
> mem_cgroup_calculate_protection is called, during either kswapd or
> direct reclaim, on each round of the priority reclaim, and the memcg's
> lruvec is then reached in step d.

This means that the limits are altered even if there is memory to be
reclaimed from other memcgs. Why? How does this line up with the basic
property of the low limit, to act as a protection from the reclaim?

> > I think it would also be helpful to describe the high-level semantics
> > of this feature.

Please focus on this part. Without the high-level semantics explained we
will not move forward.
On Thu, Apr 7, 2022 at 3:40 PM Michal Hocko <mhocko@suse.com> wrote:
> On Wed 06-04-22 10:11:19, Zhaoyang Huang wrote:
[...]
> This means that the limits are altered even if there is memory to be
> reclaimed from other memcgs. Why? How does this line up with the basic
> property of the low limit, to act as a protection from the reclaim?
Ok, I partially understand. I would say that low's original definition
has changed under this patch: the calculated low provides protection
only while the PSI value is below the setting, and it will invite
reclaim if the pressure exceeds it. This can also be seen from the
latest test results below (the same test as before, but without mlock),
which show that a memcg with a fixed low pushes the reclaim back to the
global LRU while keeping PSI high. Please note that the low value is
updated again when the usage rises above it, which means the protection
resumes when the memcg becomes active again.

                psi (global=1GB)   max    stable   psi (global=2GB)   max    stable
low=400MB       some=18 full=11    700MB  600MB    some=20 full=16    400MB  400MB
low=500MB       some=18 full=13    680MB  540MB    some=27 full=17    500MB  500MB
patch setting1  some=19 full=13    863MB  740MB    some=15 full=10    500MB  500MB
patch setting2  some=14 full=11    640MB  470MB    some=20 full=12    360MB  320MB

>
> > > I think it would also be helpful to describe the high-level
> > > semantics of this feature.
>
> Please focus on this part. Without the high-level semantics explained
> we will not move forward.
> --
> Michal Hocko
> SUSE Labs
[...]
On Thu 07-04-22 16:59:50, Zhaoyang Huang wrote:
> > This means that the limits are altered even if there is memory to be
> > reclaimed from other memcgs. Why? How does this line up with the
> > basic property of the low limit, to act as a protection from the
> > reclaim?
> Ok, I partially understand. I would say that low's original definition
> has changed under this patch: the calculated low provides protection
> only while the PSI value is below the setting, and it will invite
> reclaim if the pressure exceeds it.

OK, I guess I finally get to understand what you are trying to say. So
effectively your new semantic defines the low limit as an initial
protection that is very soft and only preserved under light global
memory pressure[1]. If the reclaim pressure is higher, the user-provided
protection is decreased. The new semantic is planned to be a global
opt-in.

Correct?

Now that I (believe) I have a slightly better understanding, I have to
say I really dislike the idea.

First of all, the new semantic would have to be memcg-reclaim-aware.
That means that the scaling logic would need to be aware of where the
memory pressure comes from. More importantly, I really dislike the idea
that user-provided input is updated by the kernel without the userspace
knowing about it. How is the userspace supposed to know that the value
has been changed? I can see how the effective values could be scaled
down, but this still sounds dubious, as the userspace would have a hard
time understanding what is going on under the cover, or even worse,
would tune the value based on the specific behavior observed on a
specific kernel, which would make this a maintenance burden going
forward.

All that being said, I have a hard time making sense of a protection
which decreases unpredictably based on a global metric without any
userspace view into why and how this is done. So I am afraid I have to
NACK this, and I really do recommend you to start a discussion about
your specific usecase and try to find a different solution.

Best regards

[1] this is based on the global PSI metric.
On Thu, Apr 7, 2022 at 5:44 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> OK, I guess I finally get to understand what you are trying to say. So
> effectively your new semantic defines the low limit as an initial
> protection that is very soft and only preserved under light global
> memory pressure[1]. If the reclaim pressure is higher, the
> user-provided protection is decreased. The new semantic is planned to
> be a global opt-in.
>
> Correct?
right. But I don't think the original protection is soft, which can be
seen from the test result: the memcg is protected within a certain range
of pressure, and it can also help to relieve the system by breaking the
low limit.
>
> First of all, the new semantic would have to be memcg-reclaim-aware.
> That means that the scaling logic would need to be aware of where the
> memory pressure comes from.
I don't follow. Does it mean that the protected memcg should distinguish
pressure coming from global reclaim and from other memcgs? I don't know
why.
> More importantly, I really dislike the idea that user-provided input
> is updated by the kernel without the userspace knowing about it. How
> is the userspace supposed to know that the value has been changed?
Actually, the purpose of this patch is to free userspace at runtime: it
requires a proper initial parameter setup and then lets the scheme
decide the rest.
> I can see how the effective values could be scaled down, but this
> still sounds dubious, as the userspace would have a hard time
> understanding what is going on under the cover, or even worse, would
> tune the value based on the specific behavior observed on a specific
> kernel, which would make this a maintenance burden going forward.
This kind of memcg is supposed to be used by users who are aware of the
scheme and want it to behave exactly this way.
>
> All that being said, I have a hard time making sense of a protection
> which decreases unpredictably based on a global metric without any
> userspace view into why and how this is done. So I am afraid I have to
> NACK this, and I really do recommend you to start a discussion about
> your specific usecase and try to find a different solution.
As I have mentioned before, the EAS scheduler is also a self-driven
scheme, based on load tracking and energy calculation. The user can also
find it hard to know when a scheduling entity will be placed on a big
core, and the admin can turn it off if they dislike it. I would like to
push this patch forward and get more feedback from real scenarios.
>
> Best regards
>
> [1] this is based on the global PSI metric.
> --
> Michal Hocko
> SUSE Labs
On Thu 07-04-22 20:36:51, Zhaoyang Huang wrote:
> On Thu, Apr 7, 2022 at 5:44 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [...]
> > On Thu 07-04-22 16:59:50, Zhaoyang Huang wrote:
> > > > This means that limits are altered even if there is memory to be
> > > > reclaimed from other memcgs. Why? How does this line up with the
> > > > basic property of the low limit to act as a protection from the reclaim?
> > > OK, I partially understand. I would like to say that low's original
> > > definition has changed under this patch: the calculated low just
> > > provides protection when the psi value is lower than the setting and
> > > will introduce reclaiming if it exceeds it.
> >
> > OK, I guess I finally get to understand what you are trying to say. So
> > effectively your new semantic defines the low limit as an initial
> > protection that is very soft and only preserved under light global
> > memory pressure[1]. If the reclaim pressure is higher, the user-provided
> > protection is decreased. The new semantic is planned to be a global
> > opt-in.
> >
> > Correct?
> Right. But I don't think the original protection is soft, which is
> shown by the test result: the memcg is protected within a certain range
> of pressure and can also help relieve the system by breaking the low
> limit.

Low limit protection is considered soft because it doesn't provide any
guarantee. It will be ignored (and that will be reported to the
userspace via the LOW event) if there is nothing reclaimable in the
scope of the reclaimed hierarchy. The alternative would actually be
OOM.

> > Now that I (believe I) have a slightly better understanding, I have to
> > say I really dislike the idea.
> >
> > First of all, the new semantic would have to be memcg-reclaim aware.
> > That means that the scaling logic would need to know where the memory
> > pressure comes from.
> I don't follow. Does it mean that the protection should distinguish
> pressure coming from global reclaim from pressure coming from other
> memcgs? I don't know why it should.

No, it should behave consistently for any external memory pressure. A
reclaimed memcg can have a different constraint applied depending on
the root of the reclaim. Your solution assumes the root is always the
root_memcg.
[...]
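To illustrate what "depending on the root of the reclaim" means in
practice, here is a deliberately simplified sketch in plain C. It is
not the kernel's mem_cgroup_calculate_protection(): the struct and
function names are invented, and the real code additionally scales
protection proportionally to usage. The two properties it shows are
the relevant ones, though: the memcg at which reclaim is rooted gets no
protection from its own low, and a child is never protected by more
than its ancestors grant.

/*
 * Simplified sketch (not the kernel implementation): how effective
 * protection can differ depending on which memcg the reclaim is
 * rooted at.
 */
struct mcg {
	struct mcg *parent;
	unsigned long low;	/* user-provided protection */
	unsigned long elow;	/* effective protection for this reclaim */
};

static void calc_protection(struct mcg *reclaim_root, struct mcg *m)
{
	if (m == reclaim_root) {
		/*
		 * Reclaim was triggered at this level: the group's own
		 * low must not protect it from itself.
		 */
		m->elow = 0;
		return;
	}
	m->elow = m->low;
	/* a child is never protected by more than its parent grants */
	if (m->parent && m->parent != reclaim_root &&
	    m->elow > m->parent->elow)
		m->elow = m->parent->elow;
}

Scaling the low value based only on the global PSI metric, as the RFC
does, cannot reproduce this per-reclaim-root behavior.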
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403..a510057 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,9 @@
 #include <linux/vmstat.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
+#include <linux/psi.h>
 
 struct mem_cgroup;
 struct obj_cgroup;
@@ -28,6 +31,8 @@
 struct mm_struct;
 struct kmem_cache;
 
+#define MEMCG_INTERVAL (2*HZ+1)	/* 2 sec intervals */
+
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
 	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
@@ -340,6 +345,10 @@ struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+	u64 wm_dec_fact;
+	u64 avg_next_update;
+	u64 avg_last_update;
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
@@ -608,6 +617,47 @@ static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+/*
+ * calculate memory.low based on the historic watermark and memory pressure
+ */
+static inline void calc_protected_low(struct mem_cgroup *group)
+{
+	u64 now, decay_factor;
+	u64 decayed_watermark;
+	u64 delta_time;
+
+	now = sched_clock();
+
+	if (!group->avg_next_update) {
+		group->avg_next_update = now + jiffies_to_nsecs(5*HZ);
+		return;
+	}
+
+	if (time_before((unsigned long)now, (unsigned long)group->avg_next_update))
+		return;
+
+	delta_time = group->avg_last_update ? now - group->avg_last_update : 0;
+	/*
+	 * we take 2048 as "1" and 68s decay 1/2(36bit) by default
+	 * decay_factor = 1024 * delta_time / 68s(0x1000000000)
+	 * 0.5(1024)/68s = decay_factor/delta_time ==> decay_factor = delta_time >> 26
+	 */
+	decay_factor = (2048 - min(2048ULL, delta_time >> (group->wm_dec_fact - 10)));
+	decayed_watermark = group->memory.decayed_watermark * decay_factor / 2048;
+	/* decay_factor: based on average memory pressure over elapsed time */
+	decay_factor = psi_mem_get(delta_time);
+	group->memory.low = decayed_watermark * (100 - decay_factor) / 100;
+
+	/*
+	 * avg_next_update: expected expire time according to current status
+	 */
+	group->memory.decayed_watermark = decayed_watermark;
+	group->avg_last_update = now;
+	group->avg_next_update = now + jiffies_to_nsecs(2*HZ);
+
+	return;
+}
+
 static inline void mem_cgroup_protection(struct mem_cgroup *root,
 					 struct mem_cgroup *memcg,
 					 unsigned long *min,
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 6795913..2720eb9f 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -25,8 +25,12 @@ struct page_counter {
 
 	/* legacy */
 	unsigned long watermark;
+	unsigned long decayed_watermark;
 	unsigned long failcnt;
 
+	/* proportional protection */
+	unsigned long min_prop;
+	unsigned long low_prop;
 	/*
 	 * 'parent' is placed here to be far from 'usage' to reduce
 	 * cache false sharing, as 'usage' is written mostly while
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 65eb147..6c76993 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -25,6 +25,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
 
+unsigned long psi_mem_get(unsigned long time);
+
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index dd80bd2..8d315e0 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -291,6 +291,24 @@ static void get_recent_times(struct psi_group *group, int cpu,
 	}
 }
 
+unsigned long psi_mem_get(unsigned long time_ns)
+{
+	unsigned long time_sec = time_ns / (1000 * 1000 * 1000);
+	unsigned long some, full;
+	if (time_sec < 10) {
+		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][0]);
+		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][0]);
+	} else if (time_sec < 60) {
+		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][1]);
+		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][1]);
+	} else {
+		some = LOAD_INT(psi_system.avg[PSI_MEM * 2][2]);
+		full = LOAD_INT(psi_system.avg[PSI_MEM * 2 + 1][2]);
+	}
+
+	return (some * 768 + full * 256) / 1024;
+}
+
 static void calc_avgs(unsigned long avg[3], int missed_periods,
 		      u64 time, u64 period)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 508bcea..6b579a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5188,6 +5188,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
+	memcg->wm_dec_fact = 36;
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
@@ -6616,6 +6617,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 	unsigned long usage, parent_usage;
 	struct mem_cgroup *parent;
+	unsigned long memcg_emin, memcg_elow, parent_emin, parent_elow;
+	unsigned long watermark;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -6642,6 +6645,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	if (!parent)
 		return;
 
+	calc_protected_low(memcg);
 	if (parent == root) {
 		memcg->memory.emin = READ_ONCE(memcg->memory.min);
 		memcg->memory.elow = READ_ONCE(memcg->memory.low);
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 7d83641..18abfdd 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -83,6 +83,8 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 		 */
 		if (new > READ_ONCE(c->watermark))
 			WRITE_ONCE(c->watermark, new);
+		if (new > READ_ONCE(c->decayed_watermark))
+			WRITE_ONCE(c->decayed_watermark, new);
 	}
 }
 
@@ -137,6 +139,8 @@ bool page_counter_try_charge(struct page_counter *counter,
 		 */
 		if (new > READ_ONCE(c->watermark))
 			WRITE_ONCE(c->watermark, new);
+		if (new > READ_ONCE(c->decayed_watermark))
+			WRITE_ONCE(c->decayed_watermark, new);
 	}
 
 	return true;
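As a sanity check of the fixed-point arithmetic in calc_protected_low()
above: a single decay step of delta_time nanoseconds multiplies the
watermark by (2048 - (delta_time >> (wm_dec_fact - 10))) / 2048, so
with the default wm_dec_fact of 36 one uninterrupted step of 2^36 ns
(about 68.7s) roughly halves it, matching the comment in the patch.
The sketch below reproduces that single-step arithmetic in userspace;
all names are local to the sketch, and note that the in-kernel code
recomputes every ~2s with small delta_time values, so the compounded
curve decays somewhat differently from this one-shot calculation.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t wm_dec_fact = 36;	/* patch default */
	const uint64_t watermark = 131072;	/* hypothetical, in pages */
	uint64_t secs;

	for (secs = 0; secs <= 136; secs += 17) {
		uint64_t delta_ns = secs * 1000000000ULL;
		uint64_t step = delta_ns >> (wm_dec_fact - 10);
		uint64_t factor = 2048 - (step > 2048 ? 2048 : step);

		/* 2^36 ns ~= 68.7s, where factor falls to ~1024 (one half) */
		printf("t=%3llus factor=%4llu/2048 decayed_wm=%llu\n",
		       (unsigned long long)secs,
		       (unsigned long long)factor,
		       (unsigned long long)(watermark * factor / 2048));
	}
	return 0;
}

Note also that psi_mem_get() in the patch composes the pressure value
as (some * 768 + full * 256) / 1024, i.e. a 75%/25% weighting of the
some/full averages.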