Message ID: 20220226204144.1008339-3-bigeasy@linutronix.de (mailing list archive)
State: New
Series: mm/memcg: Address PREEMPT_RT problems instead of disabling it.
On 26/02/22 21:41, Sebastian Andrzej Siewior wrote:
> During the integration of PREEMPT_RT support, the code flow around
> memcg_check_events() resulted in `twisted code'. Moving the code around
> and avoiding then would then lead to an additional local-irq-save
> section within memcg_check_events(). While looking better, it adds a
> local-irq-save section to code flow which is usually within an
> local-irq-off block on non-PREEMPT_RT configurations.
>

Hey, sorry for necro'ing a year-old thread - would you happen to remember
what the issues were with memcg_check_events()? I ran tests against
cgroupv1 using an eventfd on OOM with the usual debug arsenal and didn't
detect anything, I'm guessing it has to do with the IRQ-off region
memcg_check_events() is called from?

I want cgroupv1 to die as much as the next person, but in that specific
situation I kinda need cgroupv1 to behave somewhat sanely on RT with
threshold events :/

Cheers
On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
> On 26/02/22 21:41, Sebastian Andrzej Siewior wrote:
> > During the integration of PREEMPT_RT support, the code flow around
> > memcg_check_events() resulted in `twisted code'. Moving the code around
> > and avoiding then would then lead to an additional local-irq-save
> > section within memcg_check_events(). While looking better, it adds a
> > local-irq-save section to code flow which is usually within an
> > local-irq-off block on non-PREEMPT_RT configurations.
> >
>
> Hey, sorry for necro'ing a year-old thread - would you happen to remember
> what the issues were with memcg_check_events()? I ran tests against
> cgroupv1 using an eventfd on OOM with the usual debug arsenal and didn't
> detect anything, I'm guessing it has to do with the IRQ-off region
> memcg_check_events() is called from?

I would have to look into details but IIRC the resulting code to make
the code RT safe was dreaded and hard to maintain as a result. As we
didn't really have any real life usecase, disabling the code was an
easier way to go forward. So it is not that the code would be impossible
to enable for RT, it just doesn't seem to be worth all the complexity.

> I want cgroupv1 to die as much as the next person, but in that specific
> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
> threshold events :/

Could you expand on the usecase?
On 02/03/23 08:45, Michal Hocko wrote:
> On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
>> On 26/02/22 21:41, Sebastian Andrzej Siewior wrote:
>> > During the integration of PREEMPT_RT support, the code flow around
>> > memcg_check_events() resulted in `twisted code'. Moving the code around
>> > and avoiding then would then lead to an additional local-irq-save
>> > section within memcg_check_events(). While looking better, it adds a
>> > local-irq-save section to code flow which is usually within an
>> > local-irq-off block on non-PREEMPT_RT configurations.
>> >
>>
>> Hey, sorry for necro'ing a year-old thread - would you happen to remember
>> what the issues were with memcg_check_events()? I ran tests against
>> cgroupv1 using an eventfd on OOM with the usual debug arsenal and didn't
>> detect anything, I'm guessing it has to do with the IRQ-off region
>> memcg_check_events() is called from?
>
> I would have to look into details but IIRC the resulting code to make
> the code RT safe was dreaded and hard to maintain as a result. As we
> didn't really have any real life usecase, disabling the code was an
> easier way to go forward. So it is not that the code would be impossible
> to enable for RT, it just doesn't seem to be worth all the complexity.
>

Right, thanks for having a look.

>> I want cgroupv1 to die as much as the next person, but in that specific
>> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
>> threshold events :/
>
> Could you expand on the usecase?
>

In this case it's just some middleware leveraging memcontrol cgroups and
setting up callbacks for in-cgroup OOM events. This is a supported feature
in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
parity, but rather one of being in a transitional phase where the
middleware itself hasn't fully migrated to using cgroupv2.

> --
> Michal Hocko
> SUSE Labs
On Thu 02-03-23 10:18:31, Valentin Schneider wrote:
> On 02/03/23 08:45, Michal Hocko wrote:
> > On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
[...]
> >> I want cgroupv1 to die as much as the next person, but in that specific
> >> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
> >> threshold events :/
> >
> > Could you expand on the usecase?
> >
>
> In this case it's just some middleware leveraging memcontrol cgroups and
> setting up callbacks for in-cgroup OOM events. This is a supported feature
> in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
> parity, but rather one of being in a transitional phase where the
> middleware itself hasn't fully migrated to using cgroupv2.

How is this related to the RT kernel config? memcg OOM vs any RT
assumptions do not really get along well AFAICT.
On 02/03/23 12:24, Michal Hocko wrote:
> On Thu 02-03-23 10:18:31, Valentin Schneider wrote:
>> On 02/03/23 08:45, Michal Hocko wrote:
>> > On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
> [...]
>> >> I want cgroupv1 to die as much as the next person, but in that specific
>> >> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
>> >> threshold events :/
>> >
>> > Could you expand on the usecase?
>> >
>>
>> In this case it's just some middleware leveraging memcontrol cgroups and
>> setting up callbacks for in-cgroup OOM events. This is a supported feature
>> in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
>> parity, but rather one of being in a transitional phase where the
>> middleware itself hasn't fully migrated to using cgroupv2.
>
> How is this related to the RT kernel config? memcg OOM vs any RT
> assumptions do not really get along well AFAICT.
>

Yep. AIUI the tasks actually relying on RT guarantees DTRT (at least
regarding memory allocations, or lack thereof), but other non-RT-reliant
tasks on other CPUs come and go, hence the memcg involvement.

> --
> Michal Hocko
> SUSE Labs
On Thu 02-03-23 12:30:33, Valentin Schneider wrote:
> On 02/03/23 12:24, Michal Hocko wrote:
> > On Thu 02-03-23 10:18:31, Valentin Schneider wrote:
> >> On 02/03/23 08:45, Michal Hocko wrote:
> >> > On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
> > [...]
> >> >> I want cgroupv1 to die as much as the next person, but in that specific
> >> >> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
> >> >> threshold events :/
> >> >
> >> > Could you expand on the usecase?
> >> >
> >>
> >> In this case it's just some middleware leveraging memcontrol cgroups and
> >> setting up callbacks for in-cgroup OOM events. This is a supported feature
> >> in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
> >> parity, but rather one of being in a transitional phase where the
> >> middleware itself hasn't fully migrated to using cgroupv2.
> >
> > How is this related to the RT kernel config? memcg OOM vs any RT
> > assumptions do not really get along well AFAICT.
> >
>
> Yep. AIUI the tasks actually relying on RT guarantees DTRT (at least
> regarding memory allocations, or lack thereof), but other non-RT-reliant
> tasks on other CPUs come and go, hence the memcg involvement.

So are you suggesting that the RT kernel is used for mixed bag of
workloads with RT and non RT assumptions? Is this really a reasonable
and reliable setup?

What I am trying to evaluate here is whether it makes sense to support
and maintain a non-trivial code for something that might be working
sub-optimally or even not working properly in some corner cases. The
price for the maintenance is certainly not free.
On 02/03/23 13:56, Michal Hocko wrote:
> On Thu 02-03-23 12:30:33, Valentin Schneider wrote:
>> On 02/03/23 12:24, Michal Hocko wrote:
>> > On Thu 02-03-23 10:18:31, Valentin Schneider wrote:
>> >> On 02/03/23 08:45, Michal Hocko wrote:
>> >> > On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
>> > [...]
>> >> >> I want cgroupv1 to die as much as the next person, but in that specific
>> >> >> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
>> >> >> threshold events :/
>> >> >
>> >> > Could you expand on the usecase?
>> >> >
>> >>
>> >> In this case it's just some middleware leveraging memcontrol cgroups and
>> >> setting up callbacks for in-cgroup OOM events. This is a supported feature
>> >> in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
>> >> parity, but rather one of being in a transitional phase where the
>> >> middleware itself hasn't fully migrated to using cgroupv2.
>> >
>> > How is this related to the RT kernel config? memcg OOM vs any RT
>> > assumptions do not really get along well AFAICT.
>> >
>>
>> Yep. AIUI the tasks actually relying on RT guarantees DTRT (at least
>> regarding memory allocations, or lack thereof), but other non-RT-reliant
>> tasks on other CPUs come and go, hence the memcg involvement.
>
> So are you suggesting that the RT kernel is used for mixed bag of
> workloads with RT and non RT assumptions? Is this really a reasonable
> and reliable setup?
>

To some extent :-)

> What I am trying to evaluate here is whether it makes sense to support
> and maintain a non-trivial code for something that might be working
> sub-optimally or even not working properly in some corner cases. The
> price for the maintenance is certainly not free.

That's also what I'm trying to make sense of. Either way this will be
frankenkernel territory, the cgroupv1 ship has already sailed for upstream
IMO.
On 02/03/23 14:34, Valentin Schneider wrote:
> On 02/03/23 13:56, Michal Hocko wrote:
>> What I am trying to evaluate here is whether it makes sense to support
>> and maintain a non-trivial code for something that might be working
>> sub-optimally or even not working properly in some corner cases. The
>> price for the maintenance is certainly not free.
>
> That's also what I'm trying to make sense of. Either way this will be
> frankenkernel territory, the cgroupv1 ship has already sailed for upstream
> IMO.

So, if someone ends up in a similar situation and considers kludging those
notifications back in: Don't. Have a look at:

memcg_check_events()
`\
  mem_cgroup_threshold()
  `\
    __mem_cgroup_threshold()
    `\
      eventfd_signal()

Having IRQs off, percpu reads, and the eventfd signal is a nice recipe for
disaster. The OOM notification sits outside of that, but any other memcg
notification will cause pain.
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index faac50149a222..2cc502a75ef64 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -64,6 +64,7 @@ Brief summary of control files.
 				 threads
 cgroup.procs			 show list of processes
 cgroup.event_control		 an interface for event_fd()
+				 This knob is not available on CONFIG_PREEMPT_RT systems.
 memory.usage_in_bytes		 show current usage for memory
 				 (See 5.5 for details)
 memory.memsw.usage_in_bytes	 show current usage for memory+Swap
@@ -75,6 +76,7 @@ Brief summary of control files.
 memory.max_usage_in_bytes	 show max memory usage recorded
 memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded
 memory.soft_limit_in_bytes	 set/show soft limit of memory usage
+				 This knob is not available on CONFIG_PREEMPT_RT systems.
 memory.stat			 show various statistics
 memory.use_hierarchy		 set/show hierarchical account enabled
 				 This knob is deprecated and shouldn't be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ab2dc75e70ec..0b5117ed2ae08 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -859,6 +859,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  */
 static void memcg_check_events(struct mem_cgroup *memcg, int nid)
 {
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		return;
+
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -3731,8 +3734,12 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		}
 		break;
 	case RES_SOFT_LIMIT:
-		memcg->soft_limit = nr_pages;
-		ret = 0;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			ret = -EOPNOTSUPP;
+		} else {
+			memcg->soft_limit = nr_pages;
+			ret = 0;
+		}
 		break;
 	}
 	return ret ?: nbytes;
@@ -4708,6 +4715,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	char *endp;
 	int ret;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		return -EOPNOTSUPP;
+
 	buf = strstrip(buf);
 
 	efd = simple_strtoul(buf, &endp, 10);