Message ID | 20900d89-b06d-2ec6-0ae0-beffc5874f26@I-love.SAKURA.ne.jp (mailing list archive) |
---|---|
State | New, archived |
Series | [v3] mm,page_alloc: wait for oom_lock before retrying. |
On Thu 14-02-19 01:30:28, Tetsuo Handa wrote:
[...]
> >From 63c5c8ee7910fa9ef1c4067f1cb35a779e9d582c Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue, 12 Feb 2019 20:12:35 +0900
> Subject: [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.
>
> When many hundreds of threads concurrently trigger a page fault and one
> of them invokes the global OOM killer, the owner of oom_lock is
> preempted for minutes, because the other threads deprive it of CPU time
> rather than waiting for it to make progress. We don't want to disable
> preemption while holding oom_lock, but we do want the owner of oom_lock
> to complete as soon as possible.
>
> Thus, this patch kills the dangerous assumption that sleeping for one
> jiffy is sufficient for allowing the owner of oom_lock to make progress.

How does this prevent any _other_ kernel path, or even high-priority
userspace, from preempting the oom killer path? This was the essential
question the last time around and I do not see it covered here. I
strongly suspect that all these games with the locking are just
pointless tuning for an insane workload without fixing the underlying
issue.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/page_alloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 35fdde0..c867513 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3618,7 +3618,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>           */
>          if (!mutex_trylock(&oom_lock)) {
>                  *did_some_progress = 1;
> -                schedule_timeout_uninterruptible(1);
> +                if (mutex_lock_killable(&oom_lock) == 0)
> +                        mutex_unlock(&oom_lock);
> +                else if (!tsk_is_oom_victim(current))
> +                        schedule_timeout_uninterruptible(1);
>                  return NULL;
>          }
>
> --
> 1.8.3.1
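
To make the contention argument concrete, here is a minimal userspace
analogue (an illustrative sketch, not kernel code; the file name
oom_lock_demo.c, the waiter count, and the busy loop are invented for the
example). One thread holds a mutex and does CPU-bound work, standing in for
the oom_lock holder printing the OOM report, while many waiters either poll
with trylock plus a short sleep, mimicking the
schedule_timeout_uninterruptible(1) retry, or simply block on the mutex,
mimicking a mutex_lock_killable() wait:

/*
 * oom_lock_demo.c - illustrative sketch, not kernel code.
 * Build with: cc -O2 -pthread oom_lock_demo.c -o oom_lock_demo
 * Run "./oom_lock_demo 1" for the polling mode and "./oom_lock_demo" for the
 * blocking mode, then compare how long the "critical section" takes when
 * runnable threads outnumber CPUs.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NR_WAITERS 256

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int done;
static int use_polling;         /* 1: trylock + short sleep, 0: blocking lock */

static void *waiter(void *arg)
{
        (void)arg;
        while (!atomic_load(&done)) {
                if (use_polling) {
                        if (pthread_mutex_trylock(&lock) == 0)
                                pthread_mutex_unlock(&lock);
                        usleep(1000);   /* roughly "one jiffy" of retry sleep */
                } else {
                        /* sleep until the holder releases the lock */
                        pthread_mutex_lock(&lock);
                        pthread_mutex_unlock(&lock);
                }
        }
        return NULL;
}

static double now(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
        pthread_t tid[NR_WAITERS];
        volatile unsigned long work = 0;
        double start;
        int i;

        use_polling = argc > 1 && atoi(argv[1]);

        pthread_mutex_lock(&lock);      /* play the role of the oom_lock holder */
        for (i = 0; i < NR_WAITERS; i++)
                pthread_create(&tid[i], NULL, waiter, NULL);

        start = now();
        for (i = 0; i < 200000000; i++) /* CPU-bound "OOM report" work */
                work++;
        printf("critical section took %.2f seconds (polling=%d, work=%lu)\n",
               now() - start, use_polling, work);

        pthread_mutex_unlock(&lock);
        atomic_store(&done, 1);
        for (i = 0; i < NR_WAITERS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

Any difference between the two modes is only expected to show up when the
machine is saturated; that matches the report above, which is about many
hundreds of page-faulting threads rather than the common case.
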
On 2019/02/14 1:56, Michal Hocko wrote:
> On Thu 14-02-19 01:30:28, Tetsuo Handa wrote:
> [...]
>> >From 63c5c8ee7910fa9ef1c4067f1cb35a779e9d582c Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Date: Tue, 12 Feb 2019 20:12:35 +0900
>> Subject: [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.
>>
>> When many hundreds of threads concurrently trigger a page fault and one
>> of them invokes the global OOM killer, the owner of oom_lock is
>> preempted for minutes, because the other threads deprive it of CPU time
>> rather than waiting for it to make progress. We don't want to disable
>> preemption while holding oom_lock, but we do want the owner of oom_lock
>> to complete as soon as possible.
>>
>> Thus, this patch kills the dangerous assumption that sleeping for one
>> jiffy is sufficient for allowing the owner of oom_lock to make progress.
>
> How does this prevent any _other_ kernel path, or even high-priority
> userspace, from preempting the oom killer path? This was the essential
> question the last time around and I do not see it covered here. I

Since you already NACKed disabling preemption at
https://marc.info/?i=20180322114002.GC23100@dhcp22.suse.cz , pointing out
that "even high-priority userspace" can preempt the oom killer path is
invalid.

Since printk() is very slow, dump_header() can become slow, especially
when dump_tasks() is called. Changing dump_tasks() to use rcu_lock_break()
does not solve this problem either, because the problem is that once the
current thread releases the CPU, it can remain preempted for minutes.

Allowing the OOM path to be preempted is what you prefer, isn't it? Then,
spending CPU time on something useful (what you call "any _other_ kernel
path") is an acceptable reason for delaying the OOM path. But wasting CPU
time when allocating threads can do nothing but wait for the owner of
oom_lock to complete the OOM path is not an acceptable reason for delaying
the OOM path.

Thus, there is nothing to cover for your "I do not see it covered here"
response, except how to avoid "wasting CPU time when allocating threads can
do nothing but wait for the owner of oom_lock to complete the OOM path".

> strongly suspect that all these games with the locking are just
> pointless tuning for an insane workload without fixing the underlying
> issue.

We could even change oom_lock to a local lock inside oom_kill_process(),
because all threads in the same allocating context will select the same
OOM victim (except in the oom_kill_allocating_task case), and the many
threads already inside oom_kill_process() will refrain from selecting the
next OOM victim. Although this approach wastes some CPU time on needlessly
selecting the same OOM victim many times, it also solves this problem.

It seems that you don't want to acknowledge "wasting CPU time when
allocating threads can do nothing but wait for the owner of oom_lock to
complete the OOM path" as the underlying issue. But we can't fix it
without throttling the direct reclaim paths; that is why this problem has
remained unfixed for many years.

Please stop NACKing proposals without offering alternative implementations.
---
 drivers/tty/sysrq.c |   2 -
 include/linux/oom.h |   2 -
 mm/memcontrol.c     |  11 +-----
 mm/oom_kill.c       | 106 ++++++++++++++++++++++++++++++++++++++--------------
 mm/page_alloc.c     |  18 +--------
 5 files changed, 81 insertions(+), 58 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index fa0ce7d..aa7f857 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -369,10 +369,8 @@ static void moom_callback(struct work_struct *ignored)
                 .order = -1,
         };
 
-        mutex_lock(&oom_lock);
         if (!out_of_memory(&oc))
                 pr_info("OOM request ignored. No task eligible\n");
-        mutex_unlock(&oom_lock);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index d079920..865fddf 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -54,8 +54,6 @@ struct oom_control {
         enum oom_constraint constraint;
 };
 
-extern struct mutex oom_lock;
-
 static inline void set_current_oom_origin(void)
 {
         current->signal->oom_flag_origin = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fc2e1a..31d3314 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1399,17 +1399,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
                 .gfp_mask = gfp_mask,
                 .order = order,
         };
-        bool ret;
 
-        if (mutex_lock_killable(&oom_lock))
-                return true;
-        /*
-         * A few threads which were not waiting at mutex_lock_killable() can
-         * fail to bail out. Therefore, check again after holding oom_lock.
-         */
-        ret = should_force_charge() || out_of_memory(&oc);
-        mutex_unlock(&oom_lock);
-        return ret;
+        return should_force_charge() || out_of_memory(&oc);
 }
 
 #if MAX_NUMNODES > 1
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a24848..db804fa 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -54,14 +54,10 @@
 int sysctl_oom_dump_tasks = 1;
 
 /*
- * Serializes oom killer invocations (out_of_memory()) from all contexts to
- * prevent from over eager oom killing (e.g. when the oom killer is invoked
- * from different domains).
- *
  * oom_killer_disable() relies on this lock to stabilize oom_killer_disabled
  * and mark_oom_victim
  */
-DEFINE_MUTEX(oom_lock);
+static DEFINE_SPINLOCK(oom_disable_lock);
 
 #ifdef CONFIG_NUMA
 /**
@@ -430,7 +426,10 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 
 static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
 {
+        static DEFINE_SPINLOCK(lock);
+
         /* one line summary of the oom killer context. */
+        spin_lock(&lock);
         pr_info("oom-kill:constraint=%s,nodemask=%*pbl",
                         oom_constraint_text[oc->constraint],
                         nodemask_pr_args(oc->nodemask));
@@ -438,6 +437,7 @@ static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
         mem_cgroup_print_oom_context(oc->memcg, victim);
         pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid,
                 from_kuid(&init_user_ns, task_uid(victim)));
+        spin_unlock(&lock);
 }
 
 static void dump_header(struct oom_control *oc, struct task_struct *p)
@@ -677,9 +677,6 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
  *
- * Has to be called with oom_lock held and never after
- * oom has been disabled already.
- *
  * tsk->mm has to be non NULL and caller has to guarantee it is stable (either
  * under task_lock or operate on the current).
  */
@@ -687,11 +684,6 @@ static void mark_oom_victim(struct task_struct *tsk)
 {
         struct mm_struct *mm = tsk->mm;
 
-        WARN_ON(oom_killer_disabled);
-        /* OOM killer might race with memcg OOM */
-        if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
-                return;
-
         /* oom_mm is bound to the signal struct life time. */
         if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
                 mmgrab(tsk->signal->oom_mm);
@@ -704,8 +696,14 @@ static void mark_oom_victim(struct task_struct *tsk)
          * any memory and livelock. freezing_slow_path will tell the freezer
          * that TIF_MEMDIE tasks should be ignored.
          */
-        __thaw_task(tsk);
-        atomic_inc(&oom_victims);
+        if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+                return;
+        spin_lock(&oom_disable_lock);
+        if (!oom_killer_disabled) {
+                __thaw_task(tsk);
+                atomic_inc(&oom_victims);
+        }
+        spin_unlock(&oom_disable_lock);
         trace_mark_victim(tsk->pid);
 }
 
@@ -752,10 +750,9 @@ bool oom_killer_disable(signed long timeout)
          * Make sure to not race with an ongoing OOM killer. Check that the
          * current is not killed (possibly due to sharing the victim's memory).
          */
-        if (mutex_lock_killable(&oom_lock))
-                return false;
+        spin_lock(&oom_disable_lock);
         oom_killer_disabled = true;
-        mutex_unlock(&oom_lock);
+        spin_unlock(&oom_disable_lock);
 
         ret = wait_event_interruptible_timeout(oom_victims_wait,
                         !atomic_read(&oom_victims), timeout);
@@ -942,7 +939,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
         struct task_struct *victim = oc->chosen;
         struct mem_cgroup *oom_group;
         static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-                                              DEFAULT_RATELIMIT_BURST);
+                                      DEFAULT_RATELIMIT_BURST);
+        static DEFINE_MUTEX(oom_lock);
+        bool locked;
 
         /*
          * If the task is already exiting, don't alarm the sysadmin or kill
@@ -959,6 +958,20 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
         }
         task_unlock(victim);
 
+        /*
+         * Try to yield CPU time for dump_header(). But tolerate mixed printk()
+         * output if current thread was killed, for exiting quickly is better.
+         */
+        locked = !mutex_lock_killable(&oom_lock);
+        /*
+         * Try to avoid reporting same OOM victim for multiple times, for
+         * concurrently allocating threads are waiting (after selecting the
+         * same OOM victim) at mutex_lock_killable() above.
+         */
+        if (tsk_is_oom_victim(victim)) {
+                put_task_struct(victim);
+                goto done;
+        }
         if (__ratelimit(&oom_rs))
                 dump_header(oc, victim);
 
@@ -980,6 +993,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
                                       (void*)message);
                 mem_cgroup_put(oom_group);
         }
+ done:
+        if (locked)
+                mutex_unlock(&oom_lock);
 }
 
 /*
@@ -1032,16 +1048,54 @@ int unregister_oom_notifier(struct notifier_block *nb)
  */
 bool out_of_memory(struct oom_control *oc)
 {
-        unsigned long freed = 0;
         enum oom_constraint constraint = CONSTRAINT_NONE;
 
         if (oom_killer_disabled)
                 return false;
 
         if (!is_memcg_oom(oc)) {
-                blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-                if (freed > 0)
-                        /* Got some memory back in the last second. */
+                /*
+                 * This mutex makes sure that concurrently allocating threads
+                 * yield CPU time for the OOM notifier reclaim attempt.
+                 *
+                 * Since currently we are not doing the last second allocation
+                 * attempt after the OOM notifier reclaim attempt, we need to
+                 * use heuristically decreasing last_freed for concurrently
+                 * allocating threads (which can become dozens to hundreds).
+                 */
+                static DEFINE_MUTEX(lock);
+                static unsigned long last_freed;
+                unsigned long freed;
+
+                if (mutex_trylock(&lock)) {
+                        last_freed /= 2;
+                        freed = last_freed;
+force_notify:
+                        blocking_notifier_call_chain(&oom_notify_list, 0,
+                                                     &freed);
+                        last_freed = freed;
+                        mutex_unlock(&lock);
+                } else if (mutex_lock_killable(&lock) == 0) {
+                        /*
+                         * If someone is waiting at mutex_lock_killable() here,
+                         * mutex_trylock() above can't succeed. Therefore,
+                         * although this path should be fast enough, let's be
+                         * prepared for the worst case where nobody can update
+                         * last_freed from mutex_trylock() path above.
+                         */
+                        if (!last_freed)
+                                goto force_notify;
+                        freed = last_freed--;
+                        mutex_unlock(&lock);
+                } else {
+                        /*
+                         * If killable wait failed, task_will_free_mem(current)
+                         * below likely returns true. Therefore, just fall
+                         * through here.
+                         */
+                        freed = 0;
+                }
+                if (freed)
                         return true;
         }
 
@@ -1104,8 +1158,7 @@ bool out_of_memory(struct oom_control *oc)
 
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * memory-hogging task.
  */
 void pagefault_out_of_memory(void)
 {
@@ -1120,8 +1173,5 @@ void pagefault_out_of_memory(void)
         if (mem_cgroup_oom_synchronize(true))
                 return;
 
-        if (!mutex_trylock(&oom_lock))
-                return;
         out_of_memory(&oc);
-        mutex_unlock(&oom_lock);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ff93ad..dce737e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3775,24 +3775,11 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
         *did_some_progress = 0;
 
         /*
-         * Acquire the oom lock. If that fails, somebody else is
-         * making progress for us.
-         */
-        if (!mutex_trylock(&oom_lock)) {
-                *did_some_progress = 1;
-                schedule_timeout_uninterruptible(1);
-                return NULL;
-        }
-
-        /*
          * Go through the zonelist yet one more time, keep very high watermark
          * here, this is only to catch a parallel oom killing, we must fail if
-         * we're still under heavy pressure. But make sure that this reclaim
-         * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-         * allocation which will never fail due to oom_lock already held.
+         * we're still under heavy pressure.
          */
-        page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-                                      ~__GFP_DIRECT_RECLAIM, order,
+        page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL), order,
                                       ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
         if (page)
                 goto out;
@@ -3843,7 +3830,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
                         ALLOC_NO_WATERMARKS, ac);
         }
 out:
-        mutex_unlock(&oom_lock);
         return page;
 }
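
For readers puzzling over the last_freed arithmetic in the out_of_memory()
hunk above, the following standalone sketch mimics it in userspace (an
illustration under assumptions: fake_notifier_reclaim() is a made-up
stand-in for blocking_notifier_call_chain(&oom_notify_list, 0, &freed), and
the reclaimed amount is arbitrary). The thread that wins the trylock decays
the cached estimate and refreshes it from the notifiers; threads that only
obtain the killable lock consume the cached value until it reaches zero, at
which point the notifiers are invoked again:

/* last_freed_demo.c - illustrative sketch of the caching heuristic only. */
#include <stdio.h>

static unsigned long last_freed;

/* pretend the registered OOM notifiers reclaimed a few pages */
static void fake_notifier_reclaim(unsigned long *freed)
{
        *freed += 8;
}

/* path taken by the thread that won mutex_trylock(&lock) */
static unsigned long winner_path(void)
{
        unsigned long freed;

        last_freed /= 2;                /* decay the stale estimate */
        freed = last_freed;
        fake_notifier_reclaim(&freed);  /* notifiers add what they freed */
        last_freed = freed;
        return freed;
}

/* path taken by threads that only got mutex_lock_killable(&lock) */
static unsigned long waiter_path(void)
{
        if (!last_freed) {              /* the "force_notify" fallback */
                unsigned long freed = 0;

                fake_notifier_reclaim(&freed);
                last_freed = freed;
                return freed;
        }
        return last_freed--;
}

int main(void)
{
        int i;

        printf("winner:    freed=%lu\n", winner_path());
        for (i = 0; i < 12; i++)
                printf("waiter %2d: freed=%lu (cache now %lu)\n",
                       i, waiter_path(), last_freed);
        printf("winner:    freed=%lu\n", winner_path());
        return 0;
}

Printing the sequence shows a single notifier invocation handing nonzero
"freed" results to a burst of waiters, which is the stated intent of the
heuristic; whether that approximation is acceptable is exactly what the
discussion below is about.
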
On Fri 15-02-19 19:42:41, Tetsuo Handa wrote:
> On 2019/02/14 1:56, Michal Hocko wrote:
> > On Thu 14-02-19 01:30:28, Tetsuo Handa wrote:
> > [...]
> >> >From 63c5c8ee7910fa9ef1c4067f1cb35a779e9d582c Mon Sep 17 00:00:00 2001
> >> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >> Date: Tue, 12 Feb 2019 20:12:35 +0900
> >> Subject: [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.
> >>
> >> When many hundreds of threads concurrently trigger a page fault and one
> >> of them invokes the global OOM killer, the owner of oom_lock is
> >> preempted for minutes, because the other threads deprive it of CPU time
> >> rather than waiting for it to make progress. We don't want to disable
> >> preemption while holding oom_lock, but we do want the owner of oom_lock
> >> to complete as soon as possible.
> >>
> >> Thus, this patch kills the dangerous assumption that sleeping for one
> >> jiffy is sufficient for allowing the owner of oom_lock to make progress.
> >
> > How does this prevent any _other_ kernel path, or even high-priority
> > userspace, from preempting the oom killer path? This was the essential
> > question the last time around and I do not see it covered here. I
>
> Since you already NACKed disabling preemption at
> https://marc.info/?i=20180322114002.GC23100@dhcp22.suse.cz , pointing out
> that "even high-priority userspace" can preempt the oom killer path is
> invalid.

Why?

> Since printk() is very slow, dump_header() can become slow, especially
> when dump_tasks() is called. Changing dump_tasks() to use rcu_lock_break()
> does not solve this problem either, because the problem is that once the
> current thread releases the CPU, it can remain preempted for minutes.

dump_tasks might be disabled by those who are concerned about the
overhead, but I do not see what your actual point here is.

> Allowing the OOM path to be preempted is what you prefer, isn't it?

No, it is just practicality. If you disable preemption you are
immediately going to fight soft lockups.

> Then, spending CPU time on something useful (what you call "any _other_
> kernel path") is an acceptable reason for delaying the OOM path. But
> wasting CPU time when allocating threads can do nothing but wait for the
> owner of oom_lock to complete the OOM path is not an acceptable reason
> for delaying the OOM path.
>
> Thus, there is nothing to cover for your "I do not see it covered here"
> response, except how to avoid "wasting CPU time when allocating threads
> can do nothing but wait for the owner of oom_lock to complete the OOM
> path".
>
> > strongly suspect that all these games with the locking are just
> > pointless tuning for an insane workload without fixing the underlying
> > issue.
>
> We could even change oom_lock to a local lock inside oom_kill_process(),
> because all threads in the same allocating context will select the same
> OOM victim (except in the oom_kill_allocating_task case), and the many
> threads already inside oom_kill_process() will refrain from selecting the
> next OOM victim. Although this approach wastes some CPU time on needlessly
> selecting the same OOM victim many times, it also solves this problem.
>
> It seems that you don't want to acknowledge "wasting CPU time when
> allocating threads can do nothing but wait for the owner of oom_lock to
> complete the OOM path" as the underlying issue. But we can't fix it
> without throttling the direct reclaim paths; that is why this problem has
> remained unfixed for many years.
And yet I do not remember a _single_ bug report for a real-life workload
that would be suffering from this. I am all for better throttling under OOM
conditions, but what you have been proposing are hacks at best, without any
real-world workload backing them. Please try to understand that the OOM path
in its current form is already quite complex, and adding more on top without
addressing a problem that real workloads care about is not really all that
attractive. I have no objections to simple and obviously correct changes in
this area, but playing with locking that has hard-to-evaluate side effects
is not something I will ack.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde0..c867513 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3618,7 +3618,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
          */
         if (!mutex_trylock(&oom_lock)) {
                 *did_some_progress = 1;
-                schedule_timeout_uninterruptible(1);
+                if (mutex_lock_killable(&oom_lock) == 0)
+                        mutex_unlock(&oom_lock);
+                else if (!tsk_is_oom_victim(current))
+                        schedule_timeout_uninterruptible(1);
                 return NULL;
         }
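
As a reading aid, here is a hedged userspace sketch of the control flow this
hunk introduces (mutex_lock_killable_stub(), current_killed and
current_is_oom_victim are invented stand-ins, not kernel APIs): a contender
that loses the trylock records progress, waits until the current oom_lock
holder finishes, and only falls back to the old one-jiffy sleep when the
killable wait is interrupted and the task is not itself an OOM victim:

/* may_oom_flow.c - illustrative sketch of the patched branch only. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

/* stand-ins for the kernel state the real code consults */
static bool current_killed;             /* fatal_signal_pending(current) */
static bool current_is_oom_victim;      /* tsk_is_oom_victim(current) */

/* returns 0 on success, nonzero if the "killable" wait was interrupted */
static int mutex_lock_killable_stub(pthread_mutex_t *lock)
{
        if (current_killed)
                return -1;              /* pretend a fatal signal woke us */
        return pthread_mutex_lock(lock);
}

/* analogue of the patched branch; NULL stands for "no page, please retry" */
static void *contender(int *did_some_progress)
{
        if (pthread_mutex_trylock(&oom_lock) != 0) {
                *did_some_progress = 1;
                if (mutex_lock_killable_stub(&oom_lock) == 0)
                        pthread_mutex_unlock(&oom_lock); /* holder finished */
                else if (!current_is_oom_victim)
                        usleep(1000);   /* old behaviour: sleep ~1 jiffy */
                return NULL;            /* caller retries the allocation */
        }
        /* ... we own oom_lock here: select and kill a victim ... */
        pthread_mutex_unlock(&oom_lock);
        return NULL;
}

int main(void)
{
        int progress = 0;

        pthread_mutex_lock(&oom_lock);  /* simulate a concurrent OOM kill */
        current_killed = true;          /* and a fatally signalled task */
        contender(&progress);
        pthread_mutex_unlock(&oom_lock);
        printf("did_some_progress=%d\n", progress);
        return 0;
}

The sketch exercises only the fallback path (interrupted wait, not an OOM
victim); in the common case the contender simply sleeps on oom_lock until
the holder drops it, which is the behaviour the changelog argues for.
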