| Message ID | 1529493638-6389-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive) |
|---|---|
| State | New, archived |
On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
> Sleeping with oom_lock held can cause AB-BA lockup bug because
> __alloc_pages_may_oom() does not wait for oom_lock. Since
> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
> with oom_lock held is currently an unavoidable problem.

Could you be more specific about the potential deadlock? Sleeping while
holding oom lock is certainly not nice but I do not see how that would
result in a deadlock assuming that the sleeping context doesn't sleep on
the memory allocation obviously.

> As a preparation for not to sleep with oom_lock held, this patch brings
> OOM notifier callbacks to outside of OOM killer, with two small behavior
> changes explained below.

Can we just eliminate this ugliness and remove it altogether? We do not
have that many notifiers. Is there anything fundamental that would
prevent us from moving them to shrinkers instead?
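For reference, the OOM notifier interface being debated here is the
generic notifier-chain API; a minimal, entirely hypothetical user would
look roughly like the sketch below. The callback body and the
example_release_cached_pages() helper are illustrative and not taken from
any in-tree driver; only register_oom_notifier() and the notifier_block
plumbing are the real API.

----------
#include <linux/notifier.h>
#include <linux/oom.h>

/*
 * Hypothetical last-resort reclaim: release driver-private pages just
 * before the OOM killer fires.  The void *parm argument is really an
 * unsigned long * page count that out_of_memory() checks; leaving it
 * untouched (as rcu_oom_notify() does) means "no progress to report".
 */
static int example_oom_notify(struct notifier_block *self,
			      unsigned long unused, void *parm)
{
	unsigned long *freed = parm;

	*freed += example_release_cached_pages();	/* assumed helper */
	return NOTIFY_OK;
}

static struct notifier_block example_oom_nb = {
	.notifier_call = example_oom_notify,
};

/* At init time: register_oom_notifier(&example_oom_nb); */
----------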
On 2018/06/20 20:55, Michal Hocko wrote:
> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>> Sleeping with oom_lock held can cause AB-BA lockup bug because
>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>> with oom_lock held is currently an unavoidable problem.
>
> Could you be more specific about the potential deadlock? Sleeping while
> holding oom lock is certainly not nice but I do not see how that would
> result in a deadlock assuming that the sleeping context doesn't sleep on
> the memory allocation obviously.

"A" is "owns oom_lock" and "B" is "owns CPU resources". It was demonstrated
in the "mm,oom: Don't call schedule_timeout_killable() with oom_lock held."
proposal. But since you don't accept preserving the short sleep, which is a
heuristic for reducing the possibility of the AB-BA lockup, the only
approach that remains is to wait for the owner of oom_lock (e.g. by
s/mutex_trylock/mutex_lock/ or whatever), which is free of heuristics and
free of the AB-BA lockup.

>> As a preparation for not to sleep with oom_lock held, this patch brings
>> OOM notifier callbacks to outside of OOM killer, with two small behavior
>> changes explained below.
>
> Can we just eliminate this ugliness and remove it altogether? We do not
> have that many notifiers. Is there anything fundamental that would
> prevent us from moving them to shrinkers instead?

In the long term, it would be possible. But not within this patch. For
example, I think that virtio_balloon wants to release memory only when we
have no choice but to OOM-kill. If virtio_balloon trivially releases
memory, it will increase the risk that the OOM killer on the host side
kills the entire guest.
On Wed 20-06-18 21:21:21, Tetsuo Handa wrote:
> On 2018/06/20 20:55, Michal Hocko wrote:
> > On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
> >> Sleeping with oom_lock held can cause AB-BA lockup bug because
> >> __alloc_pages_may_oom() does not wait for oom_lock. Since
> >> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
> >> with oom_lock held is currently an unavoidable problem.
> >
> > Could you be more specific about the potential deadlock? Sleeping while
> > holding oom lock is certainly not nice but I do not see how that would
> > result in a deadlock assuming that the sleeping context doesn't sleep on
> > the memory allocation obviously.
>
> "A" is "owns oom_lock" and "B" is "owns CPU resources". It was
> demonstrated in the "mm,oom: Don't call schedule_timeout_killable() with
> oom_lock held." proposal.

This is not a deadlock but merely a resource starvation AFAIU.

> But since you don't accept preserving the short sleep, which is a
> heuristic for reducing the possibility of the AB-BA lockup, the only
> approach that remains is to wait for the owner of oom_lock (e.g. by
> s/mutex_trylock/mutex_lock/ or whatever), which is free of heuristics and
> free of the AB-BA lockup.
>
> > Can we just eliminate this ugliness and remove it altogether? We do not
> > have that many notifiers. Is there anything fundamental that would
> > prevent us from moving them to shrinkers instead?
>
> In the long term, it would be possible. But not within this patch. For
> example, I think that virtio_balloon wants to release memory only when we
> have no choice but to OOM-kill. If virtio_balloon trivially releases
> memory, it will increase the risk that the OOM killer on the host side
> kills the entire guest.

I would _prefer_ to think long term here. The sleep inside the oom lock is
not something real workloads are seeing out there AFAICS. Adding quite some
code to address such a case doesn't justify the inclusion IMHO.
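For context, the mutex_trylock() that Tetsuo wants to turn into a sleeping
lock sits in __alloc_pages_may_oom(); the s/mutex_trylock/mutex_lock/ idea
amounts to roughly the following (a simplified sketch against the
4.18-era code, not a tested patch):

----------
	/*
	 * Current code: if somebody else holds oom_lock, pretend progress
	 * was made and retry after sleeping for only one jiffy.
	 */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * Proposed alternative: sleep until the lock owner is done, so a
	 * flood of allocating threads cannot starve the owner of CPU time
	 * (the "B owns CPU resources" half of the AB-BA scenario).
	 */
	mutex_lock(&oom_lock);
----------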
On Wed, 20 Jun 2018, Tetsuo Handa wrote:

> Sleeping with oom_lock held can cause AB-BA lockup bug because
> __alloc_pages_may_oom() does not wait for oom_lock. Since
> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
> with oom_lock held is currently an unavoidable problem.
>
> As a preparation for not to sleep with oom_lock held, this patch brings
> OOM notifier callbacks to outside of OOM killer, with two small behavior
> changes explained below.
>
> One is that this patch makes it impossible for SysRq-f and PF-OOM to
> reclaim via OOM notifier. But such change should be tolerable because
> "we unlikely try to use SysRq-f for reclaiming memory via OOM notifier
> callbacks" and "pagefault_out_of_memory() will be called when OOM killer
> selected current thread as an OOM victim after OOM notifier callbacks
> already failed to reclaim memory".
>
> The other is that this patch makes it possible to reclaim memory via OOM
> notifier after OOM killer is disabled (that is, suspend/hibernate is in
> progress). But such change should be safe because of
> pm_suspended_storage() check.

Makes sense in general and I don't think that getting around these two
caveats is problematic. We should be handling everything that can free
memory to preempt an oom kill in the page allocator and get rid of the
notion that this is "oom notification" when in reality it is intended as
the last form of reclaim prior to oom kill.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/oom.h |  1 +
>  mm/oom_kill.c       | 35 ++++++++++++++++++------
>  mm/page_alloc.c     | 76 +++++++++++++++++++++++++++++++----------------------
>  3 files changed, 73 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac11..085b033 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -101,6 +101,7 @@ extern unsigned long oom_badness(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask,
>  		unsigned long totalpages);
>
> +extern unsigned long try_oom_notifier(void);
>  extern bool out_of_memory(struct oom_control *oc);
>
>  extern void exit_oom_victim(void);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e7..2ff5db2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1010,6 +1010,33 @@ int unregister_oom_notifier(struct notifier_block *nb)
>  EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>
>  /**
> + * try_oom_notifier - Try to reclaim memory from OOM notifier list.
> + *
> + * Returns non-zero if notifier callbacks released something, zero otherwise.
> + */
> +unsigned long try_oom_notifier(void)

It certainly is tried, but based on its usage it would probably be better
to describe what is being returned (it's going to set *did_some_progress,
which is a page count).

That makes me think that "oom_notify_list" isn't very intuitive: it can
free memory as a last step prior to oom kill. OOM notify, to me, sounds
like its only notifying some callbacks about the condition. Maybe
oom_reclaim_list and then rename this to oom_reclaim_pages()?

> +{
> +	static DEFINE_MUTEX(oom_notifier_lock);
> +	unsigned long freed = 0;
> +
> +	/*
> +	 * Since OOM notifier callbacks must not depend on __GFP_DIRECT_RECLAIM
> +	 * && !__GFP_NORETRY memory allocation, waiting for mutex here is safe.
> +	 * If lockdep reports possible deadlock dependency, it will be a bug in
> +	 * OOM notifier callbacks.
> +	 *
> +	 * If SIGKILL is pending, it is likely that current thread was selected
> +	 * as an OOM victim. In that case, current thread should return as soon
> +	 * as possible using memory reserves.
> +	 */
> +	if (mutex_lock_killable(&oom_notifier_lock))
> +		return 0;
> +	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> +	mutex_unlock(&oom_notifier_lock);
> +	return freed;
> +}

If __blocking_notifier_call_chain() used down_read_killable(), could we
eliminate oom_notifier_lock?

> +
> +/**
>   * out_of_memory - kill the "best" process when we run out of memory
>   * @oc: pointer to struct oom_control
>   *
> @@ -1020,19 +1047,11 @@ int unregister_oom_notifier(struct notifier_block *nb)
>   */
>  bool out_of_memory(struct oom_control *oc)
>  {
> -	unsigned long freed = 0;
>  	enum oom_constraint constraint = CONSTRAINT_NONE;
>
>  	if (oom_killer_disabled)
>  		return false;
>
> -	if (!is_memcg_oom(oc)) {
> -		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> -		if (freed > 0)
> -			/* Got some memory back in the last second. */
> -			return true;
> -	}
> -
>  	/*
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it. The goal is to allow it to allocate so that it may
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1521100..c72ef1e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3447,10 +3447,50 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	return page;
>  }
>
> +static inline bool can_oomkill(gfp_t gfp_mask, unsigned int order,
> +			       const struct alloc_context *ac)

I'd suggest a more mm/page_alloc.c friendly name, something like
oom_kill_allowed().

> +{
> +	/* Coredumps can quickly deplete all memory reserves */
> +	if (current->flags & PF_DUMPCORE)
> +		return false;
> +	/* The OOM killer will not help higher order allocs */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		return false;
> +	/*
> +	 * We have already exhausted all our reclaim opportunities without any
> +	 * success so it is time to admit defeat. We will skip the OOM killer
> +	 * because it is very likely that the caller has a more reasonable
> +	 * fallback than shooting a random task.
> +	 */
> +	if (gfp_mask & __GFP_RETRY_MAYFAIL)
> +		return false;
> +	/* The OOM killer does not needlessly kill tasks for lowmem */
> +	if (ac->high_zoneidx < ZONE_NORMAL)
> +		return false;
> +	if (pm_suspended_storage())
> +		return false;
> +	/*
> +	 * XXX: GFP_NOFS allocations should rather fail than rely on
> +	 * other request to make a forward progress.
> +	 * We are in an unfortunate situation where out_of_memory cannot
> +	 * do much for this context but let's try it to at least get
> +	 * access to memory reserved if the current task is killed (see
> +	 * out_of_memory). Once filesystems are ready to handle allocation
> +	 * failures more gracefully we should just bail out here.
> +	 */
>
> +	/* The OOM killer may not free memory on a specific node */
> +	if (gfp_mask & __GFP_THISNODE)
> +		return false;
> +
> +	return true;
> +}
> +
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
>  {
> +	const bool oomkill = can_oomkill(gfp_mask, order, ac);
>  	struct oom_control oc = {
>  		.zonelist = ac->zonelist,
>  		.nodemask = ac->nodemask,
> @@ -3462,6 +3502,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>
>  	*did_some_progress = 0;
>
> +	/* Try to reclaim via OOM notifier callback. */
> +	if (oomkill)
> +		*did_some_progress = try_oom_notifier();
> +
>  	/*
>  	 * Acquire the oom lock. If that fails, somebody else is
>  	 * making progress for us.

	*did_some_progress = oom_kill_allowed ? oom_reclaim_pages() : 0;

This patch is certainly an improvement because it does the last
get_page_from_freelist() call after invoking the oom notifiers that can
free memory and we've otherwise pointlessly redirected it elsewhere.

> @@ -3485,37 +3529,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	if (page)
>  		goto out;
>
> -	/* Coredumps can quickly deplete all memory reserves */
> -	if (current->flags & PF_DUMPCORE)
> -		goto out;
> -	/* The OOM killer will not help higher order allocs */
> -	if (order > PAGE_ALLOC_COSTLY_ORDER)
> -		goto out;
> -	/*
> -	 * We have already exhausted all our reclaim opportunities without any
> -	 * success so it is time to admit defeat. We will skip the OOM killer
> -	 * because it is very likely that the caller has a more reasonable
> -	 * fallback than shooting a random task.
> -	 */
> -	if (gfp_mask & __GFP_RETRY_MAYFAIL)
> -		goto out;
> -	/* The OOM killer does not needlessly kill tasks for lowmem */
> -	if (ac->high_zoneidx < ZONE_NORMAL)
> -		goto out;
> -	if (pm_suspended_storage())
> -		goto out;
> -	/*
> -	 * XXX: GFP_NOFS allocations should rather fail than rely on
> -	 * other request to make a forward progress.
> -	 * We are in an unfortunate situation where out_of_memory cannot
> -	 * do much for this context but let's try it to at least get
> -	 * access to memory reserved if the current task is killed (see
> -	 * out_of_memory). Once filesystems are ready to handle allocation
> -	 * failures more gracefully we should just bail out here.
> -	 */
> -
> -	/* The OOM killer may not free memory on a specific node */
> -	if (gfp_mask & __GFP_THISNODE)
> +	if (!oomkill)
>  		goto out;
>
>  	/* Exhausted what can be done so it's blame time */
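Taken together, David's two renaming suggestions would make the notifier
entry point read something like the following sketch. The body is the
patch's try_oom_notifier() with the names swapped; oom_reclaim_list and
oom_reclaim_pages() do not exist in-tree:

----------
/**
 * oom_reclaim_pages - Last-resort reclaim prior to an OOM kill.
 *
 * Returns the number of pages the registered callbacks claim to have
 * released (the value ends up in *did_some_progress).
 */
unsigned long oom_reclaim_pages(void)
{
	static DEFINE_MUTEX(oom_reclaim_lock);
	unsigned long freed = 0;

	if (mutex_lock_killable(&oom_reclaim_lock))
		return 0;
	blocking_notifier_call_chain(&oom_reclaim_list, 0, &freed);
	mutex_unlock(&oom_reclaim_lock);
	return freed;
}
----------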
On Wed 20-06-18 15:36:45, David Rientjes wrote:
[...]
> That makes me think that "oom_notify_list" isn't very intuitive: it can
> free memory as a last step prior to oom kill. OOM notify, to me, sounds
> like its only notifying some callbacks about the condition. Maybe
> oom_reclaim_list and then rename this to oom_reclaim_pages()?

Yes agreed and that is the reason I keep saying we want to get rid of this
yet-another-reclaim mechanism. We already have shrinkers which are the main
source of non-lru pages reclaim. Why do we even need oom_reclaim_pages?
What is fundamentally different here? Sure those pages should be reclaimed
as the last resort but we already do have priority for slab shrinking so we
know that the system is struggling when reaching the lowest priority. Isn't
that enough to express the need for current oom notifier implementations?
On 2018/06/21 7:36, David Rientjes wrote:
>> @@ -1010,6 +1010,33 @@ int unregister_oom_notifier(struct notifier_block *nb)
>>  EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>>
>>  /**
>> + * try_oom_notifier - Try to reclaim memory from OOM notifier list.
>> + *
>> + * Returns non-zero if notifier callbacks released something, zero otherwise.
>> + */
>> +unsigned long try_oom_notifier(void)
>
> It certainly is tried, but based on its usage it would probably be better
> to describe what is being returned (it's going to set *did_some_progress,
> which is a page count).

Well, it depends on what the callbacks are doing. Currently, we have 5
users.

  arch/powerpc/platforms/pseries/cmm.c
  arch/s390/mm/cmm.c
  drivers/gpu/drm/i915/i915_gem_shrinker.c
  drivers/virtio/virtio_balloon.c
  kernel/rcu/tree_plugin.h

Speaking of rcu_oom_notify() in kernel/rcu/tree_plugin.h, we can't tell
whether the callback helped releasing memory, for it does not update the
"freed" argument.

>> +{
>> +	static DEFINE_MUTEX(oom_notifier_lock);
>> +	unsigned long freed = 0;
>> +
>> [...]
>> +	if (mutex_lock_killable(&oom_notifier_lock))
>> +		return 0;
>> +	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>> +	mutex_unlock(&oom_notifier_lock);
>> +	return freed;
>> +}
>
> If __blocking_notifier_call_chain() used down_read_killable(), could we
> eliminate oom_notifier_lock?

I don't think we can eliminate it now, for it is a serialization lock
(while trying to respond to SIGKILL as soon as possible) which is currently
achieved by mutex_trylock(&oom_lock).

(1) rcu_oom_notify() in kernel/rcu/tree_plugin.h is not prepared for being
called concurrently.

----------
static int rcu_oom_notify(struct notifier_block *self,
			  unsigned long notused, void *nfreed)
{
	int cpu;

	/* Wait for callbacks from earlier instance to complete. */
	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0); // <= Multiple threads can pass this line at the same time.
	smp_mb(); /* Ensure callback reuse happens after callback invocation. */

	/*
	 * Prevent premature wakeup: ensure that all increments happen
	 * before there is a chance of the counter reaching zero.
	 */
	atomic_set(&oom_callback_count, 1); // <= Multiple threads can execute this line at the same time.

	for_each_online_cpu(cpu) {
		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
		cond_resched_tasks_rcu_qs();
	}

	/* Unconditionally decrement: no need to wake ourselves up. */
	atomic_dec(&oom_callback_count); // <= Multiple threads can execute this line at the same time, making oom_callback_count < 0 ?

	return NOTIFY_OK;
}
----------

The counter inconsistency problem could be fixed by

  - atomic_set(&oom_callback_count, 1);
  + atomic_inc(&oom_callback_count);

but who benefits if rcu_oom_notify() becomes safe to call concurrently?
We want to wait for the callback to complete before proceeding to the OOM
killer. I think that we should save CPU resource by serializing concurrent
callers.

(2) i915_gem_shrinker_oom() in drivers/gpu/drm/i915/i915_gem_shrinker.c
depends on mutex_trylock() from shrinker_lock() from i915_gem_shrink() from
i915_gem_shrink_all() to return 1 (i.e. succeed) before need_resched()
becomes true, in order to avoid returning without reclaiming memory.

> This patch is certainly an improvement because it does the last
> get_page_from_freelist() call after invoking the oom notifiers that can
> free memory and we've otherwise pointlessly redirected it elsewhere.

Thanks, but this patch might break the subtle balance which is currently
achieved by mutex_trylock(&oom_lock) serialization/exclusion.

(3) virtballoon_oom_notify() in drivers/virtio/virtio_balloon.c by default
tries to release 256 pages. Since this value is configurable, one might set
1048576 pages. If virtballoon_oom_notify() is concurrently called by many
threads, it might needlessly deflate the memory balloon.

We might want to remember and reuse the last result among serialized
callers (a feedback mechanism) like

----------
{
	static DEFINE_MUTEX(oom_notifier_lock);
	static unsigned long last_freed;
	unsigned long freed = 0;

	if (mutex_trylock(&oom_notifier_lock)) {
		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
		last_freed = freed;
	} else {
		mutex_lock(&oom_notifier_lock);
		freed = last_freed;
	}
	mutex_unlock(&oom_notifier_lock);
	return freed;
}
----------

or

----------
{
	static DEFINE_MUTEX(oom_notifier_lock);
	static unsigned long last_freed;
	unsigned long freed = 0;

	if (mutex_lock_killable(&oom_notifier_lock)) {
		freed = last_freed;
		last_freed >>= 1;
		return freed;
	} else if (last_freed) {
		freed = last_freed;
		last_freed >>= 1;
	} else {
		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
		last_freed = freed;
	}
	mutex_unlock(&oom_notifier_lock);
	return freed;
}
----------

Without a feedback mechanism, mutex_lock_killable(&oom_notifier_lock)
serialization could still needlessly deflate the memory balloon compared to
mutex_trylock(&oom_lock) serialization/exclusion. Maybe
virtballoon_oom_notify() (and the two CMM users) would implement the
feedback mechanism themselves, by examining the watermark from their OOM
notifier hooks.

On 2018/06/21 16:31, Michal Hocko wrote:
> On Wed 20-06-18 15:36:45, David Rientjes wrote:
> [...]
>> That makes me think that "oom_notify_list" isn't very intuitive: it can
>> free memory as a last step prior to oom kill. OOM notify, to me, sounds
>> like its only notifying some callbacks about the condition. Maybe
>> oom_reclaim_list and then rename this to oom_reclaim_pages()?
>
> Yes agreed and that is the reason I keep saying we want to get rid of
> this yet-another-reclaim mechanism. We already have shrinkers which are
> the main source of non-lru pages reclaim. Why do we even need
> oom_reclaim_pages? What is fundamentally different here? Sure those pages
> should be reclaimed as the last resort but we already do have priority
> for slab shrinking so we know that the system is struggling when reaching
> the lowest priority. Isn't that enough to express the need for current
> oom notifier implementations?

Even if we update OOM notifier users to use shrinker hooks, they will need
a subtle balance which is currently achieved by mutex_trylock(&oom_lock).

Removing the OOM notifier is not doable right now. It is not suitable as a
regression fix for commit 27ae357fa82be5ab ("mm, oom: fix concurrent
munlock and oom reaper unmap, v3"). What we could afford for this
regression is https://patchwork.kernel.org/patch/9842889/ which is exactly
what you suggested in a thread at
https://www.spinics.net/lists/linux-mm/msg117896.html .
On Thu 21-06-18 20:27:41, Tetsuo Handa wrote:
[....]
> On 2018/06/21 16:31, Michal Hocko wrote:
> > Yes agreed and that is the reason I keep saying we want to get rid of
> > this yet-another-reclaim mechanism. [...]
>
> Even if we update OOM notifier users to use shrinker hooks, they will
> need a subtle balance which is currently achieved by
> mutex_trylock(&oom_lock).

No they do not. They do not want to rely on unrelated locks to work
properly. That is completely wrong design. We do want locks to protect
specific data rather than code.

> Removing the OOM notifier is not doable right now.

I haven't heard any technical argument why.

> It is not suitable as a regression fix for commit 27ae357fa82be5ab ("mm,
> oom: fix concurrent munlock and oom reaper unmap, v3").

What is the regression?

> What we could afford for this regression is
> https://patchwork.kernel.org/patch/9842889/ which is exactly what you
> suggested in a thread at
> https://www.spinics.net/lists/linux-mm/msg117896.html .
On 06/20/2018 01:55 PM, Michal Hocko wrote:
> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>> Sleeping with oom_lock held can cause AB-BA lockup bug because
>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>> with oom_lock held is currently an unavoidable problem.
>
> Could you be more specific about the potential deadlock? Sleeping while
> holding oom lock is certainly not nice but I do not see how that would
> result in a deadlock assuming that the sleeping context doesn't sleep on
> the memory allocation obviously.

It is a mutex; you are supposed to be able to sleep while holding it. It's
even exported.

>> As a preparation for not to sleep with oom_lock held, this patch brings
>> OOM notifier callbacks to outside of OOM killer, with two small behavior
>> changes explained below.
>
> Can we just eliminate this ugliness and remove it altogether? We do not
> have that many notifiers. Is there anything fundamental that would
> prevent us from moving them to shrinkers instead?

@Hocko Do you remember the lowmemorykiller from android? Some things might
not be the right thing for shrinkers.
On Mon 25-06-18 15:03:40, peter enderborg wrote:
> On 06/20/2018 01:55 PM, Michal Hocko wrote:
> > Could you be more specific about the potential deadlock? [...]
>
> It is a mutex; you are supposed to be able to sleep while holding it.
> It's even exported.

What do you mean? oom_lock is certainly not exported for general use. It is
not local to oom_killer.c just because it is needed in other _mm_ code.

> > Can we just eliminate this ugliness and remove it altogether? We do not
> > have that many notifiers. Is there anything fundamental that would
> > prevent us from moving them to shrinkers instead?
>
> @Hocko Do you remember the lowmemorykiller from android? Some things
> might not be the right thing for shrinkers.

Just that lmk did it wrong doesn't mean others have to follow.
On 06/25/2018 03:07 PM, Michal Hocko wrote:
> On Mon 25-06-18 15:03:40, peter enderborg wrote:
>> It is a mutex; you are supposed to be able to sleep while holding it.
>> It's even exported.
>
> What do you mean? oom_lock is certainly not exported for general use. It
> is not local to oom_killer.c just because it is needed in other _mm_
> code.

It is in the oom.h file, include/linux/oom.h. If it is that sensitive, it
should be in mm/ with a documented note about the special rules. It is only
used in drivers/tty/sysrq.c, and that could be replaced by a helper
function in mm that does the oom stuff.

>> @Hocko Do you remember the lowmemorykiller from android? Some things
>> might not be the right thing for shrinkers.
>
> Just that lmk did it wrong doesn't mean others have to follow.

If all you have is a hammer, everything looks like a nail. (I don't argue
that it was right.) But if you don't have a way to interact with the memory
system, we will get attempts like lmk. For this task, OOM notifiers and
vmpressure are better than shrinkers.
On Mon 25-06-18 16:04:04, peter enderborg wrote:
> On 06/25/2018 03:07 PM, Michal Hocko wrote:
> > What do you mean? oom_lock is certainly not exported for general use.
> > It is not local to oom_killer.c just because it is needed in other
> > _mm_ code.
>
> It is in the oom.h file, include/linux/oom.h. If it is that sensitive, it
> should be in mm/ with a documented note about the special rules. It is
> only used in drivers/tty/sysrq.c, and that could be replaced by a helper
> function in mm that does the oom stuff.

Well, there are many things defined in kernel header files and not meant
for wider use. Using random locks is generally discouraged I would say,
unless you are sure you know what you are doing. We could do some more work
to hide internals for sure, though.

> > Just that lmk did it wrong doesn't mean others have to follow.
>
> If all you have is a hammer, everything looks like a nail. (I don't argue
> that it was right.) But if you don't have a way to interact with the
> memory system, we will get attempts like lmk. For this task, OOM
> notifiers and vmpressure are better than shrinkers.

A lack of a feature should be a trigger for a discussion rather than a
quick hack that seems to work for a particular usecase, lives out of tree,
then gets to staging in the hope it will fix itself. Seriously, kernel
development is not nail hammering.
On Thu, Jun 21, 2018 at 08:27:41PM +0900, Tetsuo Handa wrote:
> On 2018/06/21 7:36, David Rientjes wrote:
> > It certainly is tried, but based on its usage it would probably be
> > better to describe what is being returned (it's going to set
> > *did_some_progress, which is a page count).
>
> Well, it depends on what the callbacks are doing. Currently, we have 5
> users.
>
>   arch/powerpc/platforms/pseries/cmm.c
>   arch/s390/mm/cmm.c
>   drivers/gpu/drm/i915/i915_gem_shrinker.c
>   drivers/virtio/virtio_balloon.c
>   kernel/rcu/tree_plugin.h
>
> Speaking of rcu_oom_notify() in kernel/rcu/tree_plugin.h, we can't tell
> whether the callback helped releasing memory, for it does not update the
> "freed" argument.

It does not immediately release memory, but instead takes steps to
encourage memory to be released more quickly.

> > If __blocking_notifier_call_chain() used down_read_killable(), could we
> > eliminate oom_notifier_lock?
>
> I don't think we can eliminate it now, for it is a serialization lock
> (while trying to respond to SIGKILL as soon as possible) which is
> currently achieved by mutex_trylock(&oom_lock).
>
> (1) rcu_oom_notify() in kernel/rcu/tree_plugin.h is not prepared for
> being called concurrently.
>
> [...]
>
> The counter inconsistency problem could be fixed by
>
>   - atomic_set(&oom_callback_count, 1);
>   + atomic_inc(&oom_callback_count);
>
> but who benefits if rcu_oom_notify() becomes safe to call concurrently?
> We want to wait for the callback to complete before proceeding to the OOM
> killer. I think that we should save CPU resource by serializing
> concurrent callers.

There are a lot of ways it could be made concurrency safe. If you need me
to do this, please do let me know.

That said, the way it is now written, if you invoke rcu_oom_notify() twice
in a row, the second invocation will wait until the memory from the first
invocation is freed. What do you want me to do if you invoke me
concurrently?

1. One invocation "wins", waits for the earlier callbacks to complete,
   then encourages any subsequent callbacks to be processed more quickly.
   The other invocations return immediately without doing anything.

2. The invocations serialize, with each invocation waiting for the
   callbacks from the previous invocation (in mutex_lock() order or some
   such), and then starting a new round.

3. Something else?

							Thanx, Paul

> [...]
On 2018/06/27 2:03, Paul E. McKenney wrote:
> There are a lot of ways it could be made concurrency safe. If you need
> me to do this, please do let me know.
>
> That said, the way it is now written, if you invoke rcu_oom_notify()
> twice in a row, the second invocation will wait until the memory from
> the first invocation is freed. What do you want me to do if you invoke
> me concurrently?
>
> 1. One invocation "wins", waits for the earlier callbacks to complete,
>    then encourages any subsequent callbacks to be processed more quickly.
>    The other invocations return immediately without doing anything.
>
> 2. The invocations serialize, with each invocation waiting for the
>    callbacks from the previous invocation (in mutex_lock() order or some
>    such), and then starting a new round.
>
> 3. Something else?
>
> 							Thanx, Paul

As far as I can see,

  - atomic_set(&oom_callback_count, 1);
  + atomic_inc(&oom_callback_count);

should be sufficient.
On Wed, Jun 27, 2018 at 05:10:48AM +0900, Tetsuo Handa wrote:
> On 2018/06/27 2:03, Paul E. McKenney wrote:
> > There are a lot of ways it could be made concurrency safe. If you need
> > me to do this, please do let me know.
> > [...]
>
> As far as I can see,
>
>   - atomic_set(&oom_callback_count, 1);
>   + atomic_inc(&oom_callback_count);
>
> should be sufficient.

I don't see how that helps. For example, suppose that two tasks invoked
rcu_oom_notify() at about the same time. Then they could both see
oom_callback_count equal to zero, both atomically increment
oom_callback_count, then both do the IPI invoking rcu_oom_notify_cpu() on
each online CPU.

So far, so good. But rcu_oom_notify_cpu() enqueues a per-CPU RCU callback,
and enqueuing the same callback twice in quick succession would fatally
tangle RCU's callback lists.

What am I missing here?

							Thanx, Paul
On Tue 26-06-18 10:03:45, Paul E. McKenney wrote:
[...]
> 3. Something else?
How hard would it be to use a different API than oom notifiers? E.g. a
shrinker which just kicks all the pending callbacks if the reclaim
priority reaches low values (e.g. 0)?
On 2018/06/27 8:50, Paul E. McKenney wrote:
> On Wed, Jun 27, 2018 at 05:10:48AM +0900, Tetsuo Handa wrote:
>> As far as I can see,
>>
>>   - atomic_set(&oom_callback_count, 1);
>>   + atomic_inc(&oom_callback_count);
>>
>> should be sufficient.
>
> I don't see how that helps. For example, suppose that two tasks
> invoked rcu_oom_notify() at about the same time. Then they could
> both see oom_callback_count equal to zero, both atomically increment
> oom_callback_count, then both do the IPI invoking rcu_oom_notify_cpu()
> on each online CPU.
>
> So far, so good. But rcu_oom_notify_cpu() enqueues a per-CPU RCU
> callback, and enqueuing the same callback twice in quick succession
> would fatally tangle RCU's callback lists.
>
> What am I missing here?

You are pointing out that "number of times rsp->call() is called" > "number
of times rcu_oom_callback() is called" can happen if concurrently called,
aren't you? Then, you are not missing anything. You will need to use
something equivalent to oom_lock even if you can convert rcu_oom_notify()
to use shrinkers.
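The i915 pattern Tetsuo refers to, expressed as a generic shrinker, would
look roughly like the sketch below; the lock and the example_reclaim()
helper are illustrative, while SHRINK_STOP and the scan_objects()
signature are the real shrinker API:

----------
static DEFINE_MUTEX(example_reclaim_lock);

static unsigned long example_scan_objects(struct shrinker *s,
					  struct shrink_control *sc)
{
	unsigned long freed;

	/*
	 * Serialize with other reclaimers; bail out instead of sleeping,
	 * the way i915's shrinker_lock() uses mutex_trylock().
	 */
	if (!mutex_trylock(&example_reclaim_lock))
		return SHRINK_STOP;
	freed = example_reclaim(sc->nr_to_scan);	/* assumed helper */
	mutex_unlock(&example_reclaim_lock);
	return freed;
}
----------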
On Wed, Jun 27, 2018 at 07:52:23PM +0900, Tetsuo Handa wrote:
> On 2018/06/27 8:50, Paul E. McKenney wrote:
> > [...]
> > What am I missing here?
>
> You are pointing out that "number of times rsp->call() is called" >
> "number of times rcu_oom_callback() is called" can happen if concurrently
> called, aren't you?

Yes. Reusing an rcu_head before invocation of the earlier use is very bad
indeed. ;-)

> Then, you are not missing anything. You will need to use something
> equivalent to oom_lock even if you can convert rcu_oom_notify() to use
> shrinkers.

What should I look at to work out whether it makes sense to convert
rcu_oom_notify() to shrinkers, and if so, how to go about it?

Or are you simply asking me to serialize rcu_oom_notify()? (Which is of
course not difficult, so please just let me know.)

							Thanx, Paul
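The serialization Paul offers could be as small as wrapping the existing
body of rcu_oom_notify() in a mutex; a minimal sketch of option 2, reusing
the symbols from the listing Tetsuo quoted earlier (not a tested patch):

----------
static DEFINE_MUTEX(rcu_oom_mutex);

static int rcu_oom_notify(struct notifier_block *self,
			  unsigned long notused, void *nfreed)
{
	int cpu;

	/*
	 * Serialize invocations: each caller in turn waits for the
	 * previous round of kicked callbacks to drain, then starts a new
	 * round.  This also prevents the double enqueue of the per-CPU
	 * rcu_head described above.
	 */
	mutex_lock(&rcu_oom_mutex);
	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
	smp_mb(); /* Ensure callback reuse happens after callback invocation. */
	atomic_set(&oom_callback_count, 1);
	for_each_online_cpu(cpu) {
		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
		cond_resched_tasks_rcu_qs();
	}
	atomic_dec(&oom_callback_count);
	mutex_unlock(&rcu_oom_mutex);
	return NOTIFY_OK;
}
----------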
On Wed, Jun 27, 2018 at 09:22:07AM +0200, Michal Hocko wrote:
> On Tue 26-06-18 10:03:45, Paul E. McKenney wrote:
> [...]
> > 3. Something else?
>
> How hard would it be to use a different API than oom notifiers? E.g. a
> shrinker which just kicks all the pending callbacks if the reclaim
> priority reaches low values (e.g. 0)?

Beats me. What is a shrinker? ;-)

More seriously, could you please point me at an exemplary shrinker use case
so I can see what is involved?

							Thanx, Paul
On Wed 27-06-18 07:31:25, Paul E. McKenney wrote:
> On Wed, Jun 27, 2018 at 09:22:07AM +0200, Michal Hocko wrote:
> > How hard would it be to use a different API than oom notifiers? E.g. a
> > shrinker which just kicks all the pending callbacks if the reclaim
> > priority reaches low values (e.g. 0)?
>
> Beats me. What is a shrinker? ;-)

This is a generic mechanism to reclaim memory that is not on standard LRU
lists. Lwn.net surely has some nice coverage (e.g.
https://lwn.net/Articles/548092/).

> More seriously, could you please point me at an exemplary shrinker
> use case so I can see what is involved?

Well, I am not really sure what the objective of the oom notifier is, to
point you to the right direction. IIUC you just want to kick callbacks to
be handled sooner under a heavy memory pressure, right? How is that
achieved? Kick a worker?
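A minimal register_shrinker() user, for the exemplary use case Paul asks
about, looks roughly like this. The count_objects()/scan_objects() hooks,
DEFAULT_SEEKS, and register_shrinker() are the real (2018-era) API, while
the two example_* helpers are assumptions:

----------
#include <linux/shrinker.h>

static unsigned long example_count_objects(struct shrinker *s,
					   struct shrink_control *sc)
{
	/*
	 * How many objects could be freed?  Returning 0 means "nothing
	 * to do" and scan_objects() will not be called.
	 */
	return example_nr_cached_objects();	/* assumed helper */
}

static unsigned long example_scan_objects(struct shrinker *s,
					  struct shrink_control *sc)
{
	/* Free up to sc->nr_to_scan objects; report how many were freed. */
	return example_free_objects(sc->nr_to_scan);	/* assumed helper */
}

static struct shrinker example_shrinker = {
	.count_objects	= example_count_objects,
	.scan_objects	= example_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};

/* At init time: register_shrinker(&example_shrinker); */
----------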
On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> On Wed 27-06-18 07:31:25, Paul E. McKenney wrote:
> > Beats me. What is a shrinker? ;-)
>
> This is a generic mechanism to reclaim memory that is not on standard LRU
> lists. Lwn.net surely has some nice coverage (e.g.
> https://lwn.net/Articles/548092/).

"In addition, there is little agreement over what a call to a shrinker
really means or how the called subsystem should respond." ;-)

Is this set up using register_shrinker() in mm/vmscan.c? I am guessing that
the many mentions of shrinker in DRM are irrelevant.

If my guess is correct, the API seems a poor fit for RCU. I can produce an
approximate number of RCU callbacks for ->count_objects(), but a given
callback might free a lot of memory or none at all. Plus, to actually have
->scan_objects() free them before returning, I would need to use something
like rcu_barrier(), which might involve longer delays than desired.

Or am I missing something here?

> > More seriously, could you please point me at an exemplary shrinker
> > use case so I can see what is involved?
>
> Well, I am not really sure what the objective of the oom notifier is, to
> point you to the right direction. IIUC you just want to kick callbacks
> to be handled sooner under a heavy memory pressure, right? How is that
> achieved? Kick a worker?

That is achieved by enqueuing a non-lazy callback on each CPU's callback
list, but only for those CPUs having non-empty lists. This causes CPUs with
lists containing only lazy callbacks to be more aggressive; in particular,
it prevents such CPUs from hanging out idle for seconds at a time while
they have callbacks on their lists.

The enqueuing happens via an IPI to the CPU in question.

							Thanx, Paul
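Spelled out, the poor fit Paul describes would look something like the
hypothetical sketch below. There is no exported API for counting pending
RCU callbacks (the count helper here is invented), and rcu_barrier() can
block far longer than the reclaim path tolerates:

----------
static unsigned long rcu_shrink_count(struct shrinker *s,
				      struct shrink_control *sc)
{
	/*
	 * Hypothetical: an approximate number of pending callbacks.
	 * A given callback might free a lot of memory or none at all,
	 * so this number says little about reclaimable pages.
	 */
	return approximate_pending_rcu_callbacks();	/* invented */
}

static unsigned long rcu_shrink_scan(struct shrinker *s,
				     struct shrink_control *sc)
{
	/*
	 * Guaranteeing the callbacks have run before returning would
	 * need something like rcu_barrier(), which may sleep for far
	 * longer than direct reclaim expects.
	 */
	rcu_barrier();
	return SHRINK_STOP;
}
----------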
On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> [...]
> "In addition, there is little agreement over what a call to a shrinker
> really means or how the called subsystem should respond." ;-)
>
> Is this set up using register_shrinker() in mm/vmscan.c? I am guessing

Yes, exactly. You are supposed to implement the two methods in struct
shrinker.

> that the many mentions of shrinker in DRM are irrelevant.
>
> If my guess is correct, the API seems a poor fit for RCU. I can
> produce an approximate number of RCU callbacks for ->count_objects(),
> but a given callback might free a lot of memory or none at all. Plus,
> to actually have ->scan_objects() free them before returning, I would
> need to use something like rcu_barrier(), which might involve longer
> delays than desired.

Well, I am not yet sure how good a fit this is because I still do not
understand the underlying problem your notifier is trying to solve. So I
will get back to this once that is settled.

> Or am I missing something here?
>
> > Well, I am not really sure what the objective of the oom notifier is,
> > to point you to the right direction. IIUC you just want to kick
> > callbacks to be handled sooner under a heavy memory pressure, right?
> > How is that achieved? Kick a worker?
>
> That is achieved by enqueuing a non-lazy callback on each CPU's callback
> list, but only for those CPUs having non-empty lists. This causes CPUs
> with lists containing only lazy callbacks to be more aggressive; in
> particular, it prevents such CPUs from hanging out idle for seconds at a
> time while they have callbacks on their lists.
>
> The enqueuing happens via an IPI to the CPU in question.

I am afraid this is too low level for me to understand what is going on
here. What are lazy callbacks and why do they need any specific action when
we are getting close to OOM? I mean, I do understand that we might have
many callers of call_rcu and free memory lazily. But there is quite a long
way from when we start the reclaim until we reach the OOM killer path. So
why don't those callbacks get called during that time period? How are they
triggered when we are not hitting the OOM path? They surely cannot sit
there for ever, right? Can we trigger them sooner? Maybe the shrinker is
not the best fit but we have a retry feedback loop in the page allocator;
maybe we can kick this processing from there.
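The "retry feedback loop" Michal mentions is the should_reclaim_retry() /
no_progress_loops logic in __alloc_pages_slowpath(); a kick from there
could take roughly this shape (a sketch only; the hook name is
hypothetical and only the surrounding symbols are real allocator
internals):

----------
	/*
	 * In the __alloc_pages_slowpath() retry loop, after reclaim has
	 * repeatedly failed but before falling through to the OOM path:
	 */
	if (no_progress_loops > MAX_RECLAIM_RETRIES / 2)
		kick_async_memory_releasers();	/* hypothetical hook */
----------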
On Fri, Jun 29, 2018 at 11:04:19AM +0200, Michal Hocko wrote:
> On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> [...]
> > That is achieved by enqueuing a non-lazy callback on each CPU's
> > callback list, but only for those CPUs having non-empty lists. [...]
> > The enqueuing happens via an IPI to the CPU in question.
>
> I am afraid this is too low level for me to understand what is going on
> here. What are lazy callbacks and why do they need any specific action
> when we are getting close to OOM? I mean, I do understand that we might
> have many callers of call_rcu and free memory lazily. But there is quite
> a long way from when we start the reclaim until we reach the OOM killer
> path. So why don't those callbacks get called during that time period?
> How are they triggered when we are not hitting the OOM path? They surely
> cannot sit there for ever, right? Can we trigger them sooner? Maybe the
> shrinker is not the best fit but we have a retry feedback loop in the
> page allocator; maybe we can kick this processing from there.

The effect of RCU's current OOM code is to speed up callback invocation by
at most a few seconds (assuming no stalled CPUs, in which case it is not
possible to speed up callback invocation).

Given that, I should just remove RCU's OOM code entirely?

							Thanx, Paul
On Fri 29-06-18 05:52:18, Paul E. McKenney wrote:
> On Fri, Jun 29, 2018 at 11:04:19AM +0200, Michal Hocko wrote:
> [...]
> > Can we trigger them sooner? Maybe the shrinker is not the best fit but
> > we have a retry feedback loop in the page allocator; maybe we can kick
> > this processing from there.
>
> The effect of RCU's current OOM code is to speed up callback invocation
> by at most a few seconds (assuming no stalled CPUs, in which case
> it is not possible to speed up callback invocation).
>
> Given that, I should just remove RCU's OOM code entirely?

Yeah, it seems so. I do not see how this would really help much. If we
really need some way to kick callbacks then we should do so much earlier in
the reclaim process, e.g. when we start struggling to reclaim any memory.

I am curious. Has the notifier been motivated by a real world use case, or
was it a "nice thing to do"?
On 2018/06/29 21:52, Paul E. McKenney wrote:
> The effect of RCU's current OOM code is to speed up callback invocation
> by at most a few seconds (assuming no stalled CPUs, in which case it is
> not possible to speed up callback invocation).
>
> Given that, should I just remove RCU's OOM code entirely?

out_of_memory() will start selecting an OOM victim without calling
get_page_from_freelist(), because rcu_oom_notify() does not store a
non-zero value in the "freed" argument.

I think that rcu_oom_notify() needs to wait for completion of the callback
invocations (possibly with a timeout in case there are stalling CPUs) and
store a non-zero value in "freed" if the pending callbacks did release
memory. However, what is difficult to tell is whether invocation of the
pending callbacks did release memory. The lack of a last-second
get_page_from_freelist() call after
blocking_notifier_call_chain(&oom_notify_list, 0, &freed) forces
rcu_oom_notify() to report an appropriate value (i.e. zero or non-zero)
via "freed".

We have tried moving the really-last-second get_page_from_freelist() call
to inside out_of_memory(), after
blocking_notifier_call_chain(&oom_notify_list, 0, &freed). But that
proposal was not accepted...

We could move blocking_notifier_call_chain(&oom_notify_list, 0, &freed) to
before the last-second get_page_from_freelist() call (and this is what this
patch is trying to do), which would allow rcu_oom_notify() to always return
0... or update rcu_oom_notify() to use the shrinker API...
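To make the "freed" plumbing above concrete, here is a sketch of what an
oom_notify_list callback looks like. The demo_* names are invented, but
register_oom_notifier() and the unsigned long out-parameter match
mm/oom_kill.c; Tetsuo's point is that rcu_oom_notify() currently always
leaves that value at zero.

#include <linux/oom.h>
#include <linux/notifier.h>

/* Hypothetical reclaim hook; stands in for the subsystem's real work. */
static unsigned long demo_drop_caches(void)
{
	return 0;	/* number of pages actually given back */
}

static int demo_oom_notify(struct notifier_block *self,
			   unsigned long unused, void *parm)
{
	unsigned long *freed = parm;

	/*
	 * "freed" accumulates across all chained callbacks; out_of_memory()
	 * retries the allocation instead of selecting a victim only when
	 * the total ends up non-zero.
	 */
	*freed += demo_drop_caches();
	return NOTIFY_OK;
}

static struct notifier_block demo_oom_nb = {
	.notifier_call = demo_oom_notify,
};

/* register_oom_notifier(&demo_oom_nb) adds this to oom_notify_list. */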
On Fri, Jun 29, 2018 at 03:26:38PM +0200, Michal Hocko wrote:
> On Fri 29-06-18 05:52:18, Paul E. McKenney wrote:
> > On Fri, Jun 29, 2018 at 11:04:19AM +0200, Michal Hocko wrote:
> > > On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> > > > On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> [...]
> > > > > Well, I am not really sure what the objective of the oom notifier is,
> > > > > to point you in the right direction. IIUC you just want to kick
> > > > > callbacks to be handled sooner under heavy memory pressure, right?
> > > > > How is that achieved? Kick a worker?
> > > >
> > > > That is achieved by enqueuing a non-lazy callback on each CPU's callback
> > > > list, but only for those CPUs having non-empty lists. This causes
> > > > CPUs with lists containing only lazy callbacks to be more aggressive;
> > > > in particular, it prevents such CPUs from hanging out idle for seconds
> > > > at a time while they have callbacks on their lists.
> > > >
> > > > The enqueuing happens via an IPI to the CPU in question.
> > >
> > > I am afraid this is too low level for me to understand what is going on
> > > here. What are lazy callbacks and why do they need any specific action
> > > when we are getting close to OOM? I mean, I do understand that we might
> > > have many callers of call_rcu and free memory lazily. But there is quite
> > > a long way from the start of reclaim until we reach the OOM killer path.
> > > So why don't those callbacks get called during that time period? How are
> > > they triggered when we are not hitting the OOM path? They surely cannot
> > > sit there for ever, right? Can we trigger them sooner? Maybe the
> > > shrinker is not the best fit, but we have a retry feedback loop in the
> > > page allocator; maybe we can kick this processing from there.
> >
> > The effect of RCU's current OOM code is to speed up callback invocation
> > by at most a few seconds (assuming no stalled CPUs, in which case it is
> > not possible to speed up callback invocation).
> >
> > Given that, should I just remove RCU's OOM code entirely?
>
> Yeah, it seems so. I do not see how this would really help much. If we
> really need some way to kick callbacks then we should do so much earlier
> in the reclaim process - e.g. when we start struggling to reclaim any
> memory.

One approach would be to tell RCU "It is time to trade CPU for memory"
at the beginning of that struggle and then tell RCU "Go back to optimizing
for CPU" at the end of that struggle. Is there already a way to do this?
If so, RCU should probably just switch to it.

But what is the typical duration of such a struggle? Does this duration
change with workload? (I suspect that the answers are "who knows?" and
"yes", but you tell me!) Are there other oom handlers that would prefer
the approach of the previous paragraph?

> I am curious. Has the notifier been motivated by a real world use case
> or was it just a "nice thing to do"?

It was introduced by b626c1b689364 ("rcu: Provide OOM handler to motivate
lazy RCU callbacks"). The motivation for this commit was a set of changes
that improved energy efficiency by making CPUs sleep for longer when all
of their pending callbacks were known to only free memory (as opposed to
doing a wakeup or some such). Prior to this set of changes, a CPU with
callbacks would invoke those callbacks (thus freeing the memory) within
a jiffy or so of the end of a grace period. After this set of changes,
a CPU might wait several seconds.
This was a concern to people with small-memory systems, hence commit
b626c1b689364.

							Thanx, Paul
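As a sketch of the lazy/non-lazy distinction Paul describes (the demo_*
names are hypothetical; kfree_rcu() and call_rcu() are the real APIs): as I
understand it, on CONFIG_RCU_FAST_NO_HZ kernels of that era a callback
posted via kfree_rcu() is known to only free memory, so an otherwise-idle
CPU may batch it for several seconds, whereas a callback with arbitrary
side effects posted via call_rcu() is treated as non-lazy and invoked
within roughly one grace period.

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);

struct demo_obj {
	struct rcu_head rcu;
	int payload;
};

/* Non-lazy: has a side effect beyond freeing (a wakeup), so RCU aims to
 * run it promptly rather than batching it on an idle CPU. */
static void demo_free_and_wake(struct rcu_head *head)
{
	kfree(container_of(head, struct demo_obj, rcu));
	wake_up(&demo_wq);
}

static void demo_retire(struct demo_obj *obj, bool need_wakeup)
{
	if (need_wakeup)
		call_rcu(&obj->rcu, demo_free_and_wake);
	else
		kfree_rcu(obj, rcu);	/* lazy: only frees memory */
}

The OOM notifier's job was simply to promote CPUs out of that batched
state by queueing one non-lazy callback apiece.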
On Fri, Jun 29, 2018 at 11:35:48PM +0900, Tetsuo Handa wrote:
> On 2018/06/29 21:52, Paul E. McKenney wrote:
> > The effect of RCU's current OOM code is to speed up callback invocation
> > by at most a few seconds (assuming no stalled CPUs, in which case it is
> > not possible to speed up callback invocation).
> >
> > Given that, should I just remove RCU's OOM code entirely?
>
> out_of_memory() will start selecting an OOM victim without calling
> get_page_from_freelist(), because rcu_oom_notify() does not store a
> non-zero value in the "freed" argument.
>
> I think that rcu_oom_notify() needs to wait for completion of the callback
> invocations (possibly with a timeout in case there are stalling CPUs) and
> store a non-zero value in "freed" if the pending callbacks did release
> memory.

Waiting for the callbacks is easy. Timeouts would be a bit harder, but
still doable. I have no idea how to tell which callbacks freed memory
and how much -- all RCU does is invoke a function, and that function can
do whatever its developer wants.

> However, what is difficult to tell is whether invocation of the
> pending callbacks did release memory. The lack of a last-second
> get_page_from_freelist() call after
> blocking_notifier_call_chain(&oom_notify_list, 0, &freed) forces
> rcu_oom_notify() to report an appropriate value (i.e. zero or non-zero)
> via "freed".
>
> We have tried moving the really-last-second get_page_from_freelist() call
> to inside out_of_memory(), after
> blocking_notifier_call_chain(&oom_notify_list, 0, &freed). But that
> proposal was not accepted...
>
> We could move blocking_notifier_call_chain(&oom_notify_list, 0, &freed) to
> before the last-second get_page_from_freelist() call (and this is what this
> patch is trying to do), which would allow rcu_oom_notify() to always return
> 0... or update rcu_oom_notify() to use the shrinker API...

Would it be possible to tell RCU that memory was starting to get tight
with one call, and then tell it that things are OK with another call?
That would make much more sense from an RCU perspective.

							Thanx, Paul
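A sketch of the two-call interface Paul is asking about; nothing like it
exists in the tree, so the function names and the call site below are
entirely invented for illustration. The idea is that the allocator's retry
feedback loop, rather than the OOM killer, would flip RCU between the two
modes.

/* Hypothetical RCU-side hooks; these do not exist in the kernel. */
void rcu_prefer_memory(void);	/* stop batching lazy callbacks */
void rcu_prefer_cpu(void);	/* resume batching for energy efficiency */

static bool rcu_pressure_mode;

/* Hypothetical call site, driven from reclaim's retry feedback loop. */
static void reclaim_feedback(bool struggling)
{
	if (struggling && !rcu_pressure_mode) {
		rcu_pressure_mode = true;
		rcu_prefer_memory();
	} else if (!struggling && rcu_pressure_mode) {
		rcu_pressure_mode = false;
		rcu_prefer_cpu();
	}
}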
On Sat 30-06-18 10:05:22, Paul E. McKenney wrote:
> On Fri, Jun 29, 2018 at 03:26:38PM +0200, Michal Hocko wrote:
> > On Fri 29-06-18 05:52:18, Paul E. McKenney wrote:
> > > On Fri, Jun 29, 2018 at 11:04:19AM +0200, Michal Hocko wrote:
> > > > On Thu 28-06-18 14:31:05, Paul E. McKenney wrote:
> > > > > On Thu, Jun 28, 2018 at 01:39:42PM +0200, Michal Hocko wrote:
> > [...]
> > > > > > Well, I am not really sure what the objective of the oom notifier
> > > > > > is, to point you in the right direction. IIUC you just want to
> > > > > > kick callbacks to be handled sooner under heavy memory pressure,
> > > > > > right? How is that achieved? Kick a worker?
> > > > >
> > > > > That is achieved by enqueuing a non-lazy callback on each CPU's
> > > > > callback list, but only for those CPUs having non-empty lists. This
> > > > > causes CPUs with lists containing only lazy callbacks to be more
> > > > > aggressive; in particular, it prevents such CPUs from hanging out
> > > > > idle for seconds at a time while they have callbacks on their lists.
> > > > >
> > > > > The enqueuing happens via an IPI to the CPU in question.
> > > >
> > > > I am afraid this is too low level for me to understand what is going
> > > > on here. What are lazy callbacks and why do they need any specific
> > > > action when we are getting close to OOM? I mean, I do understand that
> > > > we might have many callers of call_rcu and free memory lazily. But
> > > > there is quite a long way from the start of reclaim until we reach
> > > > the OOM killer path. So why don't those callbacks get called during
> > > > that time period? How are they triggered when we are not hitting the
> > > > OOM path? They surely cannot sit there for ever, right? Can we
> > > > trigger them sooner? Maybe the shrinker is not the best fit, but we
> > > > have a retry feedback loop in the page allocator; maybe we can kick
> > > > this processing from there.
> > >
> > > The effect of RCU's current OOM code is to speed up callback invocation
> > > by at most a few seconds (assuming no stalled CPUs, in which case it is
> > > not possible to speed up callback invocation).
> > >
> > > Given that, should I just remove RCU's OOM code entirely?
> >
> > Yeah, it seems so. I do not see how this would really help much. If we
> > really need some way to kick callbacks then we should do so much earlier
> > in the reclaim process - e.g. when we start struggling to reclaim any
> > memory.
>
> One approach would be to tell RCU "It is time to trade CPU for memory"
> at the beginning of that struggle and then tell RCU "Go back to optimizing
> for CPU" at the end of that struggle. Is there already a way to do this?
> If so, RCU should probably just switch to it.

Well, I can think of the first "we are struggling" part. This would be
any time we are struggling to reclaim any memory. If we knew how much
memory might be sitting in callbacks then it would certainly help us make
an educated decision, but I am worried we do not have any number to look
at. Or maybe we do have the number of callbacks, and could use that to
know when to kick the processing?

The other part - going back to optimizing for CPU - is harder. We are
usually not running the code when we do not struggle ;)

> But what is the typical duration of such a struggle? Does this duration
> change with workload? (I suspect that the answers are "who knows?" and
> "yes", but you tell me!) Are there other oom handlers that would prefer
> the approach of the previous paragraph?
>
> > I am curious. Has the notifier been motivated by a real world use case
> > or was it just a "nice thing to do"?
>
> It was introduced by b626c1b689364 ("rcu: Provide OOM handler to motivate
> lazy RCU callbacks"). The motivation for this commit was a set of changes
> that improved energy efficiency by making CPUs sleep for longer when all
> of their pending callbacks were known to only free memory (as opposed to
> doing a wakeup or some such). Prior to this set of changes, a CPU with
> callbacks would invoke those callbacks (thus freeing the memory) within
> a jiffy or so of the end of a grace period. After this set of changes,
> a CPU might wait several seconds. This was a concern to people with
> small-memory systems, hence commit b626c1b689364.

Do we have any real life examples of OOMs caused by the lazy execution
of rcu callbacks? If not then I would simply drop the notifier and get
back to _some_ implementation once somebody sees a real problem.
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac11..085b033 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -101,6 +101,7 @@ extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
 
+extern unsigned long try_oom_notifier(void);
 extern bool out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(void);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e7..2ff5db2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1010,6 +1010,33 @@ int unregister_oom_notifier(struct notifier_block *nb)
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
 /**
+ * try_oom_notifier - Try to reclaim memory from OOM notifier list.
+ *
+ * Returns non-zero if notifier callbacks released something, zero otherwise.
+ */
+unsigned long try_oom_notifier(void)
+{
+	static DEFINE_MUTEX(oom_notifier_lock);
+	unsigned long freed = 0;
+
+	/*
+	 * Since OOM notifier callbacks must not depend on __GFP_DIRECT_RECLAIM
+	 * && !__GFP_NORETRY memory allocation, waiting for mutex here is safe.
+	 * If lockdep reports possible deadlock dependency, it will be a bug in
+	 * OOM notifier callbacks.
+	 *
+	 * If SIGKILL is pending, it is likely that current thread was selected
+	 * as an OOM victim. In that case, current thread should return as soon
+	 * as possible using memory reserves.
+	 */
+	if (mutex_lock_killable(&oom_notifier_lock))
+		return 0;
+	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
+	mutex_unlock(&oom_notifier_lock);
+	return freed;
+}
+
+/**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
  *
@@ -1020,19 +1047,11 @@ int unregister_oom_notifier(struct notifier_block *nb)
  */
 bool out_of_memory(struct oom_control *oc)
 {
-	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	if (oom_killer_disabled)
 		return false;
 
-	if (!is_memcg_oom(oc)) {
-		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-		if (freed > 0)
-			/* Got some memory back in the last second. */
-			return true;
-	}
-
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it. The goal is to allow it to allocate so that it may
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100..c72ef1e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3447,10 +3447,50 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	return page;
 }
 
+static inline bool can_oomkill(gfp_t gfp_mask, unsigned int order,
+			       const struct alloc_context *ac)
+{
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		return false;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return false;
+	/*
+	 * We have already exhausted all our reclaim opportunities without any
+	 * success so it is time to admit defeat. We will skip the OOM killer
+	 * because it is very likely that the caller has a more reasonable
+	 * fallback than shooting a random task.
+	 */
+	if (gfp_mask & __GFP_RETRY_MAYFAIL)
+		return false;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		return false;
+	if (pm_suspended_storage())
+		return false;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		return false;
+
+	return true;
+}
+
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
+	const bool oomkill = can_oomkill(gfp_mask, order, ac);
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
 		.nodemask = ac->nodemask,
@@ -3462,6 +3502,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 	*did_some_progress = 0;
 
+	/* Try to reclaim via OOM notifier callback. */
+	if (oomkill)
+		*did_some_progress = try_oom_notifier();
+
 	/*
 	 * Acquire the oom lock. If that fails, somebody else is
 	 * making progress for us.
@@ -3485,37 +3529,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	if (page)
 		goto out;
 
-	/* Coredumps can quickly deplete all memory reserves */
-	if (current->flags & PF_DUMPCORE)
-		goto out;
-	/* The OOM killer will not help higher order allocs */
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
-		goto out;
-	/*
-	 * We have already exhausted all our reclaim opportunities without any
-	 * success so it is time to admit defeat. We will skip the OOM killer
-	 * because it is very likely that the caller has a more reasonable
-	 * fallback than shooting a random task.
-	 */
-	if (gfp_mask & __GFP_RETRY_MAYFAIL)
-		goto out;
-	/* The OOM killer does not needlessly kill tasks for lowmem */
-	if (ac->high_zoneidx < ZONE_NORMAL)
-		goto out;
-	if (pm_suspended_storage())
-		goto out;
-	/*
-	 * XXX: GFP_NOFS allocations should rather fail than rely on
-	 * other request to make a forward progress.
-	 * We are in an unfortunate situation where out_of_memory cannot
-	 * do much for this context but let's try it to at least get
-	 * access to memory reserved if the current task is killed (see
-	 * out_of_memory). Once filesystems are ready to handle allocation
-	 * failures more gracefully we should just bail out here.
-	 */
-
-	/* The OOM killer may not free memory on a specific node */
-	if (gfp_mask & __GFP_THISNODE)
+	if (!oomkill)
 		goto out;
 
 	/* Exhausted what can be done so it's blame time */
Sleeping with oom_lock held can cause an AB-BA lockup bug because
__alloc_pages_may_oom() does not wait for oom_lock. Since
blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
with oom_lock held is currently an unavoidable problem.

As a preparation for not sleeping with oom_lock held, this patch brings
the OOM notifier callbacks to outside of the OOM killer, with two small
behavior changes explained below.

One is that this patch makes it impossible for SysRq-f and PF-OOM to
reclaim via the OOM notifier. But such a change should be tolerable
because "we are unlikely to use SysRq-f for reclaiming memory via OOM
notifier callbacks" and "pagefault_out_of_memory() will be called when
the OOM killer selected current thread as an OOM victim after the OOM
notifier callbacks already failed to reclaim memory".

The other is that this patch makes it possible to reclaim memory via the
OOM notifier after the OOM killer is disabled (that is, while
suspend/hibernate is in progress). But such a change should be safe
because of the pm_suspended_storage() check.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h |  1 +
 mm/oom_kill.c       | 35 ++++++++++++++++++------
 mm/page_alloc.c     | 76 +++++++++++++++++++++++++++++++----------------------
 3 files changed, 73 insertions(+), 39 deletions(-)