| Field | Value |
|---|---|
| Message ID | 20141105124620.GB4527@dhcp22.suse.cz (mailing list archive) |
| State | Not Applicable, archived |
Hello, Michal.

On Wed, Nov 05, 2014 at 01:46:20PM +0100, Michal Hocko wrote:
> As I've said I wasn't entirely happy with this half solution but it helped
> the current situation at the time. The full solution would require to

I don't think this helps the situation. It just makes the bug more obscure, and the race window, while reduced, is still pretty big; there seems to be an actual, not too low chance of the bug triggering out in the wild. How does this level of obscuring help anything? In addition to making the bug more difficult to reproduce, it also adds a bunch of code which *pretends* to address the issue but ultimately just lowers visibility into what's going on and hinders tracking down the issue when something actually goes wrong. This is *NOT* making the situation better. The patch is a net negative.

> I think the patch below should be safe. Would you prefer this solution
> instead? It is race free but there is the risk that exposing a lock which

Yes, this is a lot saner approach in general.

> completely blocks OOM killer from the allocation path will kick us
> later.

Can you please spell it out? How would it kick us? We already have oom_killer_disable/enable(); how is this any different in terms of correctness from them? Also, why isn't this part of oom_killer_disable/enable()? The way they're implemented is really silly now. It just sets a flag and returns whether there's a currently running instance or not. How were these even useful? Why can't you just make disable/enable do what they were supposed to do from the beginning?

Thanks.
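For context, the oom_killer_disable()/oom_killer_enable() helpers under discussion were, at this point, little more than flag setters in include/linux/oom.h (their shape is visible in the header hunk of the patch quoted further down). A simplified sketch of what "disabling" meant then:

```c
/* Simplified sketch of the pre-patch helpers in include/linux/oom.h:
 * "disabling" only raises a flag, so an OOM kill that is already in
 * flight is neither waited for nor prevented from completing. */
extern bool oom_killer_disabled;

static inline void oom_killer_disable(void)
{
        oom_killer_disabled = true;
}

static inline void oom_killer_enable(void)
{
        oom_killer_disabled = false;
}
```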
On Wed 05-11-14 08:02:47, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Nov 05, 2014 at 01:46:20PM +0100, Michal Hocko wrote:
> > As I've said I wasn't entirely happy with this half solution but it helped
> > the current situation at the time. The full solution would require to
>
> I don't think this helps the situation. It just makes the bug more
> obscure and the race window while reduced is still pretty big and
> there seems to be an actual not too low chance of the bug triggering
> out in the wild. How does this level of obscuring help anything? In
> addition to making the bug more difficult to reproduce, it also adds a
> bunch of code which *pretends* to address the issue but ultimately
> just lowers visibility into what's going on and hinders tracking down
> the issue when something actually goes wrong. This is *NOT* making
> the situation better. The patch is net negative.

The patch was a compromise. It was needed to catch the most common OOM conditions while the tasks are getting frozen. The race window between the counter increment and the check in the PM path is negligible compared to the freezing process. And it is safe from the OOM point of view because nothing can block the OOM killer.

> > I think the patch below should be safe. Would you prefer this solution
> > instead? It is race free but there is the risk that exposing a lock which
>
> Yes, this is an a lot saner approach in general.
>
> > completely blocks OOM killer from the allocation path will kick us
> > later.
>
> Can you please spell it out? How would it kick us? We already have
> oom_killer_disable/enable(), how is this any different in terms of
> correctness from them?

As already said in the part of the email you haven't quoted: oom_killer_disable will cause allocations to _fail_. With the lock you are _blocking_ the OOM killer completely. This is error prone because no part of the system should be able to block the last-resort response to memory shortage.

> Also, why isn't this part of
> oom_killer_disable/enable()? The way they're implemented is really
> silly now. It just sets a flag and returns whether there's a
> currently running instance or not. How were these even useful?
> Why can't you just make disable/enable to what they were supposed to
> do from the beginning?

Because then we would block all the potential allocators coming from workqueues or kernel threads which are not frozen yet rather than fail the allocation.

I am not familiar with the PM code and all the paths this might get called from enough to tell whether failing the allocation is a better approach than failing the suspend operation on a timeout.
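A minimal sketch of the two behaviours Michal contrasts here, using the names from the patch quoted later in the thread (this is only an illustration of the fail-vs-block distinction, not the actual allocator slow path, which is elided):

```c
/* (a) Failing: with only the oom_killer_disabled flag, the allocation
 * slow path bails out and the allocation fails.  Nothing sleeps. */
if (oom_killer_disabled)
        goto nopage;

/* (b) Blocking: with an oom_sem rwsem held for write by the PM freezer,
 * any allocator that reaches the OOM path sleeps in down_read() until
 * the freezer drops the lock; this is the "blocking the last resort"
 * behaviour Michal is worried about. */
down_read(&oom_sem);
if (!oom_killer_disabled)
        out_of_memory(zonelist, gfp_mask, order, nodemask, false);
up_read(&oom_sem);
```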
On Wed 05-11-14 14:31:00, Michal Hocko wrote:
> On Wed 05-11-14 08:02:47, Tejun Heo wrote:
[...]
> > Also, why isn't this part of
> > oom_killer_disable/enable()? The way they're implemented is really
> > silly now. It just sets a flag and returns whether there's a
> > currently running instance or not. How were these even useful?
> > Why can't you just make disable/enable to what they were supposed to
> > do from the beginning?
>
> Because then we would block all the potential allocators coming from
> workqueues or kernel threads which are not frozen yet rather than fail
> the allocation.

After thinking about this more, it would be doable by using a trylock in the allocation OOM path. I will respin the patch. The API will be cleaner this way.
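A rough sketch of the trylock variant Michal is describing (hypothetical; the actual respin may differ): if the freezer already holds oom_sem for write, the allocating task does not sleep, it simply skips the OOM killer and lets the allocation fail.

```c
/* Sketch only: the tail of __alloc_pages_may_oom() with a non-blocking
 * check.  Names follow the patch quoted below; if the PM freezer holds
 * oom_sem for write, back off and fail the allocation instead of
 * sleeping in the OOM path. */
if (!down_read_trylock(&oom_sem))
        goto out;               /* OOM killer is being disabled, fail */
if (!oom_killer_disabled)
        out_of_memory(zonelist, gfp_mask, order, nodemask, false);
up_read(&oom_sem);
```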
On Wed 05-11-14 13:46:20, Michal Hocko wrote:
[...]
> From ef6227565fa65b52986c4626d49ba53b499e54d1 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 5 Nov 2014 11:49:14 +0100
> Subject: [PATCH] OOM, PM: make OOM detection in the freezer path raceless
>
> 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
> has left a race window when OOM killer manages to note_oom_kill after
> freeze_processes checks the counter. The race window is quite small
> and really unlikely and deemed sufficient at the time of submission.
>
> Tejun wasn't happy about this partial solution though and insisted on
> a full solution. That requires the full OOM and freezer exclusion,
> though. This is done by this patch which introduces oom_sem RW lock.
> Page allocation OOM path takes the lock for reading because there might
> be concurrent OOM happening on disjunct zonelists. oom_killer_disabled
> check is moved right before out_of_memory is called because it was
> checked too early before and we do not want to hold the lock while doing
> the last attempt for allocation which might involve zone_reclaim.

This is incorrect because it would cause an endless allocation loop: we really have to go to nopage if the OOM killer is disabled.

> freeze_processes then takes the lock for write throughout the whole
> freezing process and OOM disabling.
>
> There is no need to recheck all the processes with the full
> synchronization anymore.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  include/linux/oom.h    |  5 +++++
>  kernel/power/process.c | 50 +++++++++----------------------------------------
>  mm/oom_kill.c          | 17 -----------------
>  mm/page_alloc.c        | 24 ++++++++++++------------
>  4 files changed, 26 insertions(+), 70 deletions(-)
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index e8d6e1058723..350b9b2ffeec 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -73,7 +73,12 @@ extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
>
> +/*
> + * oom_killer_disabled can be modified only under oom_sem taken for write
> + * and checked under read lock along with the full OOM handler.
> + */
>  extern bool oom_killer_disabled;
> +extern struct rw_semaphore oom_sem;
>
>  static inline void oom_killer_disable(void)
>  {
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 5a6ec8678b9a..befce9785233 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
>  	return todo ? -EBUSY : 0;
>  }
>
> -static bool __check_frozen_processes(void)
> -{
> -	struct task_struct *g, *p;
> -
> -	for_each_process_thread(g, p)
> -		if (p != current && !freezer_should_skip(p) && !frozen(p))
> -			return false;
> -
> -	return true;
> -}
> -
> -/*
> - * Returns true if all freezable tasks (except for current) are frozen already
> - */
> -static bool check_frozen_processes(void)
> -{
> -	bool ret;
> -
> -	read_lock(&tasklist_lock);
> -	ret = __check_frozen_processes();
> -	read_unlock(&tasklist_lock);
> -	return ret;
> -}
> -
>  /**
>   * freeze_processes - Signal user space processes to enter the refrigerator.
>   * The current thread will not be frozen. The same process that calls
> @@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
>  int freeze_processes(void)
>  {
>  	int error;
> -	int oom_kills_saved;
>
>  	error = __usermodehelper_disable(UMH_FREEZING);
>  	if (error)
> @@ -157,27 +132,20 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
> +
> +	/*
> +	 * Need to exlude OOM killer from triggering while tasks are
> +	 * getting frozen to make sure none of them gets killed after
> +	 * try_to_freeze_tasks is done.
> +	 */
> +	down_write(&oom_sem);
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
>  		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			printk("done.");
> -		}
> +		printk("done.\n");
>  	}
> -	printk("\n");
> +	up_write(&oom_sem);
>  	BUG_ON(in_atomic());
>
>  	if (error)
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5340f6b91312..bbf405a3a18f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  	dump_tasks(memcg, nodemask);
>  }
>
> -/*
> - * Number of OOM killer invocations (including memcg OOM killer).
> - * Primarily used by PM freezer to check for potential races with
> - * OOM killed frozen task.
> - */
> -static atomic_t oom_kills = ATOMIC_INIT(0);
> -
> -int oom_kills_count(void)
> -{
> -	return atomic_read(&oom_kills);
> -}
> -
> -void note_oom_kill(void)
> -{
> -	atomic_inc(&oom_kills);
> -}
> -
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  /*
>   * Must be called while holding a reference to p, which will be released upon
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9cd36b822444..76095266c4b5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -243,6 +243,7 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
>  }
>
>  bool oom_killer_disabled __read_mostly;
> +DECLARE_RWSEM(oom_sem);
>
>  #ifdef CONFIG_DEBUG_VM
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
> @@ -2252,14 +2253,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>
>  	/*
> -	 * PM-freezer should be notified that there might be an OOM killer on
> -	 * its way to kill and wake somebody up. This is too early and we might
> -	 * end up not killing anything but false positives are acceptable.
> -	 * See freeze_processes.
> -	 */
> -	note_oom_kill();
> -
> -	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
>  	 * we're still under heavy pressure.
> @@ -2288,8 +2281,17 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
>  	}
> -	/* Exhausted what can be done so it's blamo time */
> -	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
> +
> +	/*
> +	 * Exhausted what can be done so it's blamo time.
> +	 * Just make sure that we cannot race with oom_killer disabling
> +	 * e.g. PM freezer needs to make sure that no OOM happens after
> +	 * all tasks are frozen.
> +	 */
> +	down_read(&oom_sem);
> +	if (!oom_killer_disabled)
> +		out_of_memory(zonelist, gfp_mask, order, nodemask, false);
> +	up_read(&oom_sem);
>
>  out:
>  	oom_zonelist_unlock(zonelist, gfp_mask);
> @@ -2716,8 +2718,6 @@ rebalance:
>  	 */
>  	if (!did_some_progress) {
>  		if (oom_gfp_allowed(gfp_mask)) {
> -			if (oom_killer_disabled)
> -				goto nopage;
>  			/* Coredumps can quickly deplete all memory reserves */
>  			if ((current->flags & PF_DUMPCORE) &&
>  			    !(gfp_mask & __GFP_NOFAIL))
> --
> 2.1.1
>
>
> --
> Michal Hocko
> SUSE Labs
On Wed, Nov 05, 2014 at 02:42:19PM +0100, Michal Hocko wrote:
> On Wed 05-11-14 14:31:00, Michal Hocko wrote:
> > On Wed 05-11-14 08:02:47, Tejun Heo wrote:
> [...]
> > > Also, why isn't this part of
> > > oom_killer_disable/enable()? The way they're implemented is really
> > > silly now. It just sets a flag and returns whether there's a
> > > currently running instance or not. How were these even useful?
> > > Why can't you just make disable/enable to what they were supposed to
> > > do from the beginning?
> >
> > Because then we would block all the potential allocators coming from
> > workqueues or kernel threads which are not frozen yet rather than fail
> > the allocation.
>
> After thinking about this more it would be doable by using trylock in
> the allocation oom path. I will respin the patch. The API will be
> cleaner this way.

In disable, block new invocations of the OOM killer and then drain the in-progress ones. This is a common pattern, isn't it?
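A generic sketch of the block-then-drain pattern Tejun is referring to (hypothetical code with made-up names, not the kernel's implementation, and with memory-ordering details glossed over): new invocations are refused once a flag is set, and disable() waits until the count of in-flight invocations drops to zero.

```c
/* Hypothetical sketch of "block new invocations, then drain the
 * in-progress ones". */
static bool oom_disabled;
static atomic_t oom_in_flight = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(oom_drain_wq);

static void oom_end(void)
{
        /* last one out wakes the waiter in oom_killer_disable() */
        if (atomic_dec_and_test(&oom_in_flight))
                wake_up_all(&oom_drain_wq);
}

static bool oom_begin(void)
{
        atomic_inc(&oom_in_flight);
        if (READ_ONCE(oom_disabled)) {  /* new invocations are blocked */
                oom_end();
                return false;
        }
        return true;
}

void oom_killer_disable(void)
{
        WRITE_ONCE(oom_disabled, true);
        /* drain: wait until every in-flight invocation has finished */
        wait_event(oom_drain_wq, atomic_read(&oom_in_flight) == 0);
}
```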
On Wed 05-11-14 10:44:36, Tejun Heo wrote:
> On Wed, Nov 05, 2014 at 02:42:19PM +0100, Michal Hocko wrote:
> > On Wed 05-11-14 14:31:00, Michal Hocko wrote:
> > > On Wed 05-11-14 08:02:47, Tejun Heo wrote:
> > [...]
> > > > Also, why isn't this part of
> > > > oom_killer_disable/enable()? The way they're implemented is really
> > > > silly now. It just sets a flag and returns whether there's a
> > > > currently running instance or not. How were these even useful?
> > > > Why can't you just make disable/enable to what they were supposed to
> > > > do from the beginning?
> > >
> > > Because then we would block all the potential allocators coming from
> > > workqueues or kernel threads which are not frozen yet rather than fail
> > > the allocation.
> >
> > After thinking about this more it would be doable by using trylock in
> > the allocation oom path. I will respin the patch. The API will be
> > cleaner this way.
>
> In disable, block new invocations of OOM killer and then drain the
> in-progress ones. This is a common pattern, isn't it?

I am not sure I am following. With the latest patch the OOM path is no longer blocked by PM (aka oom_killer_disable()); allocations simply fail if the read_trylock fails. oom_killer_disable is moved before tasks are frozen and it will wait for all on-going OOM killers on the write lock. The OOM killer is enabled again on the resume path.
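The suspend-side ordering Michal describes, as a sketch (hypothetical, based only on his description here; the real respin and the real freeze_processes() are more involved): disabling takes oom_sem for write, which waits for any OOM kill already in flight (a reader), while later allocators use the trylock shown earlier and simply fail.

```c
/* Hypothetical sketch; oom_killer_disabled and oom_sem follow the
 * names in the patch quoted above. */
void oom_killer_disable(void)
{
        down_write(&oom_sem);           /* waits for in-flight OOM kills */
        oom_killer_disabled = true;
        up_write(&oom_sem);
}

void oom_killer_enable(void)
{
        oom_killer_disabled = false;    /* called again from the resume path */
}

int freeze_processes(void)
{
        int error;

        oom_killer_disable();           /* before any task is frozen */
        error = try_to_freeze_tasks(true);
        if (error)
                oom_killer_enable();    /* give the OOM killer back on failure */
        return error;
}
```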
Hello, Michal.

On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote:
> I am not sure I am following. With the latest patch OOM path is no
> longer blocked by the PM (aka oom_killer_disable()). Allocations simply
> fail if the read_trylock fails.
> oom_killer_disable is moved before tasks are frozen and it will wait for
> all on-going OOM killers on the write lock. OOM killer is enabled again
> on the resume path.

Sure, but why are we exposing new interfaces? Can't we just make oom_killer_disable() first set the disable flag and wait for the on-going ones to finish (and make the function fail if it gets chosen as an OOM victim)? It's weird to expose extra stuff on top. Why are we doing that?

Thanks.
On Wed 05-11-14 11:29:29, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote:
> > I am not sure I am following. With the latest patch OOM path is no
> > longer blocked by the PM (aka oom_killer_disable()). Allocations simply
> > fail if the read_trylock fails.
> > oom_killer_disable is moved before tasks are frozen and it will wait for
> > all on-going OOM killers on the write lock. OOM killer is enabled again
> > on the resume path.
>
> Sure, but why are we exposing new interfaces? Can't we just make
> oom_killer_disable() first set the disable flag and wait for the
> on-going ones to finish (and make the function fail if it gets chosen
> as an OOM victim)?

Still not following. How do you want to detect an on-going OOM without any interface around out_of_memory?
On Wed, Nov 05, 2014 at 05:39:56PM +0100, Michal Hocko wrote:
> On Wed 05-11-14 11:29:29, Tejun Heo wrote:
> > Hello, Michal.
> >
> > On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote:
> > > I am not sure I am following. With the latest patch OOM path is no
> > > longer blocked by the PM (aka oom_killer_disable()). Allocations simply
> > > fail if the read_trylock fails.
> > > oom_killer_disable is moved before tasks are frozen and it will wait for
> > > all on-going OOM killers on the write lock. OOM killer is enabled again
> > > on the resume path.
> >
> > Sure, but why are we exposing new interfaces? Can't we just make
> > oom_killer_disable() first set the disable flag and wait for the
> > on-going ones to finish (and make the function fail if it gets chosen
> > as an OOM victim)?
>
> Still not following. How do you want to detect an on-going OOM without
> any interface around out_of_memory?

I thought you were using oom_killer_allowed_start() outside the OOM path. Ugh... why is everything weirdly structured? oom_killer_disabled implies that the OOM killer may fail, right? Why is __alloc_pages_slowpath() checking it directly? If whether OOM killing failed or not is relevant to its users, make out_of_memory() return an error code. There's no reason for the exclusion detail to leak out of the OOM killer proper. The only interface should be disable/enable and whether OOM killing failed or not.

Thanks.
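What Tejun is asking for, sketched as a hypothetical interface (not code from this thread): the allocation slow path never looks at oom_killer_disabled itself, it only sees whether out_of_memory() was allowed to act.

```c
/*
 * Hypothetical shape of the interface Tejun suggests: out_of_memory()
 * reports whether it was allowed to do anything, and the exclusion
 * against oom_killer_disable() stays entirely inside the OOM killer.
 */
bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
                   int order, nodemask_t *nodemask, bool force_kill);

/* caller side, e.g. in __alloc_pages_may_oom(): */
if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false)) {
        /* OOM killer is disabled (PM freeze in progress): fail the
         * allocation instead of looping forever. */
        goto out;
}
```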
On Wed, Nov 05, 2014 at 11:54:28AM -0500, Tejun Heo wrote:
> > Still not following. How do you want to detect an on-going OOM without
> > any interface around out_of_memory?
>
> I thought you were using oom_killer_allowed_start() outside OOM path.
> Ugh.... why is everything weirdly structured? oom_killer_disabled
> implies that oom killer may fail, right? Why is
> __alloc_pages_slowpath() checking it directly? If whether oom killing
> failed or not is relevant to its users, make out_of_memory() return an
> error code. There's no reason for the exclusion detail to leak out of
> the oom killer proper. The only interface should be disable/enable
> and whether oom killing failed or not.

And what's implemented is wrong. What happens if OOM killing is already in progress, a task then blocks trying to write-lock the rwsem, and that task is selected as the OOM victim? The disable() call must be able to fail.
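The deadlock Tejun points out, and why disable() must be able to fail, sketched (hypothetical code; names follow the patch in this thread):

```c
/*
 * The scenario Tejun describes:
 *
 *   1. An OOM kill is already in progress (oom_sem held for read).
 *   2. The suspend path calls down_write(&oom_sem) and sleeps until
 *      the OOM path drops the read lock.
 *   3. The OOM killer picks the task sleeping in down_write() as its
 *      victim.  The victim cannot exit while it sleeps there, and the
 *      writer never gets the lock: deadlock.
 *
 * So a disable that waits has to be interruptible by a kill and must
 * report failure, roughly:
 */
bool oom_killer_disable(void)
{
        if (down_write_killable(&oom_sem))      /* woken by a fatal signal */
                return false;                   /* caller must abort suspend */
        oom_killer_disabled = true;
        up_write(&oom_sem);
        return true;
}
```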