Message ID | 20141106160223.GJ7202@dhcp22.suse.cz (mailing list archive) |
---|---|
State | Not Applicable, archived |
On Thu, Nov 06, 2014 at 05:02:23PM +0100, Michal Hocko wrote:
> > Why would PM freezing make OOM killing fail?  That doesn't make much
> > sense.  Sure, it can block it for a finite duration for sync purposes
> > but making OOM killing fail seems the wrong way around.
>
> We cannot block in the allocation path because the request might come
> from the freezer path itself (e.g. when suspending devices etc.).
> At least this is my understanding why the original oom disable approach
> was implemented.

I was saying that it could temporarily block either direction to
implement proper synchronization while guaranteeing forward progress.

> > We're doing one thing for non-PM freezing and the other way around for
> > PM freezing, which indicates one of the two directions is wrong.
>
> Because those two paths are quite different in their requirements. The
> cgroup freezer only cares about freezing tasks and it doesn't have to
> care about tasks accessing a possibly half suspended device on their way
> out.

I don't think the fundamental relationship between freezing and oom
killing is different between the two and the failure to recognize
that is what's leading to these weird issues.

> > Shouldn't it be that OOM killing happening while PM freezing is in
> > progress cancels PM freezing rather than the other way around?  Find a
> > point in PM suspend/hibernation operation where everything must be
> > stable, disable OOM killing there and check whether OOM killing
> > happened in between and if so back out.
>
> This is freeze_processes AFAIU. I might be wrong of course but this is
> the time since when nobody should be waking processes up because they
> could access half suspended devices.

No, you're doing it before freezing starts.  The system is in no way
in a quiescent state at that point.

> > It seems rather obvious to me that OOM killing has to have precedence
> > over PM freezing.
> >
> > Sure, once the system reaches a point where the whole system must be
> > in a stable state for snapshotting or whatever, disabling OOM killing
> > is fine but at that point the system is in a very limited execution
> > mode and sure won't be processing page faults from userland for
> > example and we can actually disable OOM killing knowing that anything
> > afterwards is ready to handle memory allocation failures.
>
> I am really confused now. This is basically what the final patch does
> actually. Here is what I have currently just to make the further
> discussion easier.

Please see above.
On Thu 06-11-14 11:28:45, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 05:02:23PM +0100, Michal Hocko wrote:
[...]
> > > We're doing one thing for non-PM freezing and the other way around for
> > > PM freezing, which indicates one of the two directions is wrong.
> >
> > Because those two paths are quite different in their requirements. The
> > cgroup freezer only cares about freezing tasks and it doesn't have to
> > care about tasks accessing a possibly half suspended device on their way
> > out.
>
> I don't think the fundamental relationship between freezing and oom
> killing is different between the two and the failure to recognize
> that is what's leading to these weird issues.

I do not understand the above. Could you be more specific, please?
AFAIU cgroup freezer requires that no task will leak into userspace
while the cgroup is frozen. This is naturally true for the OOM path
whether the two are synchronized or not.
The PM freezer, on the other hand, requires that no task is _woken up_
after all tasks are frozen. This requires synchronization between the
freezer and OOM path because allocations are allowed also after tasks
are frozen.
What am I missing?

> > > Shouldn't it be that OOM killing happening while PM freezing is in
> > > progress cancels PM freezing rather than the other way around?  Find a
> > > point in PM suspend/hibernation operation where everything must be
> > > stable, disable OOM killing there and check whether OOM killing
> > > happened in between and if so back out.
> >
> > This is freeze_processes AFAIU. I might be wrong of course but this is
> > the time since when nobody should be waking processes up because they
> > could access half suspended devices.
>
> No, you're doing it before freezing starts.  The system is in no way
> in a quiescent state at that point.

You are right! Userspace shouldn't see any unexpected allocation
failures just because PM freezing is in progress. This whole process
should be transparent from userspace POV.

I am getting back to
	oom_killer_lock();
	error = try_to_freeze_tasks();
	if (!error)
		oom_killer_disable();
	oom_killer_unlock();

Thanks!
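To make the distinction in the message above concrete, here is a rough single-threaded userspace sketch in C. It is not kernel code: `struct oom_state`, `enum oom_result`, `out_of_memory_model()` and `freeze_processes_model()` are hypothetical names invented for illustration, and "must wait" merely stands in for an allocating task blocking on the lock. The point it tries to show is that while the freezer only holds the lock, an allocation is delayed rather than failed, and allocations start failing only once freezing has succeeded and the OOM killer is disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of an allocation that would need the OOM killer. */
enum oom_result { OOM_KILLED, OOM_MUST_WAIT, OOM_ALLOC_FAILS };

/* Hypothetical model state; these are not kernel data structures. */
struct oom_state {
	bool locked;	/* models oom_killer_lock() being held by the freezer */
	bool disabled;	/* models oom_killer_disable() having been called */
};

static enum oom_result out_of_memory_model(const struct oom_state *s)
{
	if (s->disabled)
		return OOM_ALLOC_FAILS;	/* only after freezing succeeded */
	if (s->locked)
		return OOM_MUST_WAIT;	/* delayed, but no failure is visible */
	return OOM_KILLED;
}

/*
 * Models the lock / freeze / disable-on-success / unlock sequence.
 * freeze_ok says whether the modelled try_to_freeze_tasks() succeeds;
 * *during records what an allocation would see while freezing runs.
 */
static int freeze_processes_model(struct oom_state *s, bool freeze_ok,
				  enum oom_result *during)
{
	int error;

	assert(!s->locked);		/* the modelled lock is not recursive */
	s->locked = true;		/* oom_killer_lock() */
	*during = out_of_memory_model(s);
	error = freeze_ok ? 0 : -1;	/* try_to_freeze_tasks() */
	if (!error)
		s->disabled = true;	/* oom_killer_disable() */
	s->locked = false;		/* oom_killer_unlock() */
	return error;
}
```

If the modelled freeze fails, the OOM killer stays enabled and later allocations behave as before; only a successful freeze leaves it disabled.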
Hi,
here is another take at OOM vs. PM freezer interaction fixes/cleanups.
The first three patches fix unlikely cases in which OOM races with
the PM freezer; these races should finally be closed completely. The
last patch is a simple code enhancement which is not strictly needed
but is nice to have IMO.

Both OOM killer and PM freezer are quite subtle so I hope I haven't
missed anything. Any feedback is highly appreciated. I am also
interested in feedback on the approach used. To be honest I am not
really happy about spreading TIF_MEMDIE checks into freezer (patch 1)
but I didn't find any other way for detecting OOM killed tasks.

Changes are based on top of Linus tree (3.18-rc3).

Michal Hocko (4):
      OOM, PM: Do not miss OOM killed frozen tasks
      OOM, PM: make OOM detection in the freezer path raceless
      OOM, PM: handle pm freezer as an OOM victim correctly
      OOM: thaw the OOM victim if it is frozen

Diffstat says:
 drivers/tty/sysrq.c    |  6 ++--
 include/linux/oom.h    | 39 ++++++++++++++++------
 kernel/freezer.c       | 15 +++++++--
 kernel/power/process.c | 60 +++++++++-------------------------
 mm/memcontrol.c        |  4 ++-
 mm/oom_kill.c          | 89 ++++++++++++++++++++++++++++++++++++++------------
 mm/page_alloc.c        | 32 +++++++++---------
 7 files changed, 147 insertions(+), 98 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 12, 2014 at 07:58:48PM +0100, Michal Hocko wrote:
> Hi,
> here is another take at OOM vs. PM freezer interaction fixes/cleanups.
> The first three patches fix unlikely cases in which OOM races with
> the PM freezer; these races should finally be closed completely. The
> last patch is a simple code enhancement which is not strictly needed
> but is nice to have IMO.
>
> Both OOM killer and PM freezer are quite subtle so I hope I haven't
> missed anything. Any feedback is highly appreciated. I am also
> interested in feedback on the approach used. To be honest I am not
> really happy about spreading TIF_MEMDIE checks into freezer (patch 1)
> but I didn't find any other way for detecting OOM killed tasks.

I really don't get why this is structured this way.  Can't you just do
the following?

1. Freeze all freezables.  Don't worry about PF_MEMDIE.

2. Disable OOM killer.  This should be contained in the OOM killer
   proper.  Lock out the OOM killer and disable it.

3. At this point, we know that no one will create more freezable
   threads and no new process will be OOM killed.  Wait till there's
   no process w/ PF_MEMDIE set.

There's no reason to lock out or disable OOM killer while the system
is not in the quiescent state, which is a big can of worms.  Bring
down the system to the quiescent state, disable the OOM killer and
then drain PF_MEMDIEs.

Thanks.
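The three steps proposed above can be sketched as a small single-threaded userspace model in C. This is only an illustration under stated assumptions, not the kernel implementation: `struct task`, `oom_kill()`, `victim_exit()` and `enter_quiescent_state()` are hypothetical names, the `oom_victim` field stands in for the flag the email calls PF_MEMDIE (TIF_MEMDIE in the cover letter), the model assumes the freezer leaves already-selected victims alone, and "waiting" for victims is modelled by simply letting each one exit in turn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model; the fields are illustrative, not kernel fields. */
struct task {
	bool frozen;
	bool oom_victim;	/* stands in for the MEMDIE flag */
	bool exited;
};

static bool oom_disabled;	/* set in step 2 */
static int oom_victims;		/* selected victims that have not exited yet */

/* Models the OOM killer proper: refuses to pick a victim once disabled. */
static bool oom_kill(struct task *t)
{
	if (oom_disabled)
		return false;
	if (!t->oom_victim) {
		t->oom_victim = true;
		t->frozen = false;	/* a victim has to run in order to exit */
		oom_victims++;
	}
	return true;
}

static void victim_exit(struct task *t)
{
	assert(t->oom_victim && !t->exited);
	t->exited = true;
	oom_victims--;
}

/* Runs steps 1-3; returns how many in-flight victims had to be drained. */
static int enter_quiescent_state(struct task *tasks, size_t n)
{
	int drained = 0;

	/* 1. Freeze everything (model assumption: victims are skipped). */
	for (size_t i = 0; i < n; i++)
		if (!tasks[i].oom_victim)
			tasks[i].frozen = true;

	/* 2. Disable the OOM killer: no new victims from here on. */
	oom_disabled = true;

	/* 3. Drain: the kernel would wait; the model lets victims exit. */
	for (size_t i = 0; i < n; i++)
		if (tasks[i].oom_victim && !tasks[i].exited) {
			victim_exit(&tasks[i]);
			drained++;
		}

	assert(oom_victims == 0);
	return drained;
}
```

The ordering is what matters in this sketch: because the killer is disabled before the drain starts, the set of victims can only shrink, so the drain in step 3 terminates.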
On Fri 14-11-14 15:14:19, Tejun Heo wrote:
> On Wed, Nov 12, 2014 at 07:58:48PM +0100, Michal Hocko wrote:
> > Hi,
> > here is another take at OOM vs. PM freezer interaction fixes/cleanups.
> > The first three patches fix unlikely cases in which OOM races with
> > the PM freezer; these races should finally be closed completely. The
> > last patch is a simple code enhancement which is not strictly needed
> > but is nice to have IMO.
> >
> > Both OOM killer and PM freezer are quite subtle so I hope I haven't
> > missed anything. Any feedback is highly appreciated. I am also
> > interested in feedback on the approach used. To be honest I am not
> > really happy about spreading TIF_MEMDIE checks into freezer (patch 1)
> > but I didn't find any other way for detecting OOM killed tasks.
>
> I really don't get why this is structured this way.  Can't you just do
> the following?

Well, I liked how simple this was and localized at the only place which
matters. When I was thinking about the solution you are describing
below it was more complicated and more subtle (e.g. waiting for an OOM
victim might be tricky if it stumbles over a lock which is held by a
frozen thread which uses try_to_freeze_unsafe).
Anyway I gave it another try and will post the two patches as a reply
to this email. I hope both the interface and the implementation are
cleaner.

> 1. Freeze all freezables.  Don't worry about PF_MEMDIE.
>
> 2. Disable OOM killer.  This should be contained in the OOM killer
>    proper.  Lock out the OOM killer and disable it.
>
> 3. At this point, we know that no one will create more freezable
>    threads and no new process will be OOM killed.  Wait till there's
>    no process w/ PF_MEMDIE set.
>
> There's no reason to lock out or disable OOM killer while the system
> is not in the quiescent state, which is a big can of worms.  Bring
> down the system to the quiescent state, disable the OOM killer and
> then drain PF_MEMDIEs.
Hi,
here is another take at OOM vs. PM freezer interaction fixes/cleanups.
The first three patches fix unlikely cases in which OOM races with
the PM freezer; these races should finally be closed completely. The
last patch is a simple code enhancement which is not strictly needed
but is nice to have IMO.

Both OOM killer and PM freezer are quite subtle so I hope I haven't
missed anything. Any feedback is highly appreciated. I am also
interested in feedback on the approach used. To be honest I am not
really happy about spreading TIF_MEMDIE checks into freezer (patch 1)
but I didn't find any other way for detecting OOM killed tasks.

Changes are based on top of Linus tree (3.18-rc3).

Michal Hocko (4):
      OOM, PM: Do not miss OOM killed frozen tasks
      OOM, PM: make OOM detection in the freezer path raceless
      OOM, PM: handle pm freezer as an OOM victim correctly
      OOM: thaw the OOM victim if it is frozen

Diffstat says:
 drivers/tty/sysrq.c    |  6 ++--
 include/linux/oom.h    | 39 ++++++++++++++++------
 kernel/freezer.c       | 15 +++++++--
 kernel/power/process.c | 60 +++++++++-------------------------
 mm/memcontrol.c        |  4 ++-
 mm/oom_kill.c          | 89 ++++++++++++++++++++++++++++++++++++++------------
 mm/page_alloc.c        | 32 +++++++++---------
 7 files changed, 147 insertions(+), 98 deletions(-)
For some reason this is the previous version of the cover letter. I had
some issues with git send-email which was failing for me. Anyway, this
is the correct cover. Sorry about the confusion.

Hi,
this is another attempt to address OOM vs. PM interaction. More
about the issue is described in the last patch. The other 4 patches
are just clean ups. This is based on top of 3.18-rc3 + Johannes'
http://marc.info/?l=linux-kernel&m=141779091114777 which is not in
Andrew's tree yet but I wanted to prevent later merge conflicts.

The previous version of the main patch (5th one) was posted here:
http://marc.info/?l=linux-mm&m=141634503316543&w=2. This version has
hopefully addressed all the points raised by Tejun in the previous
version. Namely
	- checkpatch fixes + printk -> pr_* changes in the respective
	  areas
	- more comments added to clarify subtle interactions
	- oom_killer_disable(), unmark_tsk_oom_victim changed into
	  wait_event API which is easier to use

Both OOM killer and the PM freezer are really subtle so I would really
appreciate a thorough review here. I still haven't changed the
lowmemory killer, which is abusing TIF_MEMDIE, and it would break this
code (oom_victims counter balance). I plan to look at it as soon as the
rest of the series is OK and agreed as a way to go. So there will be at
least one more patch for the final submission.

Thanks!

Michal Hocko (5):
      oom: add helpers for setting and clearing TIF_MEMDIE
      OOM: thaw the OOM victim if it is frozen
      PM: convert printk to pr_* equivalent
      sysrq: convert printk to pr_* equivalent
      OOM, PM: make OOM detection in the freezer path raceless

And diffstat:
 drivers/tty/sysrq.c    |  23 ++++----
 include/linux/oom.h    |  18 +++----
 kernel/exit.c          |   3 +-
 kernel/power/process.c |  81 +++++++++-------------------
 mm/memcontrol.c        |   4 +-
 mm/oom_kill.c          | 142 +++++++++++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c        |  17 +-----
 7 files changed, 178 insertions(+), 110 deletions(-)
On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote:
> this is another attempt to address OOM vs. PM interaction. More
> about the issue is described in the last patch. The other 4 patches
> are just clean ups. This is based on top of 3.18-rc3 + Johannes'
> http://marc.info/?l=linux-kernel&m=141779091114777 which is not in
> Andrew's tree yet but I wanted to prevent later merge conflicts.

When the patches are based on a custom tree, it's often a good idea to
create a git branch of the patches to help reviewing.

Thanks.
On Sun 07-12-14 08:55:51, Tejun Heo wrote:
> On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote:
> > this is another attempt to address OOM vs. PM interaction. More
> > about the issue is described in the last patch. The other 4 patches
> > are just clean ups. This is based on top of 3.18-rc3 + Johannes'
> > http://marc.info/?l=linux-kernel&m=141779091114777 which is not in
> > Andrew's tree yet but I wanted to prevent later merge conflicts.
>
> When the patches are based on a custom tree, it's often a good idea to
> create a git branch of the patches to help reviewing.

git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to-review/make-oom-vs-pm-freezing-more-robust-2
On Sun 07-12-14 20:00:26, Michal Hocko wrote:
> On Sun 07-12-14 08:55:51, Tejun Heo wrote:
> > On Sun, Dec 07, 2014 at 11:09:53AM +0100, Michal Hocko wrote:
> > > this is another attempt to address OOM vs. PM interaction. More
> > > about the issue is described in the last patch. The other 4 patches
> > > are just clean ups. This is based on top of 3.18-rc3 + Johannes'
> > > http://marc.info/?l=linux-kernel&m=141779091114777 which is not in
> > > Andrew's tree yet but I wanted to prevent later merge conflicts.
> >
> > When the patches are based on a custom tree, it's often a good idea to
> > create a git branch of the patches to help reviewing.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git to-review/make-oom-vs-pm-freezing-more-robust-2

Are there any other concerns? Should I just resubmit (after rc1)?
diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..14f3d7fd961f 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true)) {
+		printk(KERN_INFO "OOM killer disabled\n");
+	}
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index e8d6e1058723..04b892ddca7d 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -68,22 +68,25 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
-extern bool oom_killer_disabled;
-
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
+/**
+ * oom_killer_disable - disable OOM killer in page allocator
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ *
+ * This function should be used with an extreme care and any new usage
+ * should be consulted with MM people.
+ */
+extern void oom_killer_disable(void);
 
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+extern void oom_killer_enable(void);
 
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..7d08d56cbf3f 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen. The same process that calls
@@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,27 +132,18 @@ int freeze_processes(void)
 	pm_wakeup_clear();
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
+
+	/*
+	 * Need to exlude OOM killer from triggering while tasks are
+	 * getting frozen to make sure none of them gets killed after
+	 * try_to_freeze_tasks is done.
+	 */
+	oom_killer_disable();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			printk("done.");
-		}
+		printk("done.\n");
 	}
-	printk("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b91312..7f88ddd55f80 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,23 +404,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
-/*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
- */
-static atomic_t oom_kills = ATOMIC_INIT(0);
-
-int oom_kills_count(void)
-{
-	return atomic_read(&oom_kills);
-}
-
-void note_oom_kill(void)
-{
-	atomic_inc(&oom_kills);
-}
-
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
@@ -615,8 +598,20 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 	spin_unlock(&zone_scan_lock);
 }
 
+static DECLARE_RWSEM(oom_sem);
+
+void oom_killer_disable(void)
+{
+	down_write(&oom_sem);
+}
+
+void oom_killer_enable(void)
+{
+	up_write(&oom_sem);
+}
+
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -628,7 +623,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -693,6 +688,27 @@ out:
 		schedule_timeout_killable(1);
 }
 
+/** out_of_memory - tries to invoke OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
+ * when it returns false. Otherwise returns true.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	if (!down_read_trylock(&oom_sem))
+		return false;
+	__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+	up_read(&oom_sem);
+
+	return true;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cd36b822444..d44d69aa7b70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2241,10 +2239,11 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int classzone_idx, int migratetype)
+	int classzone_idx, int migratetype, bool *oom_failed)
 {
 	struct page *page;
 
+	*oom_failed = false;
 	/* Acquire the per-zone oom lock for each zone */
 	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
 		schedule_timeout_uninterruptible(1);
@@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*oom_failed = true;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
@@ -2716,8 +2707,8 @@ rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
-			if (oom_killer_disabled)
-				goto nopage;
+			bool oom_failed;
+
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
@@ -2725,10 +2716,19 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					classzone_idx, migratetype);
+					classzone_idx, migratetype,
+					&oom_failed);
+
 			if (page)
 				goto got_pg;
 
+			/*
+			 * OOM killer might be disabled and then we have to
+			 * fail the allocation
+			 */
+			if (oom_failed)
+				goto nopage;
+
 			if (!(gfp_mask & __GFP_NOFAIL)) {
 				/*
 				 * The oom killer is not called for high-order
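The locking scheme in the mm/oom_kill.c part of the diff (the disable side takes `oom_sem` for writing, `out_of_memory()` only try-locks it for reading and reports failure if it cannot) can be approximated in userspace. The following C11 sketch is an assumption-laden illustration rather than the patch itself: a hand-rolled atomic gate replaces the kernel rwsem, the disable side busy-waits where `down_write()` would sleep, and `oom_gate`, `kills_performed`, `gate_read_trylock()` and `gate_read_unlock()` are hypothetical names. It follows the behaviour of the code in the diff (true when the killer ran, false when it is disabled), which is the reverse of how the kerneldoc comment in the patch words the return values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for oom_sem:
 *   0  = free, N > 0 = N OOM invocations in flight, -1 = disabled (writer).
 */
static atomic_int oom_gate;
static atomic_int kills_performed;

/* Models down_read_trylock(): never waits, fails while a writer holds it. */
static bool gate_read_trylock(void)
{
	int cur = atomic_load(&oom_gate);

	while (cur >= 0) {
		/* On failure, cur is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&oom_gate, &cur, cur + 1))
			return true;
	}
	return false;
}

static void gate_read_unlock(void)
{
	int prev = atomic_fetch_sub(&oom_gate, 1);

	assert(prev > 0);
	(void)prev;
}

/* Models down_write(): waits until no OOM invocation is in flight. */
static void oom_killer_disable(void)
{
	int expected = 0;

	while (!atomic_compare_exchange_weak(&oom_gate, &expected, -1))
		expected = 0;	/* busy-wait; the kernel rwsem would sleep */
}

static void oom_killer_enable(void)
{
	assert(atomic_load(&oom_gate) == -1);
	atomic_store(&oom_gate, 0);
}

/* Returns true if the (modelled) killer ran, false if it is disabled. */
static bool out_of_memory(void)
{
	if (!gate_read_trylock())
		return false;
	atomic_fetch_add(&kills_performed, 1);	/* stands in for __out_of_memory() */
	gate_read_unlock();
	return true;
}
```

A caller in the position of the page allocator would treat a `false` return as "fail the allocation", which is the role the `oom_failed` flag plays in the mm/page_alloc.c hunks.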