Message ID: 20161216155808.12809-3-mhocko@kernel.org (mailing list archive)
State: Not Applicable
On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote:
> @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> 	 * make sure exclude 0 mask - all other users should have at least
> 	 * ___GFP_DIRECT_RECLAIM to get here.
> 	 */
> -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> 		return true;

This makes sense, we should go back to what we had here. Because it's
not that the reported OOMs are premature - there is genuinely no more
memory reclaimable from the allocating context - but that this class of
allocations should never invoke the OOM killer in the first place.

> @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> 	 */
> 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
>
> +	/*
> +	 * Help non-failing allocations by giving them access to memory
> +	 * reserves but do not use ALLOC_NO_WATERMARKS because this
> +	 * could deplete whole memory reserves which would just make
> +	 * the situation worse
> +	 */
> +	page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> +	if (page)
> +		goto got_pg;
> +

But this should be a separate patch, IMO.

Do we observe GFP_NOFS lockups when we don't do this? Don't we risk
premature exhaustion of the memory reserves, when it would be better to
wait for other reclaimers to make some progress instead? Should we give
reserve access to all GFP_NOFS allocations, or just the ones from a
reclaim/cleaning context?

All that should go into the changelog of a separate allocation booster
patch, I think.
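To make the effect of the restored check concrete, here is a minimal
sketch (not part of the patch; the helper name is invented) of which
requests may still reach the OOM killer after the change:

#include <linux/gfp.h>

/* hypothetical helper mirroring the gate quoted above */
static bool may_invoke_oom_killer(gfp_t gfp_mask)
{
	/*
	 * The zero mask is excluded from the check because all other
	 * callers reaching out_of_memory() are guaranteed to carry
	 * ___GFP_DIRECT_RECLAIM.
	 */
	if (gfp_mask && !(gfp_mask & __GFP_FS))
		return false;	/* bail out, claiming progress was made */
	return true;
}

/*
 * GFP_KERNEL              -> true, the OOM killer may be invoked
 * GFP_NOFS                -> false, cannot reclaim fs caches anyway
 * GFP_NOFS | __GFP_NOFAIL -> false; before the patch the __GFP_NOFAIL
 *                            special case let this one through
 */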
On Fri 16-12-16 12:31:51, Johannes Weiner wrote:
> On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote:
> > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> > 	 * make sure exclude 0 mask - all other users should have at least
> > 	 * ___GFP_DIRECT_RECLAIM to get here.
> > 	 */
> > -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> > 		return true;
>
> This makes sense, we should go back to what we had here. Because it's
> not that the reported OOMs are premature - there is genuinely no more
> memory reclaimable from the allocating context - but that this class
> of allocations should never invoke the OOM killer in the first place.

Agreed, at least not with the current implementation. If we had proper
accounting, where we knew that the memory pinned by the fs is not
really reclaimable, then we could invoke the OOM killer and be safe.

> > @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > 	 */
> > 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
> >
> > +	/*
> > +	 * Help non-failing allocations by giving them access to memory
> > +	 * reserves but do not use ALLOC_NO_WATERMARKS because this
> > +	 * could deplete whole memory reserves which would just make
> > +	 * the situation worse
> > +	 */
> > +	page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> > +	if (page)
> > +		goto got_pg;
> > +
>
> But this should be a separate patch, IMO.
>
> Do we observe GFP_NOFS lockups when we don't do this?

This is hard to tell, but considering users like grow_dev_page we can
get stuck with very slow progress, I believe. Those allocations could
use some help.

> Don't we risk premature exhaustion of the memory reserves, when it
> would be better to wait for other reclaimers to make some progress
> instead?

Waiting for other reclaimers would be preferable, but we should at
least give these allocations some priority, which is what ALLOC_HARDER
should help with.

> Should we give reserve access to all GFP_NOFS allocations, or just
> the ones from a reclaim/cleaning context?

I would focus only on those which are important enough. Which those are
is a harder question, but certainly those with __GFP_NOFAIL are
important enough.

> All that should go into the changelog of a separate allocation
> booster patch, I think.

The reason I did both in the same patch is to address the concern about
potential lockups when NOFS|NOFAIL allocations cannot make any
progress. I've chosen ALLOC_HARDER to give them a minimum portion of
the reserves, so that we do not risk blocking out other high-priority
users, while still helping at least a bit and preventing starvation
when other reclaimers are faster to consume the reclaimed memory.

I can extend the changelog of course, but I believe that having both
changes together makes some sense. NOFS|NOFAIL allocations are not all
that rare, and sometimes we really depend on them making further
progress.
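For context on why ALLOC_HARDER only exposes "a minimum portion of the
reserves": below is a simplified sketch of the watermark logic, loosely
following __zone_watermark_ok() of this era. It is not the exact kernel
code (the highatomic reserve handling is omitted and the flag values
are illustrative; the real ones live in mm/internal.h).

/* illustrative flag values; see mm/internal.h for the real ones */
#define ALLOC_NO_WATERMARKS	0x04
#define ALLOC_HARDER		0x10
#define ALLOC_HIGH		0x20

static bool watermark_ok(long free_pages, long min, unsigned int alloc_flags)
{
	if (alloc_flags & ALLOC_NO_WATERMARKS)
		return true;	/* full access to every reserved page */
	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;	/* __GFP_HIGH: down to half the reserve */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;	/* a further quarter below the watermark */
	return free_pages > min;
}

So ALLOC_NO_WATERMARKS can drain the reserves completely, while
ALLOC_HARDER merely lowers the bar the request has to clear.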
Michal Hocko wrote:
> On Fri 16-12-16 12:31:51, Johannes Weiner wrote:
>>> @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>> 	 */
>>> 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
>>>
>>> +	/*
>>> +	 * Help non-failing allocations by giving them access to memory
>>> +	 * reserves but do not use ALLOC_NO_WATERMARKS because this
>>> +	 * could deplete whole memory reserves which would just make
>>> +	 * the situation worse
>>> +	 */
>>> +	page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
>>> +	if (page)
>>> +		goto got_pg;
>>> +
>>
>> But this should be a separate patch, IMO.
>>
>> Do we observe GFP_NOFS lockups when we don't do this?
>
> This is hard to tell, but considering users like grow_dev_page we can
> get stuck with very slow progress, I believe. Those allocations could
> use some help.
>
>> Don't we risk premature exhaustion of the memory reserves, when it
>> would be better to wait for other reclaimers to make some progress
>> instead?
>
> Waiting for other reclaimers would be preferable, but we should at
> least give these allocations some priority, which is what ALLOC_HARDER
> should help with.
>
>> Should we give reserve access to all GFP_NOFS allocations, or just
>> the ones from a reclaim/cleaning context?
>
> I would focus only on those which are important enough. Which those are
> is a harder question, but certainly those with __GFP_NOFAIL are
> important enough.
>
>> All that should go into the changelog of a separate allocation
>> booster patch, I think.
>
> The reason I did both in the same patch is to address the concern about
> potential lockups when NOFS|NOFAIL allocations cannot make any
> progress. I've chosen ALLOC_HARDER to give them a minimum portion of
> the reserves, so that we do not risk blocking out other high-priority
> users, while still helping at least a bit and preventing starvation
> when other reclaimers are faster to consume the reclaimed memory.
>
> I can extend the changelog of course, but I believe that having both
> changes together makes some sense. NOFS|NOFAIL allocations are not all
> that rare, and sometimes we really depend on them making further
> progress.
>

I feel that allowing access to memory reserves based on __GFP_NOFAIL
might not make sense. My understanding is that the actual I/O
operations triggered by filesystem code are processed by other threads.
Even if we grant access to memory reserves to GFP_NOFS | __GFP_NOFAIL
allocations by fs code, it is possible that memory allocations by the
underlying bio code fail to make further progress unless they are
granted access to memory reserves as well.

Below is a typical trace which I observe in an OOM-lockup situation
(though this trace is from an OOM stress test using XFS).

----------------------------------------
[ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 seq=48 gfp=0x2400000(GFP_NOIO) order=0 delay=430400 uninterruptible
[ 1845.187248] kworker/2:1     D12712 14498      2 0x00000080
[ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
[ 1845.187252] Call Trace:
[ 1845.187253]  ? __schedule+0x23f/0xba0
[ 1845.187254]  schedule+0x38/0x90
[ 1845.187255]  schedule_timeout+0x205/0x4a0
[ 1845.187256]  ? del_timer_sync+0xd0/0xd0
[ 1845.187257]  schedule_timeout_uninterruptible+0x25/0x30
[ 1845.187258]  __alloc_pages_nodemask+0x1035/0x10e0
[ 1845.187259]  ? alloc_request_struct+0x14/0x20
[ 1845.187261]  alloc_pages_current+0x96/0x1b0
[ 1845.187262]  ? bio_alloc_bioset+0x20f/0x2e0
[ 1845.187264]  bio_copy_kern+0xc4/0x180
[ 1845.187265]  blk_rq_map_kern+0x6f/0x120
[ 1845.187268]  __scsi_execute.isra.23+0x12f/0x160
[ 1845.187270]  scsi_execute_req_flags+0x8f/0x100
[ 1845.187271]  sr_check_events+0xba/0x2b0 [sr_mod]
[ 1845.187274]  cdrom_check_events+0x13/0x30 [cdrom]
[ 1845.187275]  sr_block_check_events+0x25/0x30 [sr_mod]
[ 1845.187276]  disk_check_events+0x5b/0x150
[ 1845.187277]  disk_events_workfn+0x17/0x20
[ 1845.187278]  process_one_work+0x1fc/0x750
[ 1845.187279]  ? process_one_work+0x167/0x750
[ 1845.187279]  worker_thread+0x126/0x4a0
[ 1845.187280]  kthread+0x10a/0x140
[ 1845.187281]  ? process_one_work+0x750/0x750
[ 1845.187282]  ? kthread_create_on_node+0x60/0x60
[ 1845.187283]  ret_from_fork+0x2a/0x40
----------------------------------------

I think that this GFP_NOIO allocation request needs to consume more
memory reserves than a GFP_NOFS allocation request in order to make
progress. Do we want to add __GFP_NOFAIL to this GFP_NOIO allocation
request in order to allow it access to memory reserves, like GFP_NOFS |
__GFP_NOFAIL allocation requests?
On Sat 17-12-16 20:17:07, Tetsuo Handa wrote:
[...]
> I feel that allowing access to memory reserves based on __GFP_NOFAIL
> might not make sense. My understanding is that the actual I/O
> operations triggered by filesystem code are processed by other threads.
> Even if we grant access to memory reserves to GFP_NOFS | __GFP_NOFAIL
> allocations by fs code, it is possible that memory allocations by the
> underlying bio code fail to make further progress unless they are
> granted access to memory reserves as well.

The IO layer should rely on mempools to guarantee forward progress.

> Below is a typical trace which I observe in an OOM-lockup situation
> (though this trace is from an OOM stress test using XFS).
>
> ----------------------------------------
> [ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 seq=48 gfp=0x2400000(GFP_NOIO) order=0 delay=430400 uninterruptible
> [ 1845.187248] kworker/2:1     D12712 14498      2 0x00000080
> [ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1845.187252] Call Trace:
> [ 1845.187253]  ? __schedule+0x23f/0xba0
> [ 1845.187254]  schedule+0x38/0x90
> [ 1845.187255]  schedule_timeout+0x205/0x4a0
> [ 1845.187256]  ? del_timer_sync+0xd0/0xd0
> [ 1845.187257]  schedule_timeout_uninterruptible+0x25/0x30
> [ 1845.187258]  __alloc_pages_nodemask+0x1035/0x10e0
> [ 1845.187259]  ? alloc_request_struct+0x14/0x20
> [ 1845.187261]  alloc_pages_current+0x96/0x1b0
> [ 1845.187262]  ? bio_alloc_bioset+0x20f/0x2e0
> [ 1845.187264]  bio_copy_kern+0xc4/0x180
> [ 1845.187265]  blk_rq_map_kern+0x6f/0x120
> [ 1845.187268]  __scsi_execute.isra.23+0x12f/0x160
> [ 1845.187270]  scsi_execute_req_flags+0x8f/0x100
> [ 1845.187271]  sr_check_events+0xba/0x2b0 [sr_mod]
> [ 1845.187274]  cdrom_check_events+0x13/0x30 [cdrom]
> [ 1845.187275]  sr_block_check_events+0x25/0x30 [sr_mod]
> [ 1845.187276]  disk_check_events+0x5b/0x150
> [ 1845.187277]  disk_events_workfn+0x17/0x20
> [ 1845.187278]  process_one_work+0x1fc/0x750
> [ 1845.187279]  ? process_one_work+0x167/0x750
> [ 1845.187279]  worker_thread+0x126/0x4a0
> [ 1845.187280]  kthread+0x10a/0x140
> [ 1845.187281]  ? process_one_work+0x750/0x750
> [ 1845.187282]  ? kthread_create_on_node+0x60/0x60
> [ 1845.187283]  ret_from_fork+0x2a/0x40
> ----------------------------------------
>
> I think that this GFP_NOIO allocation request needs to consume more
> memory reserves than a GFP_NOFS allocation request in order to make
> progress.

AFAIU, this is an allocation path which doesn't block forward progress
of regular IO. It is merely a check whether there is a new medium in
the CDROM (aka regular polling of the device). I really fail to see any
reason why this one should get any access to memory reserves at all. I
actually do not see any reason why it should be NOIO in the first
place, but I am not familiar enough with this code, so there might be
reasons for that. The fact that it might stall under heavy memory
pressure is sad, but who actually cares?

> Do we want to add __GFP_NOFAIL to this GFP_NOIO allocation
> request in order to allow it access to memory reserves, like GFP_NOFS |
> __GFP_NOFAIL allocation requests?

Why?
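For reference, the mempool pattern being referred to: a minimal sketch
using the <linux/mempool.h> API. The object cache, pool size, and
function names here are made up for illustration; the bio layer uses
the same mechanism via its bio_set pools.

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

#define MY_MIN_NR 4			/* assumption: keep 4 objects in reserve */

static struct kmem_cache *my_cache;	/* hypothetical slab cache */
static mempool_t *my_pool;

static int __init my_pool_init(void)
{
	my_cache = kmem_cache_create("my_objs", 256, 0, 0, NULL);
	if (!my_cache)
		return -ENOMEM;
	/* preallocates MY_MIN_NR objects up front */
	my_pool = mempool_create_slab_pool(MY_MIN_NR, my_cache);
	if (!my_pool) {
		kmem_cache_destroy(my_cache);
		return -ENOMEM;
	}
	return 0;
}

static void *my_alloc(void)
{
	/*
	 * With a sleepable gfp mask, mempool_alloc() does not fail: if
	 * the slab allocation cannot make progress it hands out a
	 * preallocated element, or sleeps until one is mempool_free()d.
	 * This is what guarantees forward progress without dipping
	 * into the global memory reserves.
	 */
	return mempool_alloc(my_pool, GFP_NOIO);
}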
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..12a6fce85f61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 095e2fa286de..d6bc3e4f1a0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3057,6 +3057,26 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
+			      unsigned int alloc_flags,
+			      const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			alloc_flags|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				alloc_flags, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3091,47 +3111,42 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto out;
 
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		/* Coredumps can quickly deplete all memory reserves */
-		if (current->flags & PF_DUMPCORE)
-			goto out;
-		/* The OOM killer will not help higher order allocs */
-		if (order > PAGE_ALLOC_COSTLY_ORDER)
-			goto out;
-		/* The OOM killer does not needlessly kill tasks for lowmem */
-		if (ac->high_zoneidx < ZONE_NORMAL)
-			goto out;
-		if (pm_suspended_storage())
-			goto out;
-		/*
-		 * XXX: GFP_NOFS allocations should rather fail than rely on
-		 * other request to make a forward progress.
-		 * We are in an unfortunate situation where out_of_memory cannot
-		 * do much for this context but let's try it to at least get
-		 * access to memory reserved if the current task is killed (see
-		 * out_of_memory). Once filesystems are ready to handle allocation
-		 * failures more gracefully we should just bail out here.
-		 */
+	/* Coredumps can quickly deplete all memory reserves */
+	if (current->flags & PF_DUMPCORE)
+		goto out;
+	/* The OOM killer will not help higher order allocs */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		goto out;
+	/* The OOM killer does not needlessly kill tasks for lowmem */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+	if (pm_suspended_storage())
+		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other request to make a forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserved if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
+	/* The OOM killer may not free memory on a specific node */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 
-		/* The OOM killer may not free memory on a specific node */
-		if (gfp_mask & __GFP_THISNODE)
-			goto out;
-	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
 		*did_some_progress = 1;
 
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
-		}
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		if (gfp_mask & __GFP_NOFAIL)
+			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
+					ALLOC_NO_WATERMARKS, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+	/*
+	 * Help non-failing allocations by giving them access to memory
+	 * reserves but do not use ALLOC_NO_WATERMARKS because this
+	 * could deplete whole memory reserves which would just make
+	 * the situation worse
+	 */
+	page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+	if (page)
+		goto got_pg;
+
 	cond_resched();
 	goto retry;
 }
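Finally, to illustrate the class of callers the ALLOC_HARDER hunk in
__alloc_pages_slowpath() is meant to keep moving, here is a hypothetical
filesystem-side allocation (not taken from any real filesystem; the
struct and function names are invented):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/types.h>

/* hypothetical record allocated on a transaction commit path */
struct commit_record {
	u64 transid;
	struct list_head list;
};

static struct commit_record *commit_record_alloc(void)
{
	/*
	 * GFP_NOFS: must not recurse into filesystem reclaim.
	 * __GFP_NOFAIL: failure cannot be handled, so the allocator
	 * loops instead of returning NULL. With the patch above such
	 * a request no longer invokes the OOM killer; instead, once
	 * reclaim retries are exhausted, it is given a limited dip
	 * into the memory reserves via ALLOC_HARDER rather than
	 * looping with no help at all.
	 */
	return kmalloc(sizeof(struct commit_record),
		       GFP_NOFS | __GFP_NOFAIL);
}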