Message ID | 20180620103736.13880-1-mhocko@kernel.org (mailing list archive)
---|---
State | New, archived
On Wed, Jun 20, 2018 at 12:37:36PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
> has changed the ENOMEM semantic of memcg charges. Rather than invoking
> the oom killer from the charging context it delays the oom killer to the
> page fault path (pagefault_out_of_memory). This in turn means that many
> users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
> hits the hard limit and the memcg is OOM. This behavior is inconsistent
> with the !memcg case where the oom killer is invoked from the allocation
> context and the allocator keeps retrying until it succeeds.
>
> The difference in the behavior is user visible. mmap(MAP_POPULATE) might
> result in not fully populated ranges while the mmap return code doesn't
> tell that to the userspace. Random syscalls might fail with ENOMEM etc.
>
> The primary motivation of the different memcg oom semantic was the
> deadlock avoidance. Things have changed since then, though. We have
> an async oom teardown by the oom reaper now and so we do not have to
> rely on the victim to tear down its memory anymore. Therefore we can
> return to the original semantic as long as the memcg oom killer is not
> handed over to the user space.
>
> There is still one thing to be careful about here though. If the oom
> killer is not able to make any forward progress - e.g. because there is
> no eligible task to kill - then we have to bail out of the charge path
> to prevent from the same class of deadlocks. We have basically two
> options here. Either we fail the charge with ENOMEM or force the charge
> and allow overcharge. The first option has been considered more harmful
> than useful because rare inconsistencies in the ENOMEM behavior are hard
> to test for and error prone. Basically the same reason why the page
> allocator doesn't fail allocations under such conditions. The latter
> might allow runaways but those should be really unlikely unless somebody
> misconfigures the system. E.g. allowing to migrate tasks away from the
> memcg to a different unlimited memcg with move_charge_at_immigrate
> disabled.

This is more straightforward than I thought it would be. I have no
objections to this going forward, just a couple of minor notes.

> @@ -1483,28 +1483,54 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
>  }
>
> -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> +enum oom_status {
> +	OOM_SUCCESS,
> +	OOM_FAILED,
> +	OOM_ASYNC,
> +	OOM_SKIPPED

Either SUCCESS & FAILURE, or SUCCEEDED & FAILED ;)

We're not distinguishing ASYNC and SKIPPED anywhere below, but I
cannot think of a good name to communicate them both without this
function making assumptions about the charge function's behavior.
So it's a bit weird, but probably the best way to go.

> +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> -		return;
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		return OOM_SKIPPED;
>  	/*
>  	 * We are in the middle of the charge context here, so we
>  	 * don't want to block when potentially sitting on a callstack
>  	 * that holds all kinds of filesystem and mm locks.
>  	 *
> -	 * Also, the caller may handle a failed allocation gracefully
> -	 * (like optional page cache readahead) and so an OOM killer
> -	 * invocation might not even be necessary.
> +	 * cgroup1 allows disabling the OOM killer and waiting for outside
> +	 * handling until the charge can succeed; remember the context and put
> +	 * the task to sleep at the end of the page fault when all locks are
> +	 * released.
> +	 *
> +	 * On the other hand, in-kernel OOM killer allows for an async victim
> +	 * memory reclaim (oom_reaper) and that means that we are not solely
> +	 * relying on the oom victim to make a forward progress and we can
> +	 * invoke the oom killer here.
>  	 *
> -	 * That's why we don't do anything here except remember the
> -	 * OOM context and then deal with it at the end of the page
> -	 * fault when the stack is unwound, the locks are released,
> -	 * and when we know whether the fault was overall successful.
> +	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> +	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> +	 * we would fall back to the global oom killer in pagefault_out_of_memory
>  	 */
> +	if (!memcg->oom_kill_disable) {
> +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> +			return OOM_SUCCESS;
> +
> +		WARN(!current->memcg_may_oom,
> +			"Memory cgroup charge failed because of no reclaimable memory! "
> +			"This looks like a misconfiguration or a kernel bug.");
> +		return OOM_FAILED;
> +	}
> +
> +	if (!current->memcg_may_oom)
> +		return OOM_SKIPPED;

memcg_may_oom was introduced to distinguish between userspace faults
that can OOM and contexts that return -ENOMEM. Now we're using it
slightly differently and it's confusing.

1) Why warn for kernel allocations, but not userspace ones? This
should have a comment at least.

2) We invoke the OOM killer when !memcg_may_oom. We want to OOM kill
in either case, but only set up the mem_cgroup_oom_synchronize() for
userspace faults. So the code makes sense, but a better name would be
in order -- current->in_user_fault?

>  	css_get(&memcg->css);
>  	current->memcg_in_oom = memcg;
>  	current->memcg_oom_gfp_mask = mask;
>  	current->memcg_oom_order = order;
> +
> +	return OOM_ASYNC;

In terms of code flow, it would be much clearer to handle the
memcg->oom_kill_disable case first, as a special case with early
return, and make the OOM invocation the main code of this function,
given its name.

> @@ -1994,8 +2025,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
>  	memcg_memory_event(mem_over_limit, MEMCG_OOM);
>
> -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> +	/*
> +	 * keep retrying as long as the memcg oom killer is able to make
> +	 * a forward progress or bypass the charge if the oom killer
> +	 * couldn't make any progress.
> +	 */
> +	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> +	switch (oom_status) {
> +	case OOM_SUCCESS:
> +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +		oomed = true;
> +		goto retry;
> +	case OOM_FAILED:
> +		goto force;
> +	default:
> +		goto nomem;
> +	}
>
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
> --
> 2.17.1
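The MAP_POPULATE symptom the changelog describes can be observed directly from userspace: mmap() reports success even when population stopped early because the memcg hit its hard limit, and mincore() reveals the hole. A minimal sketch (illustration only, not part of the patch or this thread; run it inside a memcg whose hard limit is below the mapping size):

	/*
	 * Sketch, not from the patch: count how many pages of a
	 * MAP_POPULATE mapping actually became resident.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 512UL << 20;	/* pick something above the memcg limit */
		long psize = sysconf(_SC_PAGESIZE);
		size_t pages = len / psize;
		unsigned char *vec = malloc(pages);
		size_t resident = 0, i;
		char *p;

		/* mmap() can succeed even if populate stopped early on memcg OOM */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* mincore() fills one byte per page; bit 0 set means resident */
		if (mincore(p, len, vec)) {
			perror("mincore");
			return 1;
		}
		for (i = 0; i < pages; i++)
			resident += vec[i] & 1;
		printf("resident %zu of %zu pages\n", resident, pages);
		return 0;
	}

Under the pre-patch semantic, a memcg OOM during the populate phase leaves "resident" short of "pages" while mmap() still returned success.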
On Wed 20-06-18 11:18:12, Johannes Weiner wrote:
> On Wed, Jun 20, 2018 at 12:37:36PM +0200, Michal Hocko wrote:
[...]
> > -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > +enum oom_status {
> > +	OOM_SUCCESS,
> > +	OOM_FAILED,
> > +	OOM_ASYNC,
> > +	OOM_SKIPPED
>
> Either SUCCESS & FAILURE, or SUCCEEDED & FAILED ;)

Sure, I will go with the latter.

> We're not distinguishing ASYNC and SKIPPED anywhere below, but I
> cannot think of a good name to communicate them both without this
> function making assumptions about the charge function's behavior.
> So it's a bit weird, but probably the best way to go.

Yeah, that was what I was fighting with. My original proposal, which
simply returned ENOMEM in the failure case, was a simple bool, but once
we have those different states and failure behavior I think it is better
to communicate that and let the caller do whatever it finds reasonable.

> > +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  {
> > -	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> > -		return;
> > +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> > +		return OOM_SKIPPED;
> >  	/*
> >  	 * We are in the middle of the charge context here, so we
> >  	 * don't want to block when potentially sitting on a callstack
> >  	 * that holds all kinds of filesystem and mm locks.
> >  	 *
> > -	 * Also, the caller may handle a failed allocation gracefully
> > -	 * (like optional page cache readahead) and so an OOM killer
> > -	 * invocation might not even be necessary.
> > +	 * cgroup1 allows disabling the OOM killer and waiting for outside
> > +	 * handling until the charge can succeed; remember the context and put
> > +	 * the task to sleep at the end of the page fault when all locks are
> > +	 * released.
> > +	 *
> > +	 * On the other hand, in-kernel OOM killer allows for an async victim
> > +	 * memory reclaim (oom_reaper) and that means that we are not solely
> > +	 * relying on the oom victim to make a forward progress and we can
> > +	 * invoke the oom killer here.
> >  	 *
> > -	 * That's why we don't do anything here except remember the
> > -	 * OOM context and then deal with it at the end of the page
> > -	 * fault when the stack is unwound, the locks are released,
> > -	 * and when we know whether the fault was overall successful.
> > +	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> > +	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> > +	 * we would fall back to the global oom killer in pagefault_out_of_memory
> >  	 */
> > +	if (!memcg->oom_kill_disable) {
> > +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> > +			return OOM_SUCCESS;
> > +
> > +		WARN(!current->memcg_may_oom,
> > +			"Memory cgroup charge failed because of no reclaimable memory! "
> > +			"This looks like a misconfiguration or a kernel bug.");
> > +		return OOM_FAILED;
> > +	}
> > +
> > +	if (!current->memcg_may_oom)
> > +		return OOM_SKIPPED;
>
> memcg_may_oom was introduced to distinguish between userspace faults
> that can OOM and contexts that return -ENOMEM. Now we're using it
> slightly differently and it's confusing.
>
> 1) Why warn for kernel allocations, but not userspace ones? This
> should have a comment at least.

I am not sure I understand. We do warn for all allocation types when
mem_cgroup_out_of_memory fails, as long as we are not in the legacy
oom_disabled case.

> 2) We invoke the OOM killer when !memcg_may_oom. We want to OOM kill
> in either case, but only set up the mem_cgroup_oom_synchronize() for
> userspace faults. So the code makes sense, but a better name would be
> in order -- current->in_user_fault?

in_user_fault is definitely better than memcg_may_oom.

> >  	css_get(&memcg->css);
> >  	current->memcg_in_oom = memcg;
> >  	current->memcg_oom_gfp_mask = mask;
> >  	current->memcg_oom_order = order;
> > +
> > +	return OOM_ASYNC;
>
> In terms of code flow, it would be much clearer to handle the
> memcg->oom_kill_disable case first, as a special case with early
> return, and make the OOM invocation the main code of this function,
> given its name.

This?

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return OOM_SKIPPED;

	/*
	 * We are in the middle of the charge context here, so we
	 * don't want to block when potentially sitting on a callstack
	 * that holds all kinds of filesystem and mm locks.
	 *
	 * cgroup1 allows disabling the OOM killer and waiting for outside
	 * handling until the charge can succeed; remember the context and put
	 * the task to sleep at the end of the page fault when all locks are
	 * released.
	 *
	 * On the other hand, in-kernel OOM killer allows for an async victim
	 * memory reclaim (oom_reaper) and that means that we are not solely
	 * relying on the oom victim to make a forward progress and we can
	 * invoke the oom killer here.
	 *
	 * Please note that mem_cgroup_oom_synchronize might fail to find a
	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
	 * we would fall back to the global oom killer in pagefault_out_of_memory
	 */
	if (memcg->oom_kill_disable) {
		if (!current->memcg_may_oom)
			return OOM_SKIPPED;
		css_get(&memcg->css);
		current->memcg_in_oom = memcg;
		current->memcg_oom_gfp_mask = mask;
		current->memcg_oom_order = order;

		return OOM_ASYNC;
	}

	if (mem_cgroup_out_of_memory(memcg, mask, order))
		return OOM_SUCCESS;

	WARN(!current->memcg_may_oom,
		"Memory cgroup charge failed because of no reclaimable memory! "
		"This looks like a misconfiguration or a kernel bug.");
	return OOM_FAILED;

Thanks!
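As an illustration of the suggested rename (the thread only settles on the field name current->in_user_fault; the helper names below are hypothetical), the flag could be wrapped in small task-flag helpers so the fault path documents when the end-of-fault mem_cgroup_oom_synchronize() is guaranteed to run:

	/*
	 * Hypothetical helpers, a sketch of the direction only: bracket
	 * userspace fault handling so the charge path knows whether
	 * mem_cgroup_oom_synchronize() will get a chance to pick up the
	 * recorded OOM context at the end of the fault.
	 */
	static inline void mem_cgroup_enter_user_fault(void)
	{
		WARN_ON(current->in_user_fault);
		current->in_user_fault = 1;
	}

	static inline void mem_cgroup_exit_user_fault(void)
	{
		WARN_ON(!current->in_user_fault);
		current->in_user_fault = 0;
	}

The arch fault handlers would then call these around handle_mm_fault() instead of setting the flag field directly.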
On Wed 20-06-18 17:31:48, Michal Hocko wrote:
> On Wed 20-06-18 11:18:12, Johannes Weiner wrote:
[...]
> > 1) Why warn for kernel allocations, but not userspace ones? This
> > should have a comment at least.
>
> I am not sure I understand. We do warn for all allocation types when
> mem_cgroup_out_of_memory fails, as long as we are not in the legacy
> oom_disabled case.

OK, I can see it now. It wasn't in the quoted context and I had just
forgotten about the WARN(!current->memcg_may_oom, ...). Well, I do not
remember why I've made it conditional, and you are right, it doesn't
make any sense. Probably a different code flow back then. Updated to
warn regardless of memcg_may_oom.
On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote:
> This?
>
> 	if (order > PAGE_ALLOC_COSTLY_ORDER)
> 		return OOM_SKIPPED;
>
> 	/*
> 	 * We are in the middle of the charge context here, so we
> 	 * don't want to block when potentially sitting on a callstack
> 	 * that holds all kinds of filesystem and mm locks.
> 	 *
> 	 * cgroup1 allows disabling the OOM killer and waiting for outside
> 	 * handling until the charge can succeed; remember the context and put
> 	 * the task to sleep at the end of the page fault when all locks are
> 	 * released.
> 	 *
> 	 * On the other hand, in-kernel OOM killer allows for an async victim
> 	 * memory reclaim (oom_reaper) and that means that we are not solely
> 	 * relying on the oom victim to make a forward progress and we can
> 	 * invoke the oom killer here.
> 	 *
> 	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> 	 * we would fall back to the global oom killer in pagefault_out_of_memory
> 	 */
> 	if (memcg->oom_kill_disable) {
> 		if (!current->memcg_may_oom)
> 			return OOM_SKIPPED;
> 		css_get(&memcg->css);
> 		current->memcg_in_oom = memcg;
> 		current->memcg_oom_gfp_mask = mask;
> 		current->memcg_oom_order = order;
>
> 		return OOM_ASYNC;
> 	}
>
> 	if (mem_cgroup_out_of_memory(memcg, mask, order))
> 		return OOM_SUCCESS;
>
> 	WARN(!current->memcg_may_oom,
> 		"Memory cgroup charge failed because of no reclaimable memory! "
> 		"This looks like a misconfiguration or a kernel bug.");
> 	return OOM_FAILED;

Yep, this looks good IMO.
On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote:
> 	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> 	 * we would fall back to the global oom killer in pagefault_out_of_memory

I can't quite figure out what this paragraph is trying to say.
"oom_synchronize might fail [...] and we have to rely on
oom_synchronize". Hm?
On Wed 20-06-18 15:38:36, Johannes Weiner wrote:
> On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote:
> > 	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> > 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> > 	 * we would fall back to the global oom killer in pagefault_out_of_memory
>
> I can't quite figure out what this paragraph is trying to say.
> "oom_synchronize might fail [...] and we have to rely on
> oom_synchronize". Hm?

Heh, vim autocompletion plus a stale comment from the previous
implementation, which returned ENOMEM on the fail path. I went with

	 * Please note that mem_cgroup_out_of_memory might fail to find a
	 * victim and then we have to bail out from the charge path.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6f0d5ef320a..7fe3ce1fd625 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1483,28 +1483,54 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
 
-static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
+enum oom_status {
+	OOM_SUCCESS,
+	OOM_FAILED,
+	OOM_ASYNC,
+	OOM_SKIPPED
+};
+
+static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
-		return;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return OOM_SKIPPED;
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
 	 * that holds all kinds of filesystem and mm locks.
 	 *
-	 * Also, the caller may handle a failed allocation gracefully
-	 * (like optional page cache readahead) and so an OOM killer
-	 * invocation might not even be necessary.
+	 * cgroup1 allows disabling the OOM killer and waiting for outside
+	 * handling until the charge can succeed; remember the context and put
+	 * the task to sleep at the end of the page fault when all locks are
+	 * released.
+	 *
+	 * On the other hand, in-kernel OOM killer allows for an async victim
+	 * memory reclaim (oom_reaper) and that means that we are not solely
+	 * relying on the oom victim to make a forward progress and we can
+	 * invoke the oom killer here.
 	 *
-	 * That's why we don't do anything here except remember the
-	 * OOM context and then deal with it at the end of the page
-	 * fault when the stack is unwound, the locks are released,
-	 * and when we know whether the fault was overall successful.
+	 * Please note that mem_cgroup_oom_synchronize might fail to find a
+	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
+	 * we would fall back to the global oom killer in pagefault_out_of_memory
 	 */
+	if (!memcg->oom_kill_disable) {
+		if (mem_cgroup_out_of_memory(memcg, mask, order))
+			return OOM_SUCCESS;
+
+		WARN(!current->memcg_may_oom,
+			"Memory cgroup charge failed because of no reclaimable memory! "
+			"This looks like a misconfiguration or a kernel bug.");
+		return OOM_FAILED;
+	}
+
+	if (!current->memcg_may_oom)
+		return OOM_SKIPPED;
 	css_get(&memcg->css);
 	current->memcg_in_oom = memcg;
 	current->memcg_oom_gfp_mask = mask;
 	current->memcg_oom_order = order;
+
+	return OOM_ASYNC;
 }
 
 /**
@@ -1899,6 +1925,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
+	bool oomed = false;
+	enum oom_status oom_status;
 
 	if (mem_cgroup_is_root(memcg))
 		return 0;
@@ -1986,6 +2014,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (nr_retries--)
 		goto retry;
 
+	if (gfp_mask & __GFP_RETRY_MAYFAIL && oomed)
+		goto nomem;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		goto force;
 
@@ -1994,8 +2025,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	memcg_memory_event(mem_over_limit, MEMCG_OOM);
 
-	mem_cgroup_oom(mem_over_limit, gfp_mask,
+	/*
+	 * keep retrying as long as the memcg oom killer is able to make
+	 * a forward progress or bypass the charge if the oom killer
+	 * couldn't make any progress.
+	 */
+	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
+	switch (oom_status) {
+	case OOM_SUCCESS:
+		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+		oomed = true;
+		goto retry;
+	case OOM_FAILED:
+		goto force;
+	default:
+		goto nomem;
+	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
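For readers tracing the new control flow, a standalone userspace model (stand-in functions only, not kernel code) of what this patch makes the tail of try_charge() do once reclaim has failed:

	#include <stdbool.h>
	#include <stdio.h>

	enum oom_status { OOM_SUCCESS, OOM_FAILED, OOM_ASYNC, OOM_SKIPPED };

	static int victims = 1;		/* eligible tasks the OOM killer can still kill */
	static int headroom;		/* charges the modeled memcg can still take */

	/* stand-in for mem_cgroup_oom() with oom_kill_disable == false */
	static enum oom_status model_oom(void)
	{
		if (victims-- <= 0)
			return OOM_FAILED;	/* no victim: no forward progress possible */
		headroom += 4;			/* killing a victim uncharges some memory */
		return OOM_SUCCESS;
	}

	/* stand-in for the charge retry loop after direct reclaim has failed */
	static int model_try_charge(bool retry_mayfail)
	{
		bool oomed = false;

		for (;;) {
			if (headroom > 0) {	/* stand-in for page_counter_try_charge() */
				headroom--;
				return 0;
			}
			if (retry_mayfail && oomed)
				return -1;	/* __GFP_RETRY_MAYFAIL: -ENOMEM after one OOM round */
			switch (model_oom()) {
			case OOM_SUCCESS:
				oomed = true;	/* a victim was killed; retry the charge */
				continue;
			case OOM_FAILED:
				return 0;	/* no victim: force the charge (overcharge) */
			default:
				return -1;	/* OOM_ASYNC/OOM_SKIPPED: fail with -ENOMEM */
			}
		}
	}

	int main(void)
	{
		printf("first charge:  %d\n", model_try_charge(false));	/* 0, after an OOM kill */
		victims = 0;
		headroom = 0;
		printf("second charge: %d\n", model_try_charge(false));	/* 0, forced */
		return 0;
	}

The model keeps only the decision structure of the patch: OOM_SUCCESS restarts the charge attempt, OOM_FAILED overcharges rather than returning ENOMEM, and __GFP_RETRY_MAYFAIL is the one opt-out that can still see -ENOMEM once an OOM round has already happened.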