From patchwork Fri Jun 1 01:21:13 2018
X-Patchwork-Submitter: Tetsuo Handa
X-Patchwork-Id: 10442333
Message-Id: <201806010121.w511LDbC077249@www262.sakura.ne.jp>
Subject: Re: [PATCH] mm, oom: Don't call schedule_timeout_killable() with oom_lock held.
From: Tetsuo Handa
To: Michal Hocko
Cc: Andrew Morton, torvalds@linux-foundation.org, guro@fb.com, rientjes@google.com, hannes@cmpxchg.org, vdavydov.dev@gmail.com, tj@kernel.org, linux-mm@kvack.org
Date: Fri, 01 Jun 2018 10:21:13 +0900
References: <7276d450-5e66-be56-3a17-0fc77596a3b6@i-love.sakura.ne.jp> <20180531184721.GU15278@dhcp22.suse.cz>
In-Reply-To: <20180531184721.GU15278@dhcp22.suse.cz>
Michal Hocko wrote:
> On Fri 01-06-18 00:23:57, Tetsuo Handa wrote:
> > On 2018/05/31 19:44, Michal Hocko wrote:
> > > On Thu 31-05-18 19:10:48, Tetsuo Handa wrote:
> > >> On 2018/05/30 8:07, Andrew Morton wrote:
> > >>> On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko wrote:
> > >>>
> > >>>>> I suggest applying this patch first, and then fix the
> > >>>>> "mm, oom: cgroup-aware OOM killer" patch.
> > >>>>
> > >>>> Well, I hope the whole pile gets merged in the upcoming merge
> > >>>> window rather than stall even more.
> > >>>
> > >>> I'm more inclined to drop it all. David has identified significant
> > >>> shortcomings and I'm not seeing a way of addressing those
> > >>> shortcomings in a backward-compatible fashion. Therefore there is
> > >>> no way forward at present.
> > >>>
> > >>
> > >> Can we apply my patch as-is first?
> > >
> > > No, as already explained before. Sprinkling new sleeps without a
> > > strong reason is not acceptable. The issue you are seeing is pretty
> > > artificial and as such doesn't really warrant an immediate fix. We
> > > should rather go with a well-thought-through fix. In other words, we
> > > should simply drop the sleep inside the oom_lock for starters, unless
> > > it causes some really unexpected behavior change.
> > >
> >
> > The OOM killer did not require schedule_timeout_killable(1) to return
> > as long as the OOM victim could call __mmput(). But now the OOM killer
> > requires schedule_timeout_killable(1) to return in order to allow the
> > OOM victim to call __oom_reap_task_mm(). Thus, this is a regression.
> >
> > Being artificial is not a reason to postpone my patch. If we did not
> > care about artificial or malicious workloads, we would not need to
> > care about the Spectre/Meltdown bugs either.
> >
> > I'm not sprinkling new sleeps. I'm merging existing sleeps (i.e. the
> > mutex_trylock() case and the !mutex_trylock() case) and updating the
> > outdated comments.
>
> Sigh. So what exactly is wrong with going simple and doing
> http://lkml.kernel.org/r/20180528124313.GC27180@dhcp22.suse.cz ?
>

Because

(1) You are trying to apply this fix after Roman's patchset, which
    Andrew Morton is more inclined to drop.

(2) You are making this fix difficult to backport, because this patch
    depends on Roman's patchset.

(3) You are not fixing the bug in Roman's patchset.

(4) You are not updating the outdated comment in my patch and Roman's
    patchset.

(5) You are not evaluating the side effect of not sleeping outside of
    the OOM path, even though you said

> If we _really_ need to touch should_reclaim_retry then
> it should be done in a separate patch with some numbers/tracing
> data backing that story.

    and I consider that "whether moving the short sleep to
    should_reclaim_retry() has a negative impact" and "whether
    eliminating the short sleep has a negative impact" should be
    evaluated in a separate patch.

But I will tolerate the patch below if you can accept it "as-is"
(because it explicitly notes what actually happens, and there might be
an unexpected side effect of not sleeping outside of the OOM path).
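The effect being argued about here is easy to demonstrate outside of the
kernel. Below is a minimal userspace sketch (purely illustrative, not
kernel code; all names and timings are invented for the demo): one
thread stands in for the task that naps while holding oom_lock, the
other for a task, such as the exiting victim or the OOM reaper, that
must take the same lock before it can free memory. Every nap the holder
takes is charged directly to the waiter's latency, and since the holder
re-acquires the non-fair mutex immediately, the waiter can be starved
for a large fraction of the holder's whole loop.

	/*
	 * Illustrative userspace analogue only -- not kernel code.
	 *
	 * Build: cc -O2 -pthread sleep_under_lock.c -o sleep_under_lock
	 */
	#include <pthread.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Stands in for the task that naps while holding oom_lock. */
	static void *lock_holder(void *arg)
	{
		int i;

		for (i = 0; i < 100; i++) {
			pthread_mutex_lock(&oom_lock);
			usleep(1000); /* the schedule_timeout_killable(1) analogue */
			pthread_mutex_unlock(&oom_lock);
		}
		return NULL;
	}

	/* Stands in for a task that must take the lock to free memory. */
	static void *memory_freer(void *arg)
	{
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		pthread_mutex_lock(&oom_lock); /* cannot "free" anything until here */
		pthread_mutex_unlock(&oom_lock);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("freer waited %.3f seconds for the lock\n",
		       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
		return NULL;
	}

	int main(void)
	{
		pthread_t holder, freer;

		pthread_create(&holder, NULL, lock_holder, NULL);
		usleep(100); /* let the holder take the lock first */
		pthread_create(&freer, NULL, memory_freer, NULL);
		pthread_join(holder, NULL);
		pthread_join(freer, NULL);
		return 0;
	}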
From 4f8dbcd2a7edde2dddf4b1bd364bb2fa930b0819 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Fri, 1 Jun 2018 09:35:01 +0900
Subject: [PATCH v2] mm, oom: remove sleep from under oom_lock

Tetsuo has pointed out that since commit 27ae357fa82be5ab ("mm, oom:
fix concurrent munlock and oom reaper unmap, v3") we have a strong
synchronization between the OOM killer and the victim's exit path,
because both have to take the oom_lock. Therefore the original
heuristic to sleep for a short time in out_of_memory() no longer serves
its original purpose.

Moreover, Tetsuo has noticed that the short sleep can be more harmful
than actually useful. Hammering the system with a few dozen processes
can block the task holding the oom_lock "forever", and the system gets
completely locked up because neither the OOM victims nor the OOM reaper
can make any progress.

Drop the short sleep from out_of_memory() when we hold the lock
(without examining the side effect of not sleeping outside of the
lock). But keep the short sleep taken when mutex_trylock() fails, in
order to throttle the concurrent OOM paths a bit. This should be solved
in a more reasonable way (e.g. a sleep proportional to the time spent
in active reclaim), but that is a much more complex thing to achieve.
This is a quick fixup to remove stale code.

Reported-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
 mm/oom_kill.c | 58 ++++++++++++++++++++++++----------------------------------
 1 file changed, 24 insertions(+), 34 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 565e7da..7940c53 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -357,7 +357,8 @@ int oom_evaluate_task(struct task_struct *task, void *arg)
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to
+ * INFLIGHT_VICTIM.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -499,6 +500,21 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+/*
+ * We have to make sure not to cause premature new oom victim selection.
+ *
+ * __alloc_pages_may_oom()                   oom_reap_task_mm()/exit_mmap()
+ *   mutex_trylock(&oom_lock)
+ *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+ *                                             unmap_page_range() # frees some memory
+ *                                             set_bit(MMF_OOM_SKIP)
+ *   out_of_memory()
+ *     select_bad_process()
+ *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+ *   mutex_unlock(&oom_lock)
+ *
+ * Therefore, the callers hold oom_lock when calling this function.
+ */
 void __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -543,20 +559,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
@@ -1099,7 +1101,6 @@ bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
-	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1149,32 +1150,21 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
-		delay = true;
-		goto out;
-	}
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+		return true;
 
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
*/ - if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { + if (!oc->chosen_task) { + if (is_sysrq_oom(oc) || is_memcg_oom(oc)) + return false; dump_header(oc, NULL); panic("Out of memory and no killable processes...\n"); } - if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) { + if (oc->chosen_task != INFLIGHT_VICTIM) oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory"); - delay = true; - } - -out: - /* - * Give the killed process a good chance to exit before trying - * to allocate memory again. - */ - if (delay) - schedule_timeout_killable(1); - - return !!oc->chosen_task; + return true; } /*