From patchwork Tue Sep 25 12:03:25 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 10613887
From: Michal Hocko
To: Andrew Morton
Cc: Mel Gorman, Vlastimil Babka, David Rientjes, Andrea Argangeli,
 Zi Yan, Stefan Priebe - Profihost AG, "Kirill A. Shutemov", LKML,
 Andrea Arcangeli, Stable tree, Michal Hocko
Subject: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
Date: Tue, 25 Sep 2018 14:03:25 +0200
Message-Id: <20180925120326.24392-2-mhocko@kernel.org>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180925120326.24392-1-mhocko@kernel.org>
References: <20180925120326.24392-1-mhocko@kernel.org>

From: Andrea Arcangeli

THP allocation can be very disruptive on a NUMA system when the local
node is full or hard to reclaim. Stefan has posted an allocation stall
report on a 4.12-based SLES kernel which suggests the same issue:

[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 0000001 SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for a MADV_HUGEPAGE vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory.
: That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE for THP requests which are
requesting the direct reclaim. This effectively reverts 5265047ac301 on
the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free. While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases. The existing behaviour
is similar, if not as widespread as it applies to a corner case, but
crucially, it cannot be tuned around like zone_reclaim_mode can. The
default behaviour should always be to cause the least harm for the
common case.

If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top.
Longterm we should consider a memory policy which allows for the node
reclaim-like behavior for the specific memory ranges which would allow a

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

[mhocko@suse.com: rewrote the changelog based on the one from Andrea]
Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Cc: Zi Yan
Cc: stable # 4.1+
Reported-by: Stefan Priebe
Debugged-by: Andrea Arcangeli
Reported-by: Alex Williamson
Signed-off-by: Andrea Arcangeli
Signed-off-by: Michal Hocko
Reviewed-by: Mel Gorman
Nacked-by: David Rientjes
---
 mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..149b6f4cf023 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+			/*
+			 * We cannot invoke reclaim if __GFP_THISNODE
+			 * is set. Invoking reclaim with
+			 * __GFP_THISNODE set, would cause THP
+			 * allocations to trigger heavy swapping
+			 * despite there may be tons of free memory
+			 * (including potentially plenty of THP
+			 * already available in the buddy) on all the
+			 * other NUMA nodes.
+			 *
+			 * At most we could invoke compaction when
+			 * __GFP_THISNODE is set (but we would need to
+			 * refrain from invoking reclaim even if
+			 * compaction returned COMPACT_SKIPPED because
+			 * there wasn't enough memory to succeed
+			 * compaction). For now just avoid
+			 * __GFP_THISNODE instead of limiting the
+			 * allocation path to a strict and single
+			 * compaction invocation.
+			 *
+			 * Supposedly if direct reclaim was enabled by
+			 * the caller, the app prefers THP regardless
+			 * of the node it comes from so this would be
+			 * more desirable behavior than only
+			 * providing THP originated from the local
+			 * node in such case.
+			 */
+			if (!(gfp & __GFP_DIRECT_RECLAIM))
+				gfp |= __GFP_THISNODE;
+			page = __alloc_pages_node(hpage_node, gfp, order);
 			goto out;
 		}
 	}