From patchwork Wed Sep 4 19:54:18 2019
Date: Wed, 4 Sep 2019 12:54:18 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
cc: Andrea Arcangeli <aarcange@redhat.com>, Michal Hocko <mhocko@suse.com>,
    Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
    "Kirill A. Shutemov" <kirill@shutemov.name>,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch for-5.3 1/4] Revert "Revert "mm, thp: restore node-local hugepage allocations""

This reverts commit a8282608c88e08b1782141026eab61204c1e533f.

The reverted commit cites the original intended semantic for MADV_HUGEPAGE,
which has subsequently taken on three distinct purposes:

 - enables or disables thp for a range of memory depending on the system's
   config (is thp "enabled" set to "always" or "madvise"),

 - determines the synchronous compaction behavior for thp allocations at
   fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
   and

 - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode that only
   clears previous hugepage advice).

These are the three purposes that exist in 5.2 and that userspace has been
written around over the past several years.  Adding a NUMA locality
preference adds a fourth dimension to an already conflated advice mode.

Based on the semantic that MADV_HUGEPAGE has provided over the past several
years, there exist workloads that use the advice on the principle that the
allocation should attempt to defragment the local node before falling back.
It is agreed that remote hugepages typically (but not always) have better
access latency than remote native pages, although on Naples the two are at
parity for intersocket accesses.

The commit that this patch reverts allows hugepage allocation to
immediately allocate remotely when local memory is fragmented.  This is
contrary to the semantic of MADV_HUGEPAGE over the past several years:
memory compaction should be attempted locally before falling back.
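(Illustrative aside, not part of this patch: a minimal userspace sketch of
the advice modes listed above, assuming only the standard madvise(2)
interface; the mapping size is arbitrary.)

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 512UL << 20;	/* arbitrary 512MB anonymous region */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* enable thp for the range and request defrag effort at fault */
	madvise(p, len, MADV_HUGEPAGE);
	memset(p, 0, len);	/* first faults are where hugepages get allocated */
	/*
	 * The only way to withdraw the advice is MADV_NOHUGEPAGE, which
	 * forbids hugepages rather than merely clearing the earlier hint.
	 */
	madvise(p, len, MADV_NOHUGEPAGE);
	munmap(p, len);
	return 0;
}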
The performance degradation of remote hugepages over local hugepages on
Rome, for example, is a 53.5% increase in access latency.  For this reason,
the goal is to return to the 5.2 and earlier behavior of attempting local
defragmentation before falling back.  With the patch that is reverted by
this patch, we see performance degradations at the tail because the
allocator happily allocates the remote hugepage rather than even attempting
to make a local hugepage available.

zone_reclaim_mode is not a solution to this problem since it does not only
affect hugepage allocations but rather changes the memory allocation
strategy for *all* page allocations.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |  2 --
 mm/huge_memory.c          | 42 +++++++++++++++------------------
 mm/mempolicy.c            |  2 +-
 3 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-		unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -648,37 +648,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	gfp_t this_node = 0;
-
-#ifdef CONFIG_NUMA
-	struct mempolicy *pol;
-	/*
-	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
-	 * specified, to express a general desire to stay on the current
-	 * node for optimistic allocation attempts. If the defrag mode
-	 * and/or madvise hint requires the direct reclaim then we prefer
-	 * to fallback to other node rather than node reclaim because that
-	 * can lead to excessive reclaim even though there is free memory
-	 * on other nodes. We expect that NUMA preferences are specified
-	 * by memory policies.
-	 */
-	pol = get_vma_policy(vma, addr);
-	if (pol->mode != MPOL_BIND)
-		this_node = __GFP_THISNODE;
-	mpol_cond_put(pol);
-#endif
+	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
+	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | __GFP_THISNODE |
+		       (vma_madvised ? 0 : __GFP_NORETRY);
+
+	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+		return gfp_mask | __GFP_KSWAPD_RECLAIM;
+
+	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM | this_node);
+		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+						  __GFP_KSWAPD_RECLAIM);
+
+	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     this_node);
-	return GFP_TRANSHUGE_LIGHT | this_node;
+		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+
+	return gfp_mask;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1734,7 +1734,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task. It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);

From patchwork Wed Sep 4 19:54:20 2019
Date: Wed, 4 Sep 2019 12:54:20 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
cc: Andrea Arcangeli <aarcange@redhat.com>, Michal Hocko <mhocko@suse.com>,
    Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
    "Kirill A. Shutemov" <kirill@shutemov.name>,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch for-5.3 2/4] Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"""

This reverts commit 92717d429b38e4f9f934eed7e605cc42858f1839.

Since commit a8282608c88e ("Revert "mm, thp: restore node-local hugepage
allocations"") is reverted in this series, it is better to restore the
previous 5.2 interaction between thp allocation and the page allocator
than to attempt any consolidation or cleanup of a policy that is now
reverted.  It is less risky during an rc cycle, and subsequent patches in
this series further modify the same policy that the pre-5.3 behavior
implements.

Consolidation and cleanup can be done once a sane default page allocation
strategy is in place, so this patch reverts a cleanup of a strategy that
is now reverted and is thus the least risky option for 5.3.
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/gfp.h | 12 ++++++++----
 mm/huge_memory.c    | 27 +++++++++++++--------------
 mm/mempolicy.c      | 32 +++++++++++++++++++++++++++++---
 mm/shmem.c          |  2 +-
 4 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -510,18 +510,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node);
+			int node, bool hugepage);
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
+	alloc_pages(gfp_mask, order)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -645,30 +645,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
 	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | __GFP_THISNODE |
-		       (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 
 	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 
 	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-						  __GFP_KSWAPD_RECLAIM);
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM :
+					__GFP_KSWAPD_RECLAIM);
 
 	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
 
-	return gfp_mask;
+	return GFP_TRANSHUGE_LIGHT;
 }
 
 /* Caller must hold page table lock. */
@@ -740,8 +740,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1348,9 +1348,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (__transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
-				haddr, numa_node_id());
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1180,8 +1180,8 @@ static struct page *new_page(struct page *page, unsigned long start)
 	} else if (PageTransHuge(page)) {
 		struct page *thp;
 
-		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
-				address, numa_node_id());
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 HPAGE_PMD_ORDER);
 		if (!thp)
 			return NULL;
 		prep_transhuge_page(thp);
@@ -2083,6 +2083,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  * @vma:  Pointer to VMA or NULL if not available.
  * @addr: Virtual Address of the allocation. Must be inside the VMA.
  * @node: Which node to prefer for allocation (modulo policy).
+ * @hugepage: for hugepages try only the preferred node if possible
  *
  * This function allocates a page from the kernel page pool and applies
  * a NUMA policy associated with the VMA or the current process.
@@ -2093,7 +2094,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node)
+		unsigned long addr, int node, bool hugepage)
 {
 	struct mempolicy *pol;
 	struct page *page;
@@ -2111,6 +2112,31 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+		int hpage_node = node;
+
+		/*
+		 * For hugepage allocation and non-interleave policy which
+		 * allows the current node (or other explicitly preferred
+		 * node) we only try to allocate from the current/preferred
+		 * node and don't fall back to other nodes, as the cost of
+		 * remote accesses would likely offset THP benefits.
+		 *
+		 * If the policy is interleave, or does not allow the current
+		 * node in its nodemask, we allocate the standard way.
+		 */
+		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+			hpage_node = pol->v.preferred_node;
+
+		nmask = policy_nodemask(gfp, pol);
+		if (!nmask || node_isset(hpage_node, *nmask)) {
+			mpol_cond_put(pol);
+			page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_THISNODE, order);
+			goto out;
+		}
+	}
+
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
diff --git a/mm/shmem.c b/mm/shmem.c
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1466,7 +1466,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);

From patchwork Wed Sep 4 19:54:22 2019
Date: Wed, 4 Sep 2019 12:54:22 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
cc: Andrea Arcangeli <aarcange@redhat.com>, Michal Hocko <mhocko@suse.com>,
    Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
    "Kirill A. Shutemov" <kirill@shutemov.name>,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed

Memory compaction has a couple of significant drawbacks as the allocation
order increases, specifically:

 - isolate_freepages() is responsible for finding free pages to use as
   migration targets and is implemented as a linear scan of memory
   starting at the end of a zone,

 - failing order-0 watermark checks in memory compaction does not account
   for how far below the watermarks the zone actually is: to enable
   migration, there must be *some* free memory available.  Per the above,
   watermarks are not always sufficient if isolate_freepages() cannot find
   the free memory, and it could require hundreds of MBs of reclaim to
   even reach this threshold (read: potentially very expensive reclaim
   with no indication compaction can be successful), and

 - if compaction at this order has failed recently so that it does not
   even run as a result of deferred compaction, looping through reclaim
   is often pointless.

For hugepage allocations, these are substantial drawbacks because thp
allocations are very high order (order-9 on x86) and falling back to
reclaim can be *very* expensive without any indication that compaction
would even be successful.  Reclaim itself is unlikely to free entire
pageblocks and certainly no reliance should be put on it to do so in
isolation (recall lumpy reclaim).  This means we should avoid reclaim and
simply fail the hugepage allocation if compaction is deferred.

It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory.  If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation than to thrash the zone.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		/*
+		 * If allocating entire pageblock(s) and compaction
+		 * failed because all zones are below low watermarks
+		 * or is prohibited because it recently failed at this
+		 * order, fail immediately.
+		 *
+		 * Reclaim is
+		 *  - potentially very expensive because zones are far
+		 *    below their low watermarks or this is part of very
+		 *    bursty high order allocations,
+		 *  - not guaranteed to help because isolate_freepages()
+		 *    may not iterate over freed pages as part of its
+		 *    linear scan, and
+		 *  - unlikely to make entire pageblocks free on its
+		 *    own.
+		 */
+		if (compact_result == COMPACT_SKIPPED ||
+		    compact_result == COMPACT_DEFERRED)
+			goto nopage;
+	}
+
 	/*
 	 * Checks for costly allocations with __GFP_NORETRY, which
 	 * includes THP page fault allocations

From patchwork Wed Sep 4 19:54:25 2019
Date: Wed, 4 Sep 2019 12:54:25 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
cc: Andrea Arcangeli <aarcange@redhat.com>, Michal Hocko <mhocko@suse.com>,
    Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
    "Kirill A. Shutemov" <kirill@shutemov.name>,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [rfc 4/4] mm, page_alloc: allow hugepage fallback to remote nodes when madvised

For systems configured to always try hard to allocate transparent
hugepages (thp defrag setting of "always"), or for memory that has been
explicitly madvised with MADV_HUGEPAGE, it is often better to fall back to
remote memory to allocate the hugepage if the local allocation fails.

The point is to allow the initial call to __alloc_pages_node() to attempt
to defragment local memory to make a hugepage available, if possible,
rather than immediately falling back to remote memory.  Local hugepages
will always have better access latency than remote (huge)pages, so an
attempt to make a hugepage available locally is always preferred.

If memory compaction cannot be successful locally, however, it is likely
better to fall back to remote memory.  This could take one of two forms:
either allow immediate fallback to remote memory, or do per-zone watermark
checks.  It would be possible to fall back only when per-zone watermarks
fail for order-0 memory, since that would require local reclaim for all
subsequent faults, so remote huge allocation is likely better than
thrashing the local zone for large workloads.

In this case, it is assumed that, because the system is configured to try
hard to allocate hugepages or the vma is explicitly madvised to try hard
for hugepages, remote allocation is better once local allocation and
memory compaction have both failed.
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/mempolicy.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2133,6 +2133,17 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 			mpol_cond_put(pol);
 			page = __alloc_pages_node(hpage_node,
 						gfp | __GFP_THISNODE, order);
+
+			/*
+			 * If hugepage allocations are configured to always
+			 * synchronous compact or the vma has been madvised
+			 * to prefer hugepage backing, retry allowing remote
+			 * memory as well.
+			 */
+			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+				page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_NORETRY, order);
+
 			goto out;
 		}
 	}
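
(Illustrative aside, not part of this patch: a simplified userspace sketch
of how a workload could observe whether its madvised memory actually ended
up backed by transparent hugepages, by summing the AnonHugePages counters
in /proc/self/smaps.  A real check would restrict itself to the madvised
range by matching the mapping's address interval first.)

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	unsigned long kb, total_kb = 0;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* one AnonHugePages: line per mapping; sum them all */
		if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
			total_kb += kb;
	}
	fclose(f);
	printf("AnonHugePages total: %lu kB\n", total_kb);
	return 0;
}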