From patchwork Mon Oct 7 07:55:48 2019
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 11176919
From: Michal Hocko
To: Linus Torvalds
Cc: David Rientjes, Vlastimil Babka, Mike Kravetz, Mel Gorman, Andrew Morton, LKML, linux-mm@kvack.org, Michal Hocko
Subject: [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim
Date: Mon, 7 Oct 2019 09:55:48 +0200
Message-Id: <20191007075548.12456-1-mhocko@kernel.org>

From: David Rientjes

b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") changed the allocator to bail out early in order to prevent potentially excessive memory reclaim. However, __GFP_RETRY_MAYFAIL is designed to retry the allocation, reclaim and compaction loop for as long as there is a reasonable chance of making forward progress, and neither COMPACT_SKIPPED nor COMPACT_DEFERRED at the INIT_COMPACT_PRIORITY compaction attempt gives that feedback.

The most obviously affected subsystem is hugetlbfs, which allocates huge pages on an admin request (or via admin-configured overcommit). I have done a simple test which tries to allocate half of the memory as hugetlb pages while the memory is filled with a clean page cache. This is not an unusual situation, because we try to cache as much of the memory as possible and the sysctl/sysfs interface for allocating huge pages exists precisely so that hugetlb pages can be allocated at any time.
The system has 1GB of RAM and we are requesting 512MB worth of hugetlb pages after the memory is prefilled by a clean page cache:

root@test1:~# cat hugetlb_test.sh
set -x
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
echo 256 > /proc/sys/vm/nr_hugepages
cat /proc/sys/vm/nr_hugepages

The results for 2 consecutive runs on clean 5.3:

root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
+ date +%s
+ TS=1569905284
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
+ date +%s
+ TS=1569905311
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256

Now with b39d0ee2632d applied:

root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
+ date +%s
+ TS=1569905516
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
11
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
+ date +%s
+ TS=1569905541
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
12

The success rate went down by a factor of 20! Hugetlb allocation requests can of course fail, and it is reasonable to expect them to fail under extremely fragmented memory or heavy memory pressure, but the situation above is neither of those. Fix the regression by reverting to the previous behavior for __GFP_RETRY_MAYFAIL requests, i.e. disable the bail-out heuristic for those requests.
[mhocko@suse.com: reworded changelog]
Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Cc: Mike Kravetz
Signed-off-by: David Rientjes
Signed-off-by: Michal Hocko
Reviewed-by: Mike Kravetz
Acked-by: Vlastimil Babka
---
Hi,
this has been posted by David as an RFC [1]. David doesn't seem to appreciate the level of the regression, so I have largely rewritten the changelog to be more explicit. I haven't changed the patch itself, so I have preserved his s-o-b. I would also like to emphasise that I am not overly happy about the patch. Vlastimil has posted [2] an alternative solution which looks better, but it is also slightly more complex. We can do that in a follow-up, though, so let's go with the simplest hack^Wsolution for now.

[1] http://lkml.kernel.org/r/alpine.DEB.2.21.1910021556270.187014@chino.kir.corp.google.com
[2] http://lkml.kernel.org/r/20191001054343.GA15624@dhcp22.suse.cz

 mm/page_alloc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..01aa46acee76 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4467,12 +4467,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
+		    !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
 			 * or is prohibited because it recently failed at this
-			 * order, fail immediately.
+			 * order, fail immediately unless the allocator has
+			 * requested compaction and reclaim retry.
 			 *
 			 * Reclaim is
 			 *  - potentially very expensive because zones are far