From patchwork Tue Jan 7 20:39:56 2025
X-Patchwork-Submitter: Peter Xu <peterx@redhat.com>
X-Patchwork-Id: 13929591
From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
	Roman Gushchin, Ackerley Tng, Andrew Morton, peterx@redhat.com,
	Oscar Salvador, linux-stable
Subject: [PATCH v2 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
Date: Tue, 7 Jan 2025 15:39:56 -0500
Message-ID: <20250107204002.2683356-2-peterx@redhat.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <20250107204002.2683356-1-peterx@redhat.com>
References: <20250107204002.2683356-1-peterx@redhat.com>
MIME-Version: 1.0
Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a
process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
avoid_reserve was
introduced for a special case of CoW on hugetlb private mappings, used only
when the owner VMA is trying to allocate yet another hugetlb folio that is
not reserved within the private vma's reserve map.

Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas
hole punched by fallocate"), alloc_huge_page() was made to not consume any
global reservation as long as avoid_reserve=true.  This doesn't look
correct: even though it forces the allocation to not use the global
reservation at all, it will still take one reservation from the spool (if
the subpool exists).  Since the spool's reserved pages are taken from the
global reservation, it also takes one reservation globally, so logically it
can cause the global reservation count to go wrong.

I wrote the reproducer below to trigger this special path; every run of the
program causes the global reservation count to increment by one, until it
hits the number of free pages:

  #define _GNU_SOURCE             /* See feature_test_macros(7) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/mman.h>

  #define  MSIZE  (2UL << 20)

  int main(int argc, char *argv[])
  {
      const char *path;
      int *buf;
      int fd, ret;
      pid_t child;

      if (argc < 2) {
          printf("usage: %s \n", argv[0]);
          return -1;
      }

      path = argv[1];
      fd = open(path, O_RDWR | O_CREAT, 0666);
      if (fd < 0) {
          perror("open failed");
          return -1;
      }

      ret = fallocate(fd, 0, 0, MSIZE);
      if (ret != 0) {
          perror("fallocate");
          return -1;
      }

      buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
      if (buf == MAP_FAILED) {
          perror("mmap() failed");
          return -1;
      }

      /* Allocate a page */
      *buf = 1;

      child = fork();
      if (child == 0) {
          /* child doesn't need to do anything */
          exit(0);
      }

      /* Trigger CoW from owner */
      *buf = 2;

      munmap(buf, MSIZE);
      close(fd);
      unlink(path);
      return 0;
  }

It can only reproduce with a sub-mount when there are reserved pages on the
spool, like:

  # sysctl vm.nr_hugepages=128
  # mkdir ./hugetlb-pool
  # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool

Then run the reproducer
on the mountpoint:

  # ./reproducer ./hugetlb-pool/test

Fix it by taking the reservation from the spool if available.  In general,
avoid_reserve is IMHO more about "avoid the vma resv map", not the spool's.

I copied stable, however I have no intention of backporting if it's not a
clean cherry-pick, because a private hugetlb mapping with a fork() on top
is too rare to hit.

Cc: linux-stable
Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate")
Reviewed-by: Ackerley Tng
Tested-by: Ackerley Tng
Signed-off-by: Peter Xu
---
 mm/hugetlb.c | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 354eec6f7e84..2bf971f77553 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1394,8 +1394,7 @@ static unsigned long available_huge_pages(struct hstate *h)
 
 static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 				struct vm_area_struct *vma,
-				unsigned long address, int avoid_reserve,
-				long chg)
+				unsigned long address, long chg)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1411,10 +1410,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	if (!vma_has_reserves(vma, chg) && !available_huge_pages(h))
 		goto err;
 
-	/* If reserves cannot be used, ensure enough pages are in the pool */
-	if (avoid_reserve && !available_huge_pages(h))
-		goto err;
-
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
@@ -1430,7 +1425,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
 						       nid, nodemask);
 
-	if (folio && !avoid_reserve && vma_has_reserves(vma, chg)) {
+	if (folio && vma_has_reserves(vma, chg)) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
@@ -3047,17 +3042,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
 		if (gbl_chg < 0)
 			goto out_end_reservation;
-
-		/*
-		 * Even though there was no reservation in the region/reserve
-		 * map, there could be reservations associated with the
-		 * subpool that can be used.  This would be indicated if the
-		 * return value of hugepage_subpool_get_pages() is zero.
-		 * However, if avoid_reserve is specified we still avoid even
-		 * the subpool reservations.
-		 */
-		if (avoid_reserve)
-			gbl_chg = 1;
 	}
 
 	/* If this allocation is not consuming a reservation, charge it now.
@@ -3080,7 +3064,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 * from the global free pool (global change).  gbl_chg == 0 indicates
 	 * a reservation exists for the allocation.
 	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, avoid_reserve, gbl_chg);
+	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);