From patchwork Sat Feb 18 00:27:55 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Houghton <jthoughton@google.com>
X-Patchwork-Id: 13145388
Date: Sat, 18 Feb 2023 00:27:55 +0000
In-Reply-To: <20230218002819.1486479-1-jthoughton@google.com>
References: <20230218002819.1486479-1-jthoughton@google.com>
X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog
Message-ID: <20230218002819.1486479-23-jthoughton@google.com>
Subject: [PATCH v2 22/46] hugetlb: add HGM support to copy_hugetlb_page_range
From: James Houghton <jthoughton@google.com>
To: Mike Kravetz, Muchun Song, Peter Xu, Andrew Morton
Cc: David Hildenbrand, David Rientjes, Axel Rasmussen, Mina Almasry,
 Zach O'Keefe, Manish Mishra, Naoya Horiguchi,
 Dr. David Alan Gilbert, Matthew Wilcox (Oracle), Vlastimil Babka,
 Baolin Wang, Miaohe Lin, Yang Shi, Frank van der Linden, Jiaqi Yan,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, James Houghton

This allows fork() to work with high-granularity mappings. The page
table structure is copied such that partially mapped regions will remain
partially mapped in the same way for the new process.

A page's reference count is incremented for *each* portion of it that is
mapped in the page table. For example, if you have a PMD-mapped 1G page,
the reference count will be incremented by 512.

mapcount is handled similarly to THPs: if you're completely mapping a
hugepage, the compound_mapcount is incremented. If you're mapping only
part of it, the subpages that are getting mapped have their mapcounts
incremented instead.
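To make the accounting concrete, here is a minimal userspace sketch of
the arithmetic above (not part of the patch; the shift values are x86-64
assumptions, and the kernel-side logic lives in hugetlb_add_file_rmap()
in the diff below):

#include <stdio.h>

#define PAGE_SHIFT 12	/* 4KiB base pages (x86-64 assumption) */
#define PMD_SHIFT  21	/* 2MiB PMD leaves */
#define PUD_SHIFT  30	/* 1GiB hstate pages */

int main(void)
{
	/* A 1GiB hugetlb page mapped entirely with PMD leaves. */
	unsigned long nr_leaves = 1UL << (PUD_SHIFT - PMD_SHIFT);
	/* One get_page() per copied leaf: the "incremented by 512". */
	printf("refcount increments: %lu\n", nr_leaves);	/* 512 */

	/*
	 * Each leaf also increments the mapcount of every base page it
	 * maps, mirroring the nr_subpages loop in hugetlb_add_file_rmap().
	 */
	unsigned long nr_subpages = 1UL << (PMD_SHIFT - PAGE_SHIFT);
	printf("mapcounts bumped per leaf: %lu\n", nr_subpages);	/* 512 */
	return 0;
}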
Signed-off-by: James Houghton <jthoughton@google.com>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1a1a71868dfd..2fe1eb6897d4 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -162,6 +162,8 @@ void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
 			 struct hstate *h, struct vm_area_struct *vma);
+void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
+			   struct hstate *h, struct vm_area_struct *vma);
 
 void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 693332b7e186..210c6f2b16a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -141,6 +141,37 @@ void hugetlb_remove_rmap(struct page *subpage, unsigned long shift,
 			page_remove_rmap(subpage, vma, false);
 	}
 }
+/*
+ * hugetlb_add_file_rmap() - increment the mapcounts for file-backed hugetlb
+ * pages appropriately.
+ *
+ * For pages that are being mapped with their hstate-level PTE (e.g., a 1G page
+ * being mapped with a 1G PUD), then we increment the compound_mapcount for the
+ * head page.
+ *
+ * For pages that are being mapped with high-granularity, we increment the
+ * mapcounts for the individual subpages that are getting mapped.
+ */
+void hugetlb_add_file_rmap(struct page *subpage, unsigned long shift,
+			   struct hstate *h, struct vm_area_struct *vma)
+{
+	struct page *hpage = compound_head(subpage);
+
+	if (shift == huge_page_shift(h)) {
+		VM_BUG_ON_PAGE(subpage != hpage, subpage);
+		page_add_file_rmap(hpage, vma, true);
+	} else {
+		unsigned long nr_subpages = 1UL << (shift - PAGE_SHIFT);
+		struct page *final_page = &subpage[nr_subpages];
+
+		VM_BUG_ON_PAGE(HPageVmemmapOptimized(hpage), hpage);
+		/*
+		 * Increment the mapcount on each page that is getting mapped.
+		 */
+		for (; subpage < final_page; ++subpage)
+			page_add_file_rmap(subpage, vma, false);
+	}
+}
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
@@ -5210,7 +5241,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			    struct vm_area_struct *src_vma)
 {
 	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
+	struct hugetlb_pte src_hpte, dst_hpte;
+	struct page *ptepage, *hpage;
 	unsigned long addr;
 	bool cow = is_cow_mapping(src_vma->vm_flags);
 	struct hstate *h = hstate_vma(src_vma);
@@ -5238,18 +5270,24 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	last_addr_mask = hugetlb_mask_last_page(h);
-	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
+	addr = src_vma->vm_start;
+	while (addr < src_vma->vm_end) {
 		spinlock_t *src_ptl, *dst_ptl;
-		src_pte = hugetlb_walk(src_vma, addr, sz);
-		if (!src_pte) {
-			addr |= last_addr_mask;
+		unsigned long hpte_sz;
+
+		if (hugetlb_full_walk(&src_hpte, src_vma, addr)) {
+			addr = (addr | last_addr_mask) + sz;
 			continue;
 		}
-		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
-		if (!dst_pte) {
-			ret = -ENOMEM;
+		ret = hugetlb_full_walk_alloc(&dst_hpte, dst_vma, addr,
+					      hugetlb_pte_size(&src_hpte));
+		if (ret)
 			break;
-		}
+
+		src_pte = src_hpte.ptep;
+		dst_pte = dst_hpte.ptep;
+
+		hpte_sz = hugetlb_pte_size(&src_hpte);
 
 		/*
 		 * If the pagetables are shared don't copy or take references.
@@ -5259,13 +5297,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		 * another vma. So page_count of ptep page is checked instead
 		 * to reliably determine whether pte is shared.
 		 */
-		if (page_count(virt_to_page(dst_pte)) > 1) {
-			addr |= last_addr_mask;
+		if (hugetlb_pte_size(&dst_hpte) == sz &&
+		    page_count(virt_to_page(dst_pte)) > 1) {
+			addr = (addr | last_addr_mask) + sz;
 			continue;
 		}
 
-		dst_ptl = huge_pte_lock(h, dst, dst_pte);
-		src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
+		dst_ptl = hugetlb_pte_lock(&dst_hpte);
+		src_ptl = hugetlb_pte_lockptr(&src_hpte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 		entry = huge_ptep_get(src_pte);
 again:
@@ -5309,10 +5348,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 */
 			if (userfaultfd_wp(dst_vma))
 				set_huge_pte_at(dst, addr, dst_pte, entry);
+		} else if (!hugetlb_pte_present_leaf(&src_hpte, entry)) {
+			/* Retry the walk. */
+			spin_unlock(src_ptl);
+			spin_unlock(dst_ptl);
+			continue;
 		} else {
-			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
-			get_page(ptepage);
+			hpage = compound_head(ptepage);
+			get_page(hpage);
 
 			/*
 			 * Failing to duplicate the anon rmap is a rare case
@@ -5324,13 +5368,34 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 * need to be without the pgtable locks since we could
 			 * sleep during the process.
 			 */
-			if (!PageAnon(ptepage)) {
-				page_add_file_rmap(ptepage, src_vma, true);
-			} else if (page_try_dup_anon_rmap(ptepage, true,
+			if (!PageAnon(hpage)) {
+				hugetlb_add_file_rmap(ptepage,
+						src_hpte.shift, h, src_vma);
+			}
+			/*
+			 * It is currently impossible to get anonymous HugeTLB
+			 * high-granularity mappings, so we use 'hpage' here.
+			 *
+			 * This will need to be changed when HGM support for
+			 * anon mappings is added.
+			 */
+			else if (page_try_dup_anon_rmap(hpage, true,
 						src_vma)) {
 				pte_t src_pte_old = entry;
 				struct folio *new_folio;
 
+				/*
+				 * If we are mapped at high granularity, we
+				 * may end up allocating lots and lots of
+				 * hugepages when we only need one. Bail out
+				 * now.
+				 */
+				if (hugetlb_pte_size(&src_hpte) != sz) {
+					put_page(hpage);
+					ret = -EINVAL;
+					break;
+				}
+
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				/* Do not use reserve as it's private owned */
@@ -5342,7 +5407,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				}
 				copy_user_huge_page(&new_folio->page, ptepage, addr, dst_vma,
 						    npages);
-				put_page(ptepage);
+				put_page(hpage);
 
 				/* Install the new hugetlb folio if src pte stable */
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
@@ -5360,6 +5425,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				hugetlb_install_folio(dst_vma, dst_pte, addr, new_folio);
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
+				addr += hugetlb_pte_size(&src_hpte);
 				continue;
 			}
 
@@ -5376,10 +5442,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 
 			set_huge_pte_at(dst, addr, dst_pte, entry);
-			hugetlb_count_add(npages, dst);
+			hugetlb_count_add(
+					hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
+					dst);
 		}
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
+		addr += hugetlb_pte_size(&src_hpte);
 	}
 
 	if (cow) {
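The control-flow change above boils down to: the copy loop now advances
by the size of the leaf the walk actually found, not by a fixed
huge_page_size(h). Below is a self-contained userspace simulation of
that stepping (hugetlb_full_walk() and hugetlb_pte_size() are the
patch's real helpers; fake_full_walk() and the boundary rounding here
are simplified stand-ins, not kernel code):

#include <stdio.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PUD_SHIFT  30

/* Pretend the first half of the VMA is PMD-mapped, the rest unmapped. */
static int fake_full_walk(unsigned long addr, unsigned long half,
			  unsigned long *shift)
{
	if (addr < half) {
		*shift = PMD_SHIFT;
		return 0;	/* found a leaf */
	}
	return 1;		/* nothing mapped here */
}

int main(void)
{
	unsigned long sz = 1UL << PUD_SHIFT;	/* 1G hstate */
	unsigned long start = 0, end = 2 * sz;
	unsigned long addr = start, steps = 0;

	while (addr < end) {
		unsigned long shift;

		if (fake_full_walk(addr, sz, &shift)) {
			/* Simplified: skip to the next hstate boundary. */
			addr = (addr | (sz - 1)) + 1;
			continue;
		}
		/* Copy one leaf's worth of mapping, then advance by it. */
		addr += 1UL << shift;
		steps++;
	}
	printf("copied %lu PMD-sized pieces\n", steps);	/* 512 */
	return 0;
}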