From patchwork Mon Dec 18 07:06:32 2023
From: Xu Yu <xuyu@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: david@redhat.com
Subject: [PATCH v3 1/2] mm/khugepaged: map RO non-exclusive pte-mapped anon THPs by pmds
Date: Mon, 18 Dec 2023 15:06:32 +0800
Message-Id: <1fecc331345653b8a3ab1dc2cfb24b5f946f5569.1702882426.git.xuyu@linux.alibaba.com>
X-Mailer: git-send-email 2.37.1
MIME-Version: 1.0
In the anonymous collapse path, khugepaged always collapses a pte-mapped
hugepage by allocating a new hugepage and copying into it. In some
scenarios we need only update the page tables that map an anonymous
pte-mapped THP, in the same way as for file/shmem-backed pte-mapped THPs,
as done in commit 58ac9a8993a1 ("mm/khugepaged: attempt to map
file/shmem-backed pte-mapped THPs by pmds").

The simplest scenario that satisfies the conditions, as David points out,
is when no subpage is PageAnonExclusive (all PTEs must be R/O): in that
case we can collapse into a R/O PMD without further action.

Let's start with this simplest scenario.
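For reference, one way such a mapping can arise is sketched below. This is
an illustrative userspace program only, not part of the patch; it assumes
2 MiB THPs, MADV_HUGEPAGE support and khugepaged enabled. The idea: fault
in a PMD-mapped THP, split the PMD mapping (but not the folio) by
mprotect()ing a single subpage, then fork(), which write-protects the PTEs
and clears PageAnonExclusive on the now-shared subpages, matching case 1.

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)

int main(void)
{
	/* Over-allocate so we can pick a 2 MiB-aligned start address. */
	char *map = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *buf;

	if (map == MAP_FAILED)
		return 1;
	buf = (char *)(((uintptr_t)map + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));

	madvise(buf, HPAGE_SIZE, MADV_HUGEPAGE);
	/* Fault in the region; ideally it becomes a PMD-mapped THP. */
	memset(buf, 1, HPAGE_SIZE);

	/*
	 * mprotect() a single subpage and restore it: this splits the PMD
	 * mapping into PTEs while leaving the folio itself a THP.
	 */
	mprotect(buf, getpagesize(), PROT_READ);
	mprotect(buf, getpagesize(), PROT_READ | PROT_WRITE);

	/*
	 * fork() write-protects the PTEs and clears PageAnonExclusive on
	 * the now-shared subpages: R/O, no exclusive subpages.
	 */
	if (fork() == 0) {
		pause();	/* child keeps the pages shared */
		_exit(0);
	}

	/*
	 * Wait for khugepaged; with this patch it can remap the THP with a
	 * R/O PMD without copying (watch AnonHugePages in /proc/pid/smaps).
	 */
	pause();
	return 0;
}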
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
---
 mm/khugepaged.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88433cc25d8a..57e261387124 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1237,6 +1237,196 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
+static struct folio *find_lock_pte_mapped_folio(struct vm_area_struct *vma,
+						unsigned long addr, pmd_t *pmd)
+{
+	pte_t *pte, pteval;
+	struct folio *folio = NULL;
+
+	pte = pte_offset_map(pmd, addr);
+	if (!pte)
+		return NULL;
+
+	pteval = ptep_get_lockless(pte);
+	if (pte_none(pteval) || !pte_present(pteval))
+		goto out;
+
+	folio = vm_normal_folio(vma, addr, pteval);
+	if (unlikely(!folio) || unlikely(folio_is_zone_device(folio)))
+		goto out;
+
+	if (!folio_trylock(folio)) {
+		folio = NULL;
+		goto out;
+	}
+
+	if (!folio_try_get(folio)) {
+		folio_unlock(folio);
+		folio = NULL;
+		goto out;
+	}
+
+out:
+	pte_unmap(pte);
+	return folio;
+}
+
+static int collapse_pte_mapped_anon_thp(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long haddr, bool *mmap_locked,
+					struct collapse_control *cc)
+{
+	struct mmu_notifier_range range;
+	struct folio *folio;
+	pte_t *start_pte, *pte;
+	pmd_t *pmd, pmdval;
+	spinlock_t *pml, *ptl;
+	pgtable_t pgtable;
+	unsigned long addr;
+	int exclusive = 0;
+	bool writable = false;
+	int result, i;
+
+	/* Fast check before locking folio if already PMD-mapped */
+	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
+	if (result == SCAN_PMD_MAPPED)
+		return result;
+
+	folio = find_lock_pte_mapped_folio(vma, haddr, pmd);
+	if (!folio)
+		return SCAN_PAGE_NULL;
+	if (!folio_test_large(folio)) {
+		result = SCAN_FAIL;
+		goto drop_folio;
+	}
+	if (folio_order(folio) != HPAGE_PMD_ORDER) {
+		result = SCAN_PAGE_COMPOUND;
+		goto drop_folio;
+	}
+
+	mmap_read_unlock(mm);
+	*mmap_locked = false;
+
+	/* Prevent all access to pagetables */
+	mmap_write_lock(mm);
+	vma_start_write(vma);
+
+	result = hugepage_vma_revalidate(mm, haddr, true, &vma, cc);
+	if (result != SCAN_SUCCEED)
+		goto up_write;
+
+	result = check_pmd_still_valid(mm, haddr, pmd);
+	if (result != SCAN_SUCCEED)
+		goto up_write;
+
+	/* Recheck with mmap write lock */
+	result = SCAN_SUCCEED;
+	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+	if (!start_pte)
+		goto up_write;
+	for (i = 0, addr = haddr, pte = start_pte;
+	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
+		struct page *subpage;
+		pte_t pteval = ptep_get(pte);
+
+		if (pte_none(pteval) || !pte_present(pteval)) {
+			result = SCAN_PTE_NON_PRESENT;
+			break;
+		}
+
+		if (pte_uffd_wp(pteval)) {
+			result = SCAN_PTE_UFFD_WP;
+			break;
+		}
+
+		if (pte_write(pteval))
+			writable = true;
+
+		subpage = vm_normal_page(vma, addr, pteval);
+
+		if (unlikely(!subpage) ||
+		    unlikely(is_zone_device_page(subpage))) {
+			result = SCAN_PAGE_NULL;
+			break;
+		}
+
+		if (folio_page(folio, i) != subpage) {
+			result = SCAN_FAIL;
+			break;
+		}
+
+		if (PageAnonExclusive(subpage))
+			exclusive++;
+	}
+	pte_unmap_unlock(start_pte, ptl);
+	if (result != SCAN_SUCCEED)
+		goto up_write;
+
+	/*
+	 * Case 1:
+	 * No subpages are PageAnonExclusive (PTEs must be R/O), we can
+	 * collapse into a R/O PMD without further action.
+	 */
+	if (!(exclusive == 0 && !writable))
+		goto up_write;
+
+	/* Collapse pmd entry */
+	anon_vma_lock_write(vma->anon_vma);
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	pml = pmd_lock(mm, pmd); /* probably unnecessary */
+	pmdval = pmdp_collapse_flush(vma, haddr, pmd);
+	spin_unlock(pml);
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_remove_table_sync_one();
+
+	anon_vma_unlock_write(vma->anon_vma);
+
+	/*
+	 * Obtain a new pmd rmap before dropping pte rmaps to avoid
+	 * false-negative page_mapped().
+	 */
+	folio_get(folio);
+	page_add_anon_rmap(&folio->page, vma, haddr, RMAP_COMPOUND);
+
+	start_pte = pte_offset_map_lock(mm, &pmdval, haddr, &ptl);
+	for (i = 0, addr = haddr, pte = start_pte;
+	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
+		struct page *subpage;
+		pte_t pteval = ptep_get(pte);
+
+		ptep_clear(mm, addr, pte);
+		subpage = vm_normal_page(vma, addr, pteval);
+		page_remove_rmap(subpage, vma, false);
+	}
+	pte_unmap_unlock(start_pte, ptl);
+	folio_ref_sub(folio, HPAGE_PMD_NR);
+
+	/* Install pmd entry */
+	pgtable = pmd_pgtable(pmdval);
+	pmdval = mk_huge_pmd(&folio->page, vma->vm_page_prot);
+	spin_lock(pml);
+	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	set_pmd_at(mm, haddr, pmd, pmdval);
+	update_mmu_cache_pmd(vma, haddr, pmd);
+	spin_unlock(pml);
+
+	result = SCAN_SUCCEED;
+
+up_write:
+	mmap_write_unlock(mm);
+
+drop_folio:
+	folio_unlock(folio);
+	folio_put(folio);
+
+	/* TODO: tracepoints */
+	return result;
+}
+
 static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
@@ -1251,6 +1441,8 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
 	bool writable = false;
+	int exclusive = 0;
+	bool is_hpage = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1333,8 +1525,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 			}
 		}
 
+		if (PageAnonExclusive(page))
+			exclusive++;
+
 		page = compound_head(page);
 
+		if (compound_order(page) == HPAGE_PMD_ORDER)
+			is_hpage = true;
+
 		/*
 		 * Record which node the original page is from and save this
 		 * information to cc->node_load[].
@@ -1396,7 +1594,21 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+
+	if (is_hpage && (exclusive == 0 && !writable)) {
+		int res;
+
+		res = collapse_pte_mapped_anon_thp(mm, vma, address,
+						   mmap_locked, cc);
+		if (res == SCAN_PMD_MAPPED || res == SCAN_SUCCEED) {
+			result = res;
+			goto out;
+		}
+	}
+
 	if (result == SCAN_SUCCEED) {
+		if (!*mmap_locked)
+			mmap_read_lock(mm);
 		result = collapse_huge_page(mm, address, referenced, unmapped, cc);
 		/* collapse_huge_page will return with the mmap_lock released */