From patchwork Mon Dec 16 07:11:47 2024
X-Patchwork-Submitter: Liu Shixin
X-Patchwork-Id: 13909289
From: Liu Shixin <liushixin2@huawei.com>
To: Andrew Morton, Muchun Song, Kenneth W Chen, Kefeng Wang, Nanyong Sun
Cc: linux-mm@kvack.org, Liu Shixin
Subject: [PATCH v2] mm: hugetlb: independent PMD page table shared count
Date: Mon, 16 Dec 2024 15:11:47 +0800
Message-ID: <20241216071147.3984217-1-liushixin2@huawei.com>
X-Mailer: git-send-email 2.34.1

The folio refcount may be increased unexpectedly through try_get_folio()
by callers such as split_huge_pages. In huge_pmd_unshare(), we use the
refcount to check whether a PMD page table is shared. The check is
incorrect if the refcount has been increased by one of the above
callers, which can cause the page table to be leaked:

 BUG: Bad page state in process sh pfn:109324
 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
 flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
 page_type: f2(table)
 raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
 raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
 page dumped because: nonzero mapcount
 ...
 CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G B 6.13.0-rc2master+ #7
 Tainted: [B]=BAD_PAGE
 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
  show_stack+0x20/0x38 (C)
  dump_stack_lvl+0x80/0xf8
  dump_stack+0x18/0x28
  bad_page+0x8c/0x130
  free_page_is_bad_report+0xa4/0xb0
  free_unref_page+0x3cc/0x620
  __folio_put+0xf4/0x158
  split_huge_pages_all+0x1e0/0x3e8
  split_huge_pages_write+0x25c/0x2d8
  full_proxy_write+0x64/0xd8
  vfs_write+0xcc/0x280
  ksys_write+0x70/0x110
  __arm64_sys_write+0x24/0x38
  invoke_syscall+0x50/0x120
  el0_svc_common.constprop.0+0xc8/0xf0
  do_el0_svc+0x24/0x38
  el0_svc+0x34/0x128
  el0t_64_sync_handler+0xc8/0xd0
  el0t_64_sync+0x190/0x198

The issue may also be triggered by damon, offline_page, page_idle, etc.,
all of which can increase the refcount of a page table page. Fix it by
introducing an independent PMD page table shared count.
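To make the failure mode concrete, here is a minimal userspace model of
why a raw refcount cannot distinguish "shared PMD table" from "transient
reference". It is illustrative only, not kernel code: struct
ptdesc_model and the two check functions below are made-up names that
stand in for the real page refcount and the new pt_share_count.

#include <stdio.h>

/* Toy model of a PMD page table page. */
struct ptdesc_model {
	int refcount;       /* folio refcount: anyone may take a reference */
	int pt_share_count; /* touched only by the share/unshare paths     */
};

/* Old heuristic in huge_pmd_unshare(): "shared" iff refcount > 1. */
static int old_is_shared(const struct ptdesc_model *pt)
{
	return pt->refcount > 1;
}

/* New check: consult the dedicated counter only. */
static int new_is_shared(const struct ptdesc_model *pt)
{
	return pt->pt_share_count > 0;
}

int main(void)
{
	/* Freshly allocated, unshared PMD page table. */
	struct ptdesc_model pt = { .refcount = 1, .pt_share_count = 0 };

	/* An unrelated walker (e.g. split_huge_pages) takes a transient
	 * reference on the page table page via try_get_folio()... */
	pt.refcount++;

	/* ...so the old check wrongly reports "shared", huge_pmd_unshare()
	 * returns early, and the page table is leaked. */
	printf("old check: shared=%d (wrong)\n", old_is_shared(&pt));
	printf("new check: shared=%d (right)\n", new_is_shared(&pt));
	return 0;
}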
As described by the comment, pt_index/pt_mm/pt_frag_refcount are used
only for s390 gmap, x86 pgds and powerpc, while pt_share_count is used
only for x86/arm64/riscv pmds, so we can reuse the union field as
pt_share_count.

Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
v1->v2: Fix a build error when !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
        and modify the changelog.

 include/linux/mm.h       |  1 +
 include/linux/mm_types.h | 30 ++++++++++++++++++++++++++++++
 mm/hugetlb.c             | 16 +++++++---------
 3 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c39c4945946c..50fbf2a1b0ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3115,6 +3115,7 @@ static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
 	if (!pmd_ptlock_init(ptdesc))
 		return false;
 	__folio_set_pgtable(folio);
+	ptdesc_pmd_pts_init(ptdesc);
 	lruvec_stat_add_folio(folio, NR_PAGETABLE);
 	return true;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..332cee285662 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,6 +445,7 @@ FOLIO_MATCH(compound_head, _head_2a);
  * @pt_index:         Used for s390 gmap.
  * @pt_mm:            Used for x86 pgds.
  * @pt_frag_refcount: For fragmented page table tracking. Powerpc only.
+ * @pt_share_count:   Used for HugeTLB PMD page table share count.
  * @_pt_pad_2:        Padding to ensure proper alignment.
  * @ptl:              Lock for the page table.
  * @__page_type:      Same as page->page_type. Unused for page tables.
@@ -471,6 +472,9 @@ struct ptdesc {
 			pgoff_t pt_index;
 			struct mm_struct *pt_mm;
 			atomic_t pt_frag_refcount;
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+			atomic_t pt_share_count;
+#endif
 		};
 	};
 	union {
@@ -516,6 +520,32 @@ static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
 		const struct page *:	(const struct ptdesc *)(p),	\
 		struct page *:		(struct ptdesc *)(p)))
 
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
+{
+	atomic_set(&ptdesc->pt_share_count, 0);
+}
+
+static inline void ptdesc_pmd_pts_inc(struct ptdesc *ptdesc)
+{
+	atomic_inc(&ptdesc->pt_share_count);
+}
+
+static inline void ptdesc_pmd_pts_dec(struct ptdesc *ptdesc)
+{
+	atomic_dec(&ptdesc->pt_share_count);
+}
+
+static inline int ptdesc_pmd_pts_count(struct ptdesc *ptdesc)
+{
+	return atomic_read(&ptdesc->pt_share_count);
+}
+#else
+static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
+{
+}
+#endif
+
 /*
  * Used for sizing the vmemmap region on some architectures
  */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ea2ed8e301ef..60846b060b87 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7212,7 +7212,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		spte = hugetlb_walk(svma, saddr,
 				    vma_mmu_pagesize(svma));
 		if (spte) {
-			get_page(virt_to_page(spte));
+			ptdesc_pmd_pts_inc(virt_to_ptdesc(spte));
 			break;
 		}
 	}
@@ -7227,7 +7227,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
 		mm_inc_nr_pmds(mm);
 	} else {
-		put_page(virt_to_page(spte));
+		ptdesc_pmd_pts_dec(virt_to_ptdesc(spte));
 	}
 	spin_unlock(&mm->page_table_lock);
 out:
@@ -7239,10 +7239,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 /*
  * unmap huge page backed by shared pte.
  *
- * Hugetlb pte page is ref counted at the time of mapping. If pte is shared
- * indicated by page_count > 1, unmap is achieved by clearing pud and
- * decrementing the ref count. If count == 1, the pte page is not shared.
- *
  * Called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
@@ -7251,18 +7247,20 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep)
 {
+	unsigned long sz = huge_page_size(hstate_vma(vma));
 	pgd_t *pgd = pgd_offset(mm, addr);
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	pud_t *pud = pud_offset(p4d, addr);
 
 	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
 	hugetlb_vma_assert_locked(vma);
-	BUG_ON(page_count(virt_to_page(ptep)) == 0);
-	if (page_count(virt_to_page(ptep)) == 1)
+	if (sz != PMD_SIZE)
+		return 0;
+	if (!ptdesc_pmd_pts_count(virt_to_ptdesc(ptep)))
 		return 0;
 
 	pud_clear(pud);
-	put_page(virt_to_page(ptep));
+	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
 	mm_dec_nr_pmds(mm);
 	return 1;
 }
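Note that the counting convention differs from the old refcount scheme:
pt_share_count starts at 0 for an unshared table and counts additional
sharers, so the unshare path tests for non-zero rather than "== 1". A
rough userspace sketch of that lifecycle follows; it reflects my reading
of the patch, simplified away from the real locking (page_table_lock,
hugetlb VMA lock) and pud handling, and the helper names share()/
unshare() are invented stand-ins for huge_pmd_share()/huge_pmd_unshare().

#include <assert.h>

/* Simplified lifecycle of the new counter (assumed semantics; the real
 * field is an atomic_t inside struct ptdesc). 0 == table not shared. */
static int pt_share_count;

/* huge_pmd_share() reuse path: another mm attaches to an existing table. */
static void share(void)
{
	pt_share_count++;
}

/* huge_pmd_unshare(): detach one sharer, or report "not shared". */
static int unshare(void)
{
	if (!pt_share_count)
		return 0;       /* not shared: caller tears down normally */
	pt_share_count--;       /* drop one sharer ("clear the pud")      */
	return 1;
}

int main(void)
{
	share();                /* a second mm reuses the shared PMD table */
	assert(unshare() == 1); /* one sharer detaches */
	assert(unshare() == 0); /* no longer shared: normal teardown path  */
	return 0;
}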