From: Puranjay Mohan
To: Mike Kravetz, Andrew Morton, Lorenzo Stoakes, Greg Kroah-Hartman
Subject: [RFC PATCH 5.10.y] mm: hugetlb: call huge_pte_offset with i_mmap_rwsem held
Date: Tue, 12 Nov 2024 14:16:01 +0000
Message-ID: <20241112141601.34540-1-pjy@amazon.com>

huge_pte_offset() walks the page table to find the pte associated with the
huge page.
The PMD page can be shared with another process, and pmd unsharing is
possible, so the PUD-ranged pgtable page can go away while the page table
walk is under way. This can happen through a pmd unshare followed by a
munmap() in the other process.

Protect against this race by taking i_mmap_rwsem before the page table walk
starts and holding it until the pointer to the PMD is no longer in use.

The upstream kernel fixes this issue with a new lock [1], but backporting
the whole series is not trivial. This patch is my attempt at a fix, sent as
an RFC to get feedback on whether another approach would be better. Once
there is a path forward, I will send the resulting patch to all stable
branches that have this issue.

[1] https://lwn.net/Articles/908092/

Here is an example kernel panic caused by the issue fixed in this patch:

Unable to handle kernel paging request at virtual address ffffffffc0000698
Mem abort info:
  ESR = 0x96000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x00000004
  CM = 0, WnR = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000729ec000
[ffffffffc0000698] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] SMP
Modules linked in: xt_tcpmss ip6table_filter tcp_diag [.....]
CPU: 3 PID: 62456 Comm: postgres Not tainted 5.10.184-175.731.amzn2.aarch64 #1
Hardware name: Amazon EC2 caspianr1g.16xlarge/, BIOS 1.0 11/1/2018
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
pc : huge_pte_offset+0x88/0x118
lr : hugetlb_fault+0x60/0x5f0
sp : ffff80001c6d3d00
x29: ffff80001c6d3d00 x28: ffff0003cdfa0000
x27: 0000000000000000 x26: ffff0003ce90d6a8
x25: 0000000000000007 x24: 00000a60da660000
x23: ffff0003ce90d640 x22: ffff800012256ed8
x21: 00000a60da600000 x20: ffff00040388d130
x19: ffff00040388d130 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000
x9 : ffff800010345880 x8 : 0000000000000000
x7 : 0000000040000000 x6 : 0000000040000000
x5 : 0000000000000183 x4 : 00000000000000d3
x3 : ffffffffc0000000 x2 : 0000000000200000
x1 : 00000a60da600000 x0 : ffffffffc0000698
Call trace:
 huge_pte_offset+0x88/0x118
 handle_mm_fault+0x1b0/0x240
 do_page_fault+0x150/0x420
 do_translation_fault+0xb8/0xf4
 do_mem_abort+0x48/0xa8
 el0_da+0x44/0x80
 el0_sync_handler+0xe0/0x120
Code: eb00005f 54000380 d3557424 8b040c60 (f8647863)
---[ end trace 6cffaf3375de3ad9 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x0804800e,7a00a238
Memory Limit: 2048 MB
---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

Signed-off-by: Puranjay Mohan
---
 mm/hugetlb.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 02b7c8f9b0e87..a991b62afac4e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4545,7 +4545,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct address_space *mapping;
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
 
+	mapping = vma->vm_file->f_mapping;
+	i_mmap_lock_read(mapping);
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
 		/*
@@ -4556,10 +4558,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
+			i_mmap_unlock_read(mapping);
 			return 0;
-		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
+			i_mmap_unlock_read(mapping);
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
+		}
 	}
 
 	/*
@@ -4573,8 +4578,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * is OK, as huge_pte_alloc will return the same value unless
 	 * something has changed.
 	 */
-	mapping = vma->vm_file->f_mapping;
-	i_mmap_lock_read(mapping);
 	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
 	if (!ptep) {
 		i_mmap_unlock_read(mapping);