From patchwork Thu Jan 19 21:22:57 2023
X-Patchwork-Submitter: "Edgecombe, Rick P"
X-Patchwork-Id: 13108797
From: Rick Edgecombe
To: x86@kernel.org, "H . Peter Anvin", Thomas Gleixner, Ingo Molnar,
    linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-mm@kvack.org, linux-arch@vger.kernel.org,
    linux-api@vger.kernel.org, Arnd Bergmann, Andy Lutomirski,
    Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
    Eugene Syromiatnikov, Florian Weimer, "H . J . Lu", Jann Horn,
    Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov,
    Pavel Machek, Peter Zijlstra, Randy Dunlap, Weijiang Yang,
    "Kirill A . Shutemov", John Allen, kcc@google.com, eranian@google.com,
    rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com,
    akpm@linux-foundation.org, Andrew.Cooper3@citrix.com,
    christina.schimpe@intel.com
Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu
Subject: [PATCH v5 19/39] mm: Fixup places that call pte_mkwrite() directly
Date: Thu, 19 Jan 2023 13:22:57 -0800
Message-Id: <20230119212317.8324-20-rick.p.edgecombe@intel.com>
In-Reply-To: <20230119212317.8324-1-rick.p.edgecombe@intel.com>
References: <20230119212317.8324-1-rick.p.edgecombe@intel.com>

From: Yu-cheng Yu

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

With the introduction of shadow stack memory there are two ways a pte can
be writable: regular writable memory and shadow stack memory.

In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
or pte_mkwrite_shstk() depending on the VMA flag.
This covers most cases where a PTE is made writable. However, there are
places where pte_mkwrite() is called directly and the logic should now also
create a shadow stack PTE in the case of a shadow stack VMA.

 - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
   directly and call pte_mkwrite(). Teach them about pte_mkwrite_shstk().

 - When userfaultfd is creating a PTE after userspace handles the fault,
   it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk().

To make the code cleaner, introduce is_shstk_write(), which simplifies
checking for VM_WRITE | VM_SHADOW_STACK together.

In other cases where pte_mkwrite() is called directly, the VMA will not be
VM_SHADOW_STACK, and so shadow stack memory should not be created.

 - In the case of pte_savedwrite(), shadow stack VMAs are excluded.

 - In the case of the "dirty_accountable" optimization in mprotect(),
   shadow stack VMAs won't be VM_SHARED, so it is not necessary.

Tested-by: Pengfei Xu
Tested-by: John Allen
Signed-off-by: Yu-cheng Yu
Co-developed-by: Rick Edgecombe
Signed-off-by: Rick Edgecombe
Cc: Kees Cook
Reviewed-by: Kees Cook
---

v5:
 - Fix typo in commit log

v3:
 - Restore do_anonymous_page() that accidentally moved commits (Kirill)
 - Open code maybe_mkwrite() cases from v2, so the behavior doesn't change
   to mark non-writable PTEs dirty. (Nadav)

v2:
 - Updated commit log with comments from Dave Hansen
 - Dave also suggested (I understood) to maybe tweak vm_get_page_prot()
   to avoid having to call maybe_mkwrite(). After playing around with
   this I opted to *not* do this. Shadow stack memory is effectively
   writable, so having the default permissions be writable ended up
   mapping the zero page as writable and other surprises. So creating
   shadow stack memory needs to be done with manual logic like
   pte_mkwrite().
 - Drop change in change_pte_range() because it couldn't actually trigger
   for shadow stack VMAs.
 - Clarify reasoning for skipped cases of pte_mkwrite().
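
Illustration (not part of the patch): the commit log above describes two
things, a combined VM_WRITE | VM_SHADOW_STACK test and an if / else-if
ordering that picks the shadow stack variant before the plain writable one.
The user-space sketch below mimics only that shape. Every DEMO_* and demo_*
name is hypothetical, and the flag values are made up for the example; they
are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration only; NOT the kernel's. */
#define DEMO_VM_WRITE        0x1UL
#define DEMO_VM_SHADOW_STACK 0x2UL

/* Stand-in for "which kind of PTE would be created". */
enum demo_pte_kind {
	DEMO_PTE_NOT_WRITABLE,
	DEMO_PTE_WRITE,
	DEMO_PTE_WRITE_SHSTK,
};

/* Same mask-and-compare shape as is_shstk_write(): both bits must be set. */
bool demo_is_shstk_write(unsigned long vm_flags)
{
	const unsigned long mask = DEMO_VM_SHADOW_STACK | DEMO_VM_WRITE;

	return (vm_flags & mask) == mask;
}

/* Same branch ordering as the patched callers: shadow stack case first. */
enum demo_pte_kind demo_pick_pte_kind(unsigned long vm_flags)
{
	if (demo_is_shstk_write(vm_flags))
		return DEMO_PTE_WRITE_SHSTK;
	else if (vm_flags & DEMO_VM_WRITE)
		return DEMO_PTE_WRITE;
	return DEMO_PTE_NOT_WRITABLE;
}

/* Small self-check covering the four flag combinations. */
void demo_self_check(void)
{
	assert(demo_pick_pte_kind(0) == DEMO_PTE_NOT_WRITABLE);
	assert(demo_pick_pte_kind(DEMO_VM_WRITE) == DEMO_PTE_WRITE);
	assert(demo_pick_pte_kind(DEMO_VM_SHADOW_STACK) == DEMO_PTE_NOT_WRITABLE);
	assert(demo_pick_pte_kind(DEMO_VM_WRITE | DEMO_VM_SHADOW_STACK) ==
	       DEMO_PTE_WRITE_SHSTK);
}
```

The point of the sketch is the ordering: a plain "flags & WRITE" test is
also true when both bits are set, so the combined test has to run first.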

Yu-cheng v25:
 - Apply same changes to do_huge_pmd_numa_page() as to do_numa_page().

 arch/x86/include/asm/pgtable.h |  3 +++
 arch/x86/mm/pgtable.c          |  6 ++++++
 include/linux/pgtable.h        |  7 +++++++
 mm/memory.c                    |  5 ++++-
 mm/migrate_device.c            |  4 +++-
 mm/userfaultfd.c               | 10 +++++++---
 6 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 45b1a8f058fe..87d3068734ec 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -951,6 +951,9 @@ static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
 }
 #endif  /* CONFIG_PAGE_TABLE_ISOLATION */
 
+#define is_shstk_write is_shstk_write
+extern bool is_shstk_write(unsigned long vm_flags);
+
 #endif	/* __ASSEMBLY__ */
 
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e4f499eb0f29..d103945ba502 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -880,3 +880,9 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #endif /* CONFIG_X86_64 */
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+bool is_shstk_write(unsigned long vm_flags)
+{
+	return (vm_flags & (VM_SHADOW_STACK | VM_WRITE)) ==
+	       (VM_SHADOW_STACK | VM_WRITE);
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 14a820a45a37..49ce1f055242 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1578,6 +1578,13 @@ static inline bool arch_has_pfn_modify_check(void)
 }
 #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
 
+#ifndef is_shstk_write
+static inline bool is_shstk_write(unsigned long vm_flags)
+{
+	return false;
+}
+#endif
+
 /*
  * Architecture PAGE_KERNEL_* fallbacks
  *
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..5e5107232a26 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4088,7 +4088,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 
 	entry = mk_pte(page, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
+
+	if (is_shstk_write(vma->vm_flags))
+		entry = pte_mkwrite_shstk(pte_mkdirty(entry));
+	else if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 721b2365dbca..53d417683e01 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -645,7 +645,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 			goto abort;
 		}
 		entry = mk_pte(page, vma->vm_page_prot);
-		if (vma->vm_flags & VM_WRITE)
+		if (is_shstk_write(vma->vm_flags))
+			entry = pte_mkwrite_shstk(pte_mkdirty(entry));
+		else if (vma->vm_flags & VM_WRITE)
 			entry = pte_mkwrite(pte_mkdirty(entry));
 	}
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0499907b6f1a..832f0250ca61 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	int ret;
 	pte_t _dst_pte, *dst_pte;
 	bool writable = dst_vma->vm_flags & VM_WRITE;
+	bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
 	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
 	bool page_in_cache = page_mapping(page);
 	spinlock_t *ptl;
@@ -84,9 +85,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		writable = false;
 	}
 
-	if (writable)
-		_dst_pte = pte_mkwrite(_dst_pte);
-	else
+	if (writable) {
+		if (shstk)
+			_dst_pte = pte_mkwrite_shstk(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	} else
 		/*
 		 * We need this to make sure write bit removed; as mk_pte()
 		 * could return a pte with write bit set.