From patchwork Fri Jan 10 18:40:26 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Brendan Jackman
X-Patchwork-Id: 13935561
Date: Fri, 10 Jan 2025 18:40:26 +0000
X-Change-Id: 20241115-asi-rfc-v2-5d9bbb441186
X-Mailer: b4 0.15-dev
Message-ID: <20250110-asi-rfc-v2-v2-0-8419288bc805@google.com>
Subject: [PATCH RFC v2 00/29] Address Space Isolation (ASI)
From: Brendan Jackman
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
 "H. Peter Anvin", Andy Lutomirski, Peter Zijlstra, Richard Henderson,
 Matt Turner, Vineet Gupta, Russell King, Catalin Marinas, Will Deacon,
 Guo Ren, Brian Cain, Huacai Chen, WANG Xuerui, Geert Uytterhoeven,
 Michal Simek, Thomas Bogendoerfer, Dinh Nguyen, Jonas Bonn,
 Stefan Kristiansson, Stafford Horne, "James E.J. Bottomley",
 Helge Deller, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
 Naveen N Rao, Madhavan Srinivasan, Paul Walmsley, Palmer Dabbelt,
 Albert Ou, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
 Christian Borntraeger, Sven Schnelle, Yoshinori Sato, Rich Felker,
 John Paul Adrian Glaubitz, "David S. Miller", Andreas Larsson,
 Richard Weinberger, Anton Ivanov, Johannes Berg, Chris Zankel,
 Max Filippov, Arnd Bergmann, Andrew Morton, Juri Lelli,
 Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider, Uladzislau Rezki, Christoph Hellwig,
 Masami Hiramatsu, Mathieu Desnoyers, Mike Rapoport,
 Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
 Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Dennis Zhou,
 Tejun Heo, Christoph Lameter, Sean Christopherson, Paolo Bonzini,
 Ard Biesheuvel, Josh Poimboeuf, Pawan Gupta
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
 linux-alpha@vger.kernel.org, linux-snps-arc@lists.infradead.org,
 linux-arm-kernel@lists.infradead.org, linux-csky@vger.kernel.org,
 linux-hexagon@vger.kernel.org, loongarch@lists.linux.dev,
 linux-m68k@lists.linux-m68k.org, linux-mips@vger.kernel.org,
 linux-openrisc@vger.kernel.org, linux-parisc@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
 linux-s390@vger.kernel.org, linux-sh@vger.kernel.org,
 sparclinux@vger.kernel.org, linux-um@lists.infradead.org,
 linux-arch@vger.kernel.org, linux-mm@kvack.org,
 linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 kvm@vger.kernel.org, linux-efi@vger.kernel.org, Brendan Jackman,
 Junaid Shahid, Ofir Weisse, Yosry Ahmed, Kevin Cheng, Reiji Watanabe
ASI is a technique to mitigate a broad class of CPU vulnerabilities by
unmapping sensitive data from the kernel address space. If no data is
mapped that needs protecting, this class of exploits cannot leak that
data and so the kernel can skip expensive mitigation actions. For a
more detailed overview, see the v1 RFC (which was wrongly labeled as a
PATCH) [0].

This new iteration adds support for protecting against bare-metal
processes as well as KVM guests. The basic principle is unchanged.

.:: Multi-class ASI

So far ASI has been a KVM-only solution, although I've been claiming
that in principle it can be extended to also sandbox userspace. Dave
Hansen's most important feedback at LPC [1] was that he wanted some
evidence to support this claim. If it can be shown that ASI is just as
powerful for bare-metal as for KVM, it's much more likely to actually
offer an escape path from maintaining and reactively developing
per-exploit mitigations.

v1 already supported a notion of "ASI classes", with the only class
being KVM. This RFC introduces a second class for userspace. Each
process has a separate restricted address space ("domain") for each
class.

In v1, the only possible ASI transitions were between the KVM
restricted address space and the unrestricted address space. Now that
there are multiple classes, it's possible to transition directly
between two restricted address spaces.

(Could we dodge this complexity by just transitioning via the
unrestricted address space? Yes, but experience from Google's internal
deployment suggests there's a significant benefit in avoiding an
asi_exit() when switching between userspace and KVM, despite all the
optimizations that exist to avoid that switching).
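The class/domain relationship above can be pictured with a small
userspace model. This is only a sketch under stated assumptions: every
type and function name below is hypothetical and chosen for
illustration, none of it is taken from the series. It shows one domain
per class per process, and a direct transition between two restricted
address spaces that does not count as an exit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of multi-class ASI; not the API from the series. */
enum class_id {
	CLASS_KVM,
	CLASS_USERSPACE,
	NUM_CLASSES
};

/* One restricted address space ("domain"). */
struct domain {
	enum class_id class_id;
	int owner_pid;
};

/* Each process owns one domain per class. */
struct process {
	int pid;
	struct domain domains[NUM_CLASSES];
};

/* Per-CPU view; cur == NULL means the unrestricted address space. */
struct cpu_model {
	const struct domain *cur;
	unsigned int exits;	/* transitions back to unrestricted */
};

static void process_init(struct process *p, int pid)
{
	p->pid = pid;
	for (int c = 0; c < NUM_CLASSES; c++) {
		p->domains[c].class_id = (enum class_id)c;
		p->domains[c].owner_pid = pid;
	}
}

/* Enter a restricted address space directly from the current one. */
static void model_enter(struct cpu_model *cpu, const struct domain *target)
{
	assert(target != NULL);
	cpu->cur = target;
}

/* Return to the unrestricted address space, counting real exits only. */
static void model_exit(struct cpu_model *cpu)
{
	if (cpu->cur != NULL) {
		cpu->cur = NULL;
		cpu->exits++;
	}
}
```

In this model, switching a VMM from its userspace domain to its KVM
domain via model_enter() leaves the exit counter untouched, while going
through model_exit() first increments it. That counter is the cost the
direct transition is meant to avoid.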
Compared to v1, this version has a new mechanism to determine what
mitigation actions are required when switching between address spaces.
ASI classes provide a "taint policy" which describes what uarch state
their sandboxee might leave behind, and what uarch state needs to be
purged before their sandboxee can safely be run. The ASI core takes
care of doing the actual flushes. This enables a reasonably advanced
model of what flushes are needed when; for example the kernel is now
able to model "when transitioning from a VMM to its KVM guest there is
no point in flushing speculative control flow state, but if we _later_
exit to the unrestricted address space we do need to flush it". It's
quite possible this is actually more advanced than what is needed, so
suggestions are welcome.

.:: Performance issues: bogus mitigation costs

Although this implementation of ASI is pretty generous in what it
considers "nonsensitive", there remain unnecessary performance costs
that need to be addressed. For example:

- The entire page cache is removed from the direct map. Traditional
  file operations will hit an asi_exit(), paying a pointless cost to
  protect data from a process that obviously has the right to read
  that data.

- Anything that accesses guest or user memory via the direct map
  instead of the user address space will hit an asi_exit().

- Pages being zeroed in the page allocator.

Most of these issues existed in v1 too, but now that ASI sandboxes
userspace processes, the page-cache issue becomes very significant.
For FIO 4k read (I suppose this workload is maximally sensitive to
this issue) I saw a 70% degradation in throughput, on a Sapphire
Rapids machine hard-coded to perform IBPB and RSB-stuffing on
asi_exit(). Given a result like that I haven't gone into more detailed
analysis. Note also that I ran with an unrealistic mitigation policy;
the results would be very different with platform-appropriate flushes,
but would presumably lead to the same conclusion.
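Returning briefly to the taint policy described before the performance
section: the VMM-to-guest example can be modelled in a few lines of
userspace C. This is a sketch under loud assumptions. The taint bits,
the struct layout, the policy values and the rule that an exit purges
every outstanding taint are all invented here for illustration and are
not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical taint bits describing leftover uarch state. */
#define TAINT_CONTROL_FLOW	(1u << 0)	/* e.g. branch predictor state */
#define TAINT_DATA		(1u << 1)	/* e.g. data in uarch buffers */

/* Hypothetical per-class policy. */
struct taint_policy {
	uint32_t leaves_behind;		/* taints the sandboxee may set */
	uint32_t must_be_clean;		/* taints to purge before it runs */
};

struct cpu_model {
	uint32_t taints;		/* taints currently outstanding */
	unsigned int control_flushes;
	unsigned int data_flushes;
};

/* Stand-in for the real flushes: count them and clear the taints. */
static void model_flush(struct cpu_model *cpu, uint32_t which)
{
	if (which & TAINT_CONTROL_FLOW)
		cpu->control_flushes++;
	if (which & TAINT_DATA)
		cpu->data_flushes++;
	cpu->taints &= ~which;
}

/* Enter a restricted address space governed by policy p. */
static void model_enter(struct cpu_model *cpu, const struct taint_policy *p)
{
	/* Purge only what this sandboxee must not observe or be steered by. */
	model_flush(cpu, cpu->taints & p->must_be_clean);
	/* Running the sandboxee may then dirty the state again. */
	cpu->taints |= p->leaves_behind;
}

/* Exit to the unrestricted address space (assumed: purge everything). */
static void model_exit(struct cpu_model *cpu)
{
	model_flush(cpu, cpu->taints);
}
```

With an invented userspace policy of { TAINT_CONTROL_FLOW, 0 } and an
invented guest policy of { TAINT_CONTROL_FLOW, TAINT_DATA }, entering
the guest right after the VMM ran performs no control-flow flush, and
the flush only happens on the later model_exit(). The point of the
model is that the flush is deferred and performed once, by the core,
rather than eagerly by each class.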
There are some interesting discussions to be had about tackling that
problem (e.g. reintroducing "local-nonsensitivity" from Junaid's 2022
ASI implementation [2], or creating ephemeral CPU-local mappings), but
for this RFC I prefer to focus on deciding if the overall framework
makes sense.

.:: Next steps

Aside from the lack of userspace support, all the issues listed in
RFC v1 remain. I'll also need a proof-of-concept solution for the
page-cache issue before we can credibly claim to be reaching a
[PATCH], but before that I want to develop a more complete page_alloc
integration. I plan to propose a topic about that at LSF/MM/BPF.

Anyway, despite the further research needed on my side I think there's
still useful stuff to discuss here. For example:

- Does the "tainting" model make intuitive sense? Is there a simpler
  way to achieve something similar?

- The taints offer a model for different parts of the kernel to
  communicate with each other about what mitigations they've taken
  care of. For example, KVM could clear ASI taints if its existing
  conditional-L1D-flush logic fires. Does it make sense to take
  advantage of this? (I think yes). How does this influence the design
  of the bugs.c kernel arguments?

- Suggestions on how to map file pages into processes that can read
  them, while minimizing TLB management pain.

Finally, a more extensive branch can be found at [3]. It has some
tests and some of the lower-hanging fruit for optimising performance
of KVM guests.

[0] RFC v1:
    https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com/

[1] LPC session: https://lpc.events/event/18/contributions/1761/

[2] Junaid’s RFC:
    https://lore.kernel.org/all/20220223052223.1202152-1-junaids@google.com/

[3] GitHub branch:
    https://github.com/googleprodkernel/linux-kvm/tree/asi-rfcv2-preview

Signed-off-by: Brendan Jackman
Ingo Molnar, Borislav Petkov, Dave Hansen, "H.
Peter Anvin", Andy Lutomirski, Peter Zijlstra, Sean Christopherson,
Paolo Bonzini, Alexandre Chartre, Liran Alon, Jan Setje-Eilers,
Catalin Marinas, Will Deacon, Mark Rutland, Andrew Morton, Mel Gorman,
Lorenzo Stoakes, David Hildenbrand, Vlastimil Babka, Michal Hocko,
Khalid Aziz, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Valentin Schneider, Paul Turner, Reiji Watanabe,
Junaid Shahid, Ofir Weisse, Yosry Ahmed, Patrick Bellasi, KP Singh,
Alexandra Sandulescu, Matteo Rizzo, Jann Horn
kvm@vger.kernel.org, Brendan Jackman, Dennis Zhou

---
Changes in v2:
- Added support for sandboxing userspace processes.
- Link to v1: https://lore.kernel.org/r/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com

---
Brendan Jackman (21):
      mm: asi: Make some utility functions noinstr compatible
      x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
      mm: asi: Introduce ASI core API
      mm: asi: Add infrastructure for boot-time enablement
      mm: asi: ASI support in interrupts/exceptions
      mm: asi: Avoid warning from NMI userspace accesses in ASI context
      mm: Add __PAGEFLAG_FALSE
      mm: asi: Map non-user buddy allocations as nonsensitive
      [TEMP WORKAROUND] mm: asi: Workaround missing partial-unmap support
      mm: asi: Map kernel text and static data as nonsensitive
      mm: asi: Map vmalloc/vmap data as nonsensitive
      mm: asi: Stabilize CR3 in switch_mm_irqs_off()
      mm: asi: Make TLB flushing correct under ASI
      KVM: x86: asi: Restricted address space for VM execution
      mm: asi: exit ASI before accessing CR3 from C code where appropriate
      mm: asi: Add infrastructure for mapping userspace addresses
      mm: asi: Restricted execution fore bare-metal processes
      x86: Create library for flushing L1D for L1TF
      mm: asi: Add some mitigations on address space transitions
      x86/pti: Disable PTI when ASI is on
      mm: asi: Stop ignoring asi=on cmdline flag

Junaid Shahid (4):
      mm: asi: Make __get_current_cr3_fast() ASI-aware
      mm: asi: ASI page table allocation functions
      mm: asi: Functions to map/unmap a memory range into ASI page tables
      mm: asi: Add basic infrastructure for global non-sensitive mappings

Ofir Weisse (1):
      mm: asi: asi_exit() on PF, skip handling if address is accessible

Reiji Watanabe (1):
      mm: asi: Map dynamic percpu memory as nonsensitive

Yosry Ahmed (2):
      mm: asi: Use separate PCIDs for restricted address spaces
      mm: asi: exit ASI before suspend-like operations

 arch/alpha/include/asm/Kbuild | 1 +
 arch/arc/include/asm/Kbuild | 1 +
 arch/arm/include/asm/Kbuild | 1 +
 arch/arm64/include/asm/Kbuild | 1 +
 arch/csky/include/asm/Kbuild | 1 +
 arch/hexagon/include/asm/Kbuild | 1 +
 arch/loongarch/include/asm/Kbuild | 3 +
 arch/m68k/include/asm/Kbuild | 1 +
 arch/microblaze/include/asm/Kbuild | 1 +
 arch/mips/include/asm/Kbuild | 1 +
 arch/nios2/include/asm/Kbuild | 1 +
 arch/openrisc/include/asm/Kbuild | 1 +
 arch/parisc/include/asm/Kbuild | 1 +
 arch/powerpc/include/asm/Kbuild | 1 +
 arch/riscv/include/asm/Kbuild | 1 +
 arch/s390/include/asm/Kbuild | 1 +
 arch/sh/include/asm/Kbuild | 1 +
 arch/sparc/include/asm/Kbuild | 1 +
 arch/um/include/asm/Kbuild | 2 +-
 arch/x86/Kconfig | 27 +
 arch/x86/boot/compressed/ident_map_64.c | 10 +
 arch/x86/boot/compressed/pgtable_64.c | 11 +
 arch/x86/include/asm/asi.h | 306 +++++++++
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/disabled-features.h | 8 +-
 arch/x86/include/asm/idtentry.h | 50 +-
 arch/x86/include/asm/kvm_host.h | 3 +
 arch/x86/include/asm/l1tf.h | 11 +
 arch/x86/include/asm/nospec-branch.h | 2 +
 arch/x86/include/asm/pgalloc.h | 6 +
 arch/x86/include/asm/pgtable_64.h | 4 +
 arch/x86/include/asm/processor-flags.h | 24 +
 arch/x86/include/asm/processor.h | 20 +-
 arch/x86/include/asm/pti.h | 6 +-
 arch/x86/include/asm/special_insns.h | 45 +-
 arch/x86/include/asm/tlbflush.h | 6 +
 arch/x86/kernel/process.c | 2 +
 arch/x86/kernel/process_32.c | 2 +-
 arch/x86/kernel/process_64.c | 2 +-
 arch/x86/kernel/traps.c | 22 +
 arch/x86/kvm/Kconfig | 1 +
 arch/x86/kvm/svm/svm.c | 2 +
 arch/x86/kvm/vmx/nested.c | 6 +
 arch/x86/kvm/vmx/vmx.c | 113 ++--
 arch/x86/kvm/x86.c | 81 ++-
 arch/x86/lib/Makefile | 1 +
 arch/x86/lib/l1tf.c | 96 +++
 arch/x86/lib/retpoline.S | 10 +
 arch/x86/mm/Makefile | 1 +
 arch/x86/mm/asi.c | 1039 ++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c | 124 +++-
 arch/x86/mm/init.c | 7 +-
 arch/x86/mm/init_64.c | 25 +-
 arch/x86/mm/mm_internal.h | 3 +
 arch/x86/mm/pti.c | 14 +-
 arch/x86/mm/tlb.c | 167 ++++-
 arch/x86/virt/svm/sev.c | 2 +-
 arch/xtensa/include/asm/Kbuild | 1 +
 drivers/firmware/efi/libstub/x86-5lvl.c | 2 +-
 include/asm-generic/asi.h | 113 ++++
 include/asm-generic/vmlinux.lds.h | 11 +
 include/linux/entry-common.h | 11 +
 include/linux/gfp.h | 5 +
 include/linux/gfp_types.h | 15 +-
 include/linux/mm_types.h | 7 +
 include/linux/page-flags.h | 18 +
 include/linux/pgtable.h | 3 +
 include/trace/events/mmflags.h | 12 +-
 init/main.c | 2 +
 kernel/entry/common.c | 1 +
 kernel/fork.c | 5 +
 kernel/sched/core.c | 9 +
 mm/init-mm.c | 4 +
 mm/internal.h | 2 +
 mm/mm_init.c | 1 +
 mm/page_alloc.c | 160 ++++-
 mm/percpu-vm.c | 50 +-
 mm/percpu.c | 4 +-
 mm/vmalloc.c | 53 +-
 tools/perf/builtin-kmem.c | 1 +
 80 files changed, 2582 insertions(+), 190 deletions(-)
---
base-commit: ebd6ea9c6976c64ed5af3e6dce672616447e8e62
change-id: 20241115-asi-rfc-v2-5d9bbb441186

Best regards,