From patchwork Wed Dec 21 14:23:22 2022
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 13078827
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, patches@lists.linux.dev, tglx@linutronix.de
Cc: "Jason A. Donenfeld", linux-crypto@vger.kernel.org, linux-api@vger.kernel.org,
    x86@kernel.org, Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
    Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner, linux-mm@kvack.org
Subject: [PATCH v13 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings
Date: Wed, 21 Dec 2022 15:23:22 +0100
Message-Id: <20221221142327.126451-3-Jason@zx2c4.com>
In-Reply-To: <20221221142327.126451-1-Jason@zx2c4.com>
References: <20221221142327.126451-1-Jason@zx2c4.com>
The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:

- It shouldn't be written to core dumps.
  * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
  * Easy: VM_WIPEONFORK.
- It shouldn't be written to swap.
  * Uh-oh: mlock is rlimited.
  * Uh-oh: mlock isn't inherited by forks.
- It shouldn't reserve actual memory, but it also shouldn't crash when
  page faulting in memory if none is available.
  * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
  * Uh-oh: VM_NORESERVE means segfaults.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
   having the contents of the pages it reads zeroed out midway through
   the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
   we have the option to fall back to the getrandom() syscall, and
   everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
   60 seconds -- a sort of cache, rather than a long term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which has
the following semantics:

a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
   zero when read back again).
c) If there's not enough memory to service a page fault, it's not fatal,
   and no signal is sent. Instead, writes are simply lost.
d) It is inherited by fork.
e) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.

This way, allocations used by vDSO getrandom() can use:

    VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with OOMing, crashing on overcommitment,
using memory when not in use, not wiping on fork(), coredumps, or
writing out to swap.

At the moment, rather than skipping writes on OOM, the fault handler
just returns to userspace, and the instruction is retried. This isn't
terrible, but it's not quite what is intended. The actual instruction
skipping has to be implemented arch-by-arch, but so does this whole vDSO
series, so that's fine. The following commit addresses it for x86.

Cc: linux-mm@kvack.org
Signed-off-by: Jason A. Donenfeld
---
 fs/proc/task_mmu.c             | 3 +++
 include/linux/mm.h             | 8 ++++++++
 include/trace/events/mmflags.h | 7 +++++++
 mm/Kconfig                     | 3 +++
 mm/memory.c                    | 4 ++++
 mm/mempolicy.c                 | 3 +++
 mm/mprotect.c                  | 2 +-
 mm/rmap.c                      | 5 +++--
 8 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..47c7c046f2be 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_NEED_VM_DROPPABLE
+		[ilog2(VM_DROPPABLE)]	= "dp",
+#endif
 	};
 	size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..fba3f1e8616b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,11 +315,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -335,6 +337,12 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */

+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define VM_DROPPABLE	VM_HIGH_ARCH_5
+#else
+# define VM_DROPPABLE	0
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT	VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..82b2fb811d06 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -163,6 +163,12 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif

+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_DROPPABLE(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -195,6 +201,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
+IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\

 #define show_vma_flags(flags)						\
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..91fd0be96ca4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1030,6 +1030,9 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config NEED_VM_DROPPABLE
+	select ARCH_USES_HIGH_VMA_FLAGS
+	bool
 config ARCH_USES_PG_ARCH_X
 	bool
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..1ade407ccbf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5220,6 +5220,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	lru_gen_exit_fault();

+	/* If the mapping is droppable, then errors due to OOM aren't fatal. */
+	if (vma->vm_flags & VM_DROPPABLE)
+		ret &= ~VM_FAULT_OOM;
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 61aa9aedb728..5aeb85bc9627 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2172,6 +2172,9 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 	int preferred_nid;
 	nodemask_t *nmask;

+	if (vma->vm_flags & VM_DROPPABLE)
+		gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
 	pol = get_vma_policy(vma, addr);

 	if (pol->mode == MPOL_INTERLEAVE) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 908df12caa26..a679cc5d1c75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -593,7 +593,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    may_expand_vm(mm, oldflags, nrpages))
 			return -ENOMEM;
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
-				  VM_SHARED|VM_NORESERVE))) {
+				  VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
 			charged = nrpages;
 			if (security_vm_enough_memory_mm(mm, charged))
 				return -ENOMEM;
diff --git a/mm/rmap.c b/mm/rmap.c
index b616870a09be..5ed46e59dfcd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1294,7 +1294,8 @@ void page_add_new_anon_rmap(struct page *page,
 	int nr;

 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-	__SetPageSwapBacked(page);
+	if (!(vma->vm_flags & VM_DROPPABLE))
+		__SetPageSwapBacked(page);

 	if (likely(!PageCompound(page))) {
 		/* increment count (starts at -1) */
@@ -1683,7 +1684,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * plus the rmap(s) (dropped by discard:).
 			 */
 			if (ref_count == 1 + map_count &&
-			    !folio_test_dirty(folio)) {
+			    (!folio_test_dirty(folio) || (vma->vm_flags & VM_DROPPABLE))) {
 				/* Invalidate as we cleared the pte */
 				mmu_notifier_invalidate_range(mm, address,
 							      address + PAGE_SIZE);

From patchwork Wed Dec 21 14:23:23 2022
X-Patchwork-Submitter: "Jason A. Donenfeld"
X-Patchwork-Id: 13078828
From: "Jason A. Donenfeld"
To: linux-kernel@vger.kernel.org, patches@lists.linux.dev, tglx@linutronix.de
Cc: "Jason A. Donenfeld", linux-crypto@vger.kernel.org, linux-api@vger.kernel.org,
    x86@kernel.org, Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
    Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner, linux-mm@kvack.org
Subject: [PATCH v13 3/7] x86: mm: Skip faulting instruction for VM_DROPPABLE faults
Date: Wed, 21 Dec 2022 15:23:23 +0100
Message-Id: <20221221142327.126451-4-Jason@zx2c4.com>
In-Reply-To: <20221221142327.126451-1-Jason@zx2c4.com>
References: <20221221142327.126451-1-Jason@zx2c4.com>
The prior commit introduced VM_DROPPABLE, but in a limited form where
the faulting instruction was retried instead of skipped. Finish that up
with the platform-specific aspect of skipping the actual instruction.

This works by copying userspace's %rip to a stack buffer of size
MAX_INSN_SIZE, decoding it, and then adding the length of the decoded
instruction to userspace's %rip. In the event any of these steps fail,
just fall back to not advancing %rip and trying again.

Cc: linux-mm@kvack.org
Signed-off-by: Jason A. Donenfeld
---
 arch/x86/mm/fault.c      | 19 +++++++++++++++++++
 include/linux/mm_types.h |  5 ++++-
 mm/memory.c              |  4 +++-
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..76ca99ab6eb7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,8 @@
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
 #include <asm/irq_stack.h>
+#include <asm/insn.h>			/* struct insn			*/
+#include <asm/insn-eval.h>		/* insn_fetch_from_user(), ...	*/

 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -1454,6 +1456,23 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}

 	mmap_read_unlock(mm);

+	if (fault & VM_FAULT_SKIP_INSN) {
+		u8 buf[MAX_INSN_SIZE];
+		struct insn insn;
+		int nr_copied;
+
+		nr_copied = insn_fetch_from_user(regs, buf);
+		if (nr_copied <= 0)
+			return;
+
+		if (!insn_decode_from_regs(&insn, regs, buf, nr_copied))
+			return;
+
+		regs->ip += insn.length;
+		return;
+	}
+
 	if (likely(!(fault & VM_FAULT_ERROR)))
 		return;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..e76ab9ad555c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -945,6 +945,7 @@ typedef __bitwise unsigned int vm_fault_t;
 *			fsync() to complete (for synchronous page faults
 *			in DAX)
 * @VM_FAULT_COMPLETED:	->fault completed, meanwhile mmap lock released
+ * @VM_FAULT_SKIP_INSN:	->handle the fault by skipping faulting instruction
 * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
 *
 */
@@ -962,6 +963,7 @@ enum vm_fault_reason {
 	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
 	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
 	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
+	VM_FAULT_SKIP_INSN      = (__force vm_fault_t)0x008000,
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
 };

@@ -985,7 +987,8 @@ enum vm_fault_reason {
 	{ VM_FAULT_RETRY,               "RETRY" },	\
 	{ VM_FAULT_FALLBACK,            "FALLBACK" },	\
 	{ VM_FAULT_DONE_COW,            "DONE_COW" },	\
-	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
+	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },	\
+	{ VM_FAULT_SKIP_INSN,           "SKIP_INSN" }

 struct vm_special_mapping {
 	const char *name;	/* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index 1ade407ccbf9..62ba9b7b713e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5221,8 +5221,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	lru_gen_exit_fault();

 	/* If the mapping is droppable, then errors due to OOM aren't fatal. */
-	if (vma->vm_flags & VM_DROPPABLE)
+	if ((ret & VM_FAULT_OOM) && (vma->vm_flags & VM_DROPPABLE)) {
 		ret &= ~VM_FAULT_OOM;
+		ret |= VM_FAULT_SKIP_INSN;
+	}

 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();