From patchwork Thu Jun 29 20:50:35 2023
X-Patchwork-Submitter: Axel Rasmussen
X-Patchwork-Id: 13297311
Date: Thu, 29 Jun 2023 13:50:35 -0700
Message-ID: <20230629205040.665834-1-axelrasmussen@google.com>
Subject: [PATCH v2 1/6] mm: userfaultfd: add new UFFDIO_POISON ioctl
Howlett" , Miaohe Lin , Mike Kravetz , "Mike Rapoport (IBM)" , Muchun Song , Nadav Amit , Naoya Horiguchi , Peter Xu , Shuah Khan , ZhangPeng Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Axel Rasmussen X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 94D0820009 X-Stat-Signature: ditcq6tx6esudhzhdsqic67dnj6h1dc1 X-Rspam-User: X-HE-Tag: 1688071849-922908 X-HE-Meta: U2FsdGVkX19tW8MWJEgVUmEvfEV4i5cGYTQ/dwOITpXd2V0p7u1f9kpFHf2s9ufxF3k1lZlDfWKnoQUW3+YapSt+ucfGVojsB0t2hHESd5AKg2XpTkm6bjgGaJMqZZFYtQJtLHM0YcXbqOjuzBPXDLO9leui6UR5+9qvMrtPRtGpTHzuUb+Py1sO6CsHuJAMYv2txrattQc2TURuDZolZ1sGoLkt1fz5xI+fRS8gIsm+i2WN1alG+zc9eE89dN5uDK/Rul8NZfU11/etRTCuPz2c7phi3ZcTFXFLHJfQr0r3A860RkVDoWSkz6ao2vkdq8koJFtU2bfwa9ndMdKrwgFkNVw62uDmxOsRSLraOOL1XM5hrDDsehiJV2qNSOg8p/dowvtSxmnsj4mtZwDPs1rnRMqVNoLpv8zFkjFVu93cDJEWTZQA8tJgTr0FcUNiV8dwt138AdBnY/0eRAmyOYp2f+FszPAZdRB3aJox2j8hdC07imztM/n7034E8PiMJQ6KTodCeYXZ9vibfFNaFunZOTik8tEQNr7MzUci4VVVYvpdBUI1xXc1caZzGuMivxIGycrRQ3JwD//6rMOnXXmHU/8Y+DeJshICBRi3VmuV1DInjsJBnCuTD7kp3fl5o7UXFL7k9zgAgBDZInEWESmS8cuwlI2m+GM5enA5d/T5NkEbbwlBxhn56Jy63HzbT4K2K2xHJARgTiXC9ssSNZqQkTli71xp1pWnKlDpD4VTHSZ7a2cbxcVq+ov9JS8ZJpkAAizXBWSnw0bnV/U8V+t7RiP7/qu7bA1XhpySaTmpkDgT1RvacRQlQee7J1xG7DuWZ3zuwQrSaloYrVUN43aKm1n7lfMqPrZuskHwosgG4en+Xo8E3FqAStwTLD/5dwBHTXIIrtl3joLABXri0xF6yJtCb8TpWNm/6IRdkSrLkiwur9m/s4M/mTYtA5H3a2E3JT4R3sTzgMi9BTS REQn50dh A2QLTg1rLxdcfSo7F+W0V9yfmBGyc5ZWrp0Ts6SBiZWCgESjAOjtcWwtEgor579xNxBs2J0xzplWLrFBwozCNI7hXopviAakaAktBm8GK/kdal0klc2zpl+WaxbtKTtfqpzCVpNiCr1YNrKX+hKepAusKnh00c4UiHE51r9mdLkOisZaePdHYwdpaCbQPtVcGFhgMEm49rVjHPiyjX86Ps2tdTFHGSoD0ms/Cqp7pH71+IwnjNoeGsvEt81XjtjPrDqrDTJ+PG7v+donvJ+jmIboQJwR1kfN/zcy2dtHWbqa25aSsgWCO1w+JZ7WJJ//nL4yJH0eYOYhZWPCHJHPErvJnlWAjpl9AR3vesC4erg1eKBZZBeRtN7L+cEunc/RfqX3vUsdW/wGkKau14hPHlzAkXcjhJuZa/hOfym86dnS4UPl3eg749wwZ//cEU26uKlNoU2lD/snCXsnita867/jslqY8arDd4oiC2Q2tIN2Us3S7Uie9fmfisf7otRJEyyqBmfTilZEf5DPVioUwF7u1typJGq6iCETkdiu4siae9jk7J98TA0132MS/5d83xXCY5hnWl6itmw63WbIpJ8FNdgETKeflvUnwXm4xDMvTytvNr6j/zNm7w0ASBUris6wxt2MlTBVKXO5cTMJOWXI4M8qWWHosdX0UnMjdjUUHZ/wHJGpFiH68Fn313VBVS6+5XIr/Ay6GuFwvsAGQS2k+Og== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages. When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not. So the basic way to use this new feature is: - On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose). - On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned. - If so, we can respond with a UFFDIO_POISON - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly. UFFDIO_POISON does not handle unmapping previously-present PTEs. 
UFFDIO_POISON does not handle unmapping previously-present PTEs. This
isn't needed, because during live migration we want to intercept all
accesses with userfaultfd (not just writes, so WP mode isn't useful for
this). So whether minor or missing mode is being used (or both), the
PTE won't be present in any case, so handling that case isn't needed.

Why return VM_FAULT_HWPOISON instead of VM_FAULT_SIGBUS when one of
these markers is encountered? For "normal" userspace programs there
isn't a big difference, both yield a SIGBUS. The difference for KVM is
key though: VM_FAULT_HWPOISON will result in an MCE being injected into
the guest (which is the behavior we want). With VM_FAULT_SIGBUS, the
hypervisor would need to catch the SIGBUS and deal with the MCE
injection itself.
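As an aside, the SIGBUS-only alternative can be sketched as follows
(again illustrative, not part of this patch). The VMM would have to
install a SIGBUS handler and key off the siginfo fields;
inject_mce_into_guest() is a hypothetical stand-in for the VMM's
injection logic. The existing hwpoison path reports poisoned accesses
as BUS_MCEERR_AR, whereas a plain VM_FAULT_SIGBUS surfaces as an
ordinary BUS_ADRERR that is hard to tell apart from other mapping
errors:

#include <signal.h>
#include <unistd.h>

extern void inject_mce_into_guest(void *addr); /* hypothetical stand-in */

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
        (void)sig;
        (void)ucontext;

        if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO)
                inject_mce_into_guest(info->si_addr); /* memory error */
        else
                _exit(1); /* ordinary SIGBUS, e.g. BUS_ADRERR */
}

static void install_sigbus_handler(void)
{
        struct sigaction sa = { .sa_flags = SA_SIGINFO };

        sa.sa_sigaction = sigbus_handler;
        sigaction(SIGBUS, &sa, NULL);
}

Returning VM_FAULT_HWPOISON lets KVM short-circuit all of this and
inject the MCE directly.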
Signed-off-by: Axel Rasmussen
---
 fs/userfaultfd.c                 | 63 ++++++++++++++++++++++++++++++++
 include/linux/swapops.h          |  3 +-
 include/linux/userfaultfd_k.h    |  4 ++
 include/uapi/linux/userfaultfd.h | 25 +++++++++++--
 mm/memory.c                      |  4 ++
 mm/userfaultfd.c                 | 62 ++++++++++++++++++++++++++++++-
 6 files changed, 156 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..c26a883399c9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1965,6 +1965,66 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	return ret;
 }
 
+static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_poison uffdio_poison;
+	struct uffdio_poison __user *user_uffdio_poison;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_poison = (struct uffdio_poison __user *)arg;
+
+	ret = -EAGAIN;
+	if (atomic_read(&ctx->mmap_changing))
+		goto out;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_poison, user_uffdio_poison,
+			   /* don't copy the output fields */
+			   sizeof(uffdio_poison) - (sizeof(__s64))))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_poison.range.start,
+			     uffdio_poison.range.len);
+	if (ret)
+		goto out;
+
+	ret = -EINVAL;
+	/* double check for wraparound just in case. */
+	if (uffdio_poison.range.start + uffdio_poison.range.len <=
+	    uffdio_poison.range.start) {
+		goto out;
+	}
+	if (uffdio_poison.mode & ~UFFDIO_POISON_MODE_DONTWAKE)
+		goto out;
+
+	if (mmget_not_zero(ctx->mm)) {
+		ret = mfill_atomic_poison(ctx->mm, uffdio_poison.range.start,
+					  uffdio_poison.range.len,
+					  &ctx->mmap_changing, 0);
+		mmput(ctx->mm);
+	} else {
+		return -ESRCH;
+	}
+
+	if (unlikely(put_user(ret, &user_uffdio_poison->updated)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_poison.mode & UFFDIO_POISON_MODE_DONTWAKE)) {
+		range.start = uffdio_poison.range.start;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_poison.range.len ? 0 : -EAGAIN;
+
+out:
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -2066,6 +2126,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_CONTINUE:
 		ret = userfaultfd_continue(ctx, arg);
 		break;
+	case UFFDIO_POISON:
+		ret = userfaultfd_poison(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4c932cb45e0b..8259fee32421 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -394,7 +394,8 @@ typedef unsigned long pte_marker;
 #define  PTE_MARKER_UFFD_WP		BIT(0)
 #define  PTE_MARKER_SWAPIN_ERROR	BIT(1)
-#define  PTE_MARKER_MASK		(BIT(2) - 1)
+#define  PTE_MARKER_UFFD_POISON		BIT(2)
+#define  PTE_MARKER_MASK		(BIT(3) - 1)
 
 static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
 {
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac7b0c96d351..ac8c6854097c 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -46,6 +46,7 @@ enum mfill_atomic_mode {
 	MFILL_ATOMIC_COPY,
 	MFILL_ATOMIC_ZEROPAGE,
 	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
 	NR_MFILL_ATOMIC_MODES,
 };
 
@@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
 extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
 				     unsigned long len, atomic_t *mmap_changing,
 				     uffd_flags_t flags);
+extern ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+				   unsigned long len, atomic_t *mmap_changing,
+				   uffd_flags_t flags);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, atomic_t *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 66dd4cd277bd..62151706c5a3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -39,7 +39,8 @@
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
-			   UFFD_FEATURE_WP_UNPOPULATED)
+			   UFFD_FEATURE_WP_UNPOPULATED |	\
+			   UFFD_FEATURE_POISON)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -49,12 +50,14 @@
 	 (__u64)1 << _UFFDIO_COPY |		\
 	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
 	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
-	 (__u64)1 << _UFFDIO_CONTINUE)
+	 (__u64)1 << _UFFDIO_CONTINUE |		\
+	 (__u64)1 << _UFFDIO_POISON)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
 	 (__u64)1 << _UFFDIO_CONTINUE |		\
-	 (__u64)1 << _UFFDIO_WRITEPROTECT)
+	 (__u64)1 << _UFFDIO_POISON)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -71,6 +74,7 @@
 #define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_WRITEPROTECT		(0x06)
 #define _UFFDIO_CONTINUE		(0x07)
+#define _UFFDIO_POISON			(0x08)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -91,6 +95,8 @@
 				      struct uffdio_writeprotect)
 #define UFFDIO_CONTINUE		_IOWR(UFFDIO, _UFFDIO_CONTINUE,	\
 				      struct uffdio_continue)
+#define UFFDIO_POISON		_IOWR(UFFDIO, _UFFDIO_POISON,	\
+				      struct uffdio_poison)
 
 /* read() structure */
 struct uffd_msg {
@@ -225,6 +231,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
 #define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
+#define UFFD_FEATURE_POISON			(1<<14)
 	__u64 features;
 
 	__u64 ioctls;
@@ -321,6 +328,18 @@ struct uffdio_continue {
 	__s64 mapped;
 };
 
+struct uffdio_poison {
+	struct uffdio_range range;
+#define UFFDIO_POISON_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * Fields below here are written by the ioctl and must be at the end:
+	 * the copy_from_user will not read past here.
+	 */
+	__s64 updated;
+};
+
 /*
  * Flags for the userfaultfd(2) system call itself.
  */
diff --git a/mm/memory.c b/mm/memory.c
index d8a9a770b1f1..7fbda39e060d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3692,6 +3692,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (WARN_ON_ONCE(!marker))
 		return VM_FAULT_SIGBUS;
 
+	/* Poison emulation explicitly requested for this PTE. */
+	if (marker & PTE_MARKER_UFFD_POISON)
+		return VM_FAULT_HWPOISON;
+
 	/* Higher priority than uffd-wp when data corrupted */
 	if (marker & PTE_MARKER_SWAPIN_ERROR)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a2bf37ee276d..87b62ca1e09e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -286,6 +286,51 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 	goto out;
 }
 
+/* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
+static int mfill_atomic_pte_poison(pmd_t *dst_pmd,
+				   struct vm_area_struct *dst_vma,
+				   unsigned long dst_addr,
+				   uffd_flags_t flags)
+{
+	int ret;
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+
+	_dst_pte = make_pte_marker(PTE_MARKER_UFFD_POISON);
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	if (vma_is_shmem(dst_vma)) {
+		struct inode *inode;
+		pgoff_t offset, max_off;
+
+		/* serialize against truncate with the page table lock */
+		inode = dst_vma->vm_file->f_inode;
+		offset = linear_page_index(dst_vma, dst_addr);
+		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+		ret = -EFAULT;
+		if (unlikely(offset >= max_off))
+			goto out_unlock;
+	}
+
+	ret = -EEXIST;
+	/*
+	 * For now, we don't handle unmapping pages, so only support filling in
+	 * none PTEs, or replacing PTE markers.
+	 */
+	if (!pte_none_mostly(*dst_pte))
+		goto out_unlock;
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
@@ -336,8 +381,12 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	 * supported by hugetlb. A PMD_SIZE huge pages may exist as used
 	 * by THP. Since we can not reliably insert a zero page, this
 	 * feature is not supported.
+	 *
+	 * PTE marker handling for hugetlb is a bit special, so for now
+	 * UFFDIO_POISON is not supported.
 	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
+	    uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
 		mmap_read_unlock(dst_mm);
 		return -EINVAL;
 	}
@@ -481,6 +530,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
 		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
 						 dst_addr, flags);
+	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
+		return mfill_atomic_pte_poison(dst_pmd, dst_vma,
+					       dst_addr, flags);
 	}
 
 	/*
@@ -702,6 +754,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
 			      uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
 }
 
+ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+			    unsigned long len, atomic_t *mmap_changing,
+			    uffd_flags_t flags)
+{
+	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+			    uffd_flags_set_mode(flags, MFILL_ATOMIC_POISON));
+}
+
 long uffd_wp_range(struct vm_area_struct *dst_vma,
 		   unsigned long start, unsigned long len, bool enable_wp)
 {
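Finally, a note on discoverability: userspace can probe for this new
capability during the usual UFFDIO_API handshake. A minimal sketch
(illustration only), assuming the uapi additions above are present in
the installed headers:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns a uffd with UFFD_FEATURE_POISON negotiated, or -1. */
static int open_uffd_with_poison(void)
{
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_POISON,
        };
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        if (uffd < 0)
                return -1;
        /* Kernels without this series reject the unknown feature bit. */
        if (ioctl(uffd, UFFDIO_API, &api)) {
                close(uffd);
                return -1;
        }
        return uffd;
}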