From patchwork Sun Jun 19 23:34:47 2022
X-Patchwork-Submitter: Nadav Amit
X-Patchwork-Id: 12887051
From: Nadav Amit
To: linux-mm@kvack.org
Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
    Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport
Subject: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for
 copy/wp operations
Date: Sun, 19 Jun 2022 16:34:47 -0700
Message-Id: <20220619233449.181323-4-namit@vmware.com>
In-Reply-To: <20220619233449.181323-1-namit@vmware.com>
References: <20220619233449.181323-1-namit@vmware.com>
MIME-Version: 1.0

From: Nadav Amit

Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
mfill_atomic_install_pte") has set PTEs as dirty, as its title indicates.
However, setting read-only PTEs as dirty can have several undesired
implications.

First, setting read-only PTEs as dirty can cause these PTEs to become
writable during the mprotect() syscall. See in change_pte_range():

	/* Avoid taking write faults for known dirty pages */
	if (dirty_accountable && pte_dirty(ptent) &&
			(pte_soft_dirty(ptent) ||
			 !(vma->vm_flags & VM_SOFTDIRTY))) {
		ptent = pte_mkwrite(ptent);
	}

Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
See try_to_unmap_one():

		/*
		 * Page is dirty. Flush the TLB if a writable entry
		 * potentially exists to avoid CPU writes after IO
		 * starts and then write it out here.
		 */
		try_to_unmap_flush_dirty();

Similarly, batching TLB flushes might be prevented in zap_pte_range():

			if (!PageAnon(page)) {
				if (pte_dirty(ptent)) {
					force_flush = 1;
					set_page_dirty(page);
				}
				...

In general, setting a PTE as dirty for read-only entries seems dangerous.
It should also be remembered that the Dirty-COW vulnerability mitigation
relies on the dirty bit being set only after COW (although it does not
appear to apply to userfaultfd).

To summarize, setting the dirty bit for read-only PTEs is dangerous. But
even if we only consider writable pages, always setting the dirty bit or
always leaving it clear does not seem to be the best policy. Leaving the
bit clear introduces overhead on the first write-access to set the bit.
Setting the bit for pages that are eventually not written to can require
more TLB flushes.

Let the userfaultfd users control whether PTEs are marked as dirty or
clean.
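The resulting translation from the uAPI mode bits to the kernel-internal
uffd_flags (as performed in userfaultfd_copy() below) can be sketched as
standalone C. The constant values mirror this patch; parse_copy_mode() is a
hypothetical helper for illustration only — in the kernel, an unknown mode
bit makes the ioctl fail with -EINVAL, modeled here as a -1 return:

```c
#include <assert.h>
#include <stdint.h>

/* uapi mode bits, values as defined by this patch */
#define UFFDIO_COPY_MODE_DONTWAKE      ((uint64_t)1 << 0)
#define UFFDIO_COPY_MODE_WP            ((uint64_t)1 << 1)
#define UFFDIO_COPY_MODE_ACCESS_LIKELY ((uint64_t)1 << 2)
#define UFFDIO_COPY_MODE_WRITE_LIKELY  ((uint64_t)1 << 3)

/* internal flag bits, as in userfaultfd_k.h after this patch */
#define UFFD_FLAGS_WP            (1u << 0)
#define UFFD_FLAGS_ACCESS_LIKELY (1u << 1)
#define UFFD_FLAGS_WRITE_LIKELY  (1u << 2)

/*
 * Hypothetical helper mirroring the checks in userfaultfd_copy():
 * returns -1 for an unknown mode bit, else the internal uffd_flags.
 * DONTWAKE only affects waking, so it contributes no flag bit.
 */
static int parse_copy_mode(uint64_t mode)
{
	unsigned int flags = 0;

	if (mode & ~(UFFDIO_COPY_MODE_DONTWAKE | UFFDIO_COPY_MODE_WP |
		     UFFDIO_COPY_MODE_ACCESS_LIKELY |
		     UFFDIO_COPY_MODE_WRITE_LIKELY))
		return -1;

	flags |= (mode & UFFDIO_COPY_MODE_WP) ? UFFD_FLAGS_WP : 0;
	flags |= (mode & UFFDIO_COPY_MODE_ACCESS_LIKELY) ?
		 UFFD_FLAGS_ACCESS_LIKELY : 0;
	flags |= (mode & UFFDIO_COPY_MODE_WRITE_LIKELY) ?
		 UFFD_FLAGS_WRITE_LIKELY : 0;
	return (int)flags;
}
```
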
Introduce UFFDIO_COPY_MODE_WRITE_LIKELY and
UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY to enable userspace to indicate
whether pages are likely to be written to, so that the dirty-bit is set
for those pages.

Cc: Mike Kravetz
Cc: Hugh Dickins
Cc: Andrew Morton
Cc: Axel Rasmussen
Cc: Peter Xu
Cc: David Hildenbrand
Cc: Mike Rapoport
Signed-off-by: Nadav Amit
---
 fs/userfaultfd.c                 | 22 ++++++++++++++--------
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 27 +++++++++++++++++++--------
 mm/hugetlb.c                     |  3 +++
 mm/shmem.c                       |  3 +++
 mm/userfaultfd.c                 | 11 +++++++++--
 6 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 35a8c4347c54..a56983b594d5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1700,7 +1700,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	struct uffdio_copy uffdio_copy;
 	struct uffdio_copy __user *user_uffdio_copy;
 	struct userfaultfd_wake_range range;
-	bool mode_wp, mode_access_likely;
+	bool mode_wp, mode_access_likely, mode_write_likely;
 	uffd_flags_t uffd_flags;
 
 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
@@ -1727,14 +1727,17 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
-				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
+				 UFFDIO_COPY_MODE_WRITE_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
 	mode_access_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY;
+	mode_write_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY;
 
 	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
-		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
+		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
+		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
@@ -1819,7 +1822,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect uffdio_wp;
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
-	bool mode_wp, mode_dontwake, mode_access_likely;
+	bool mode_wp, mode_dontwake, mode_access_likely, mode_write_likely;
 	uffd_flags_t uffd_flags;
 
 	if (atomic_read(&ctx->mmap_changing))
@@ -1838,18 +1841,21 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP |
-			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
+			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
 	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
 	mode_access_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
+	mode_write_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
 
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
 	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
-		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
+		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
+		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	    uffdio_continue.range.start) {
 		goto out;
 	}
-	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
 		goto out;
 
-	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
+	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e6ac165ec044..261a3fa750d0 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -59,6 +59,7 @@ typedef unsigned int __bitwise uffd_flags_t;
 
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
 #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
+#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index d9c8ce9ba777..6ad93a13282e 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -267,12 +267,20 @@ struct uffdio_copy {
 	 */
 	__s64 copy;
 	/*
-	 * UFFDIO_COPY_MODE_ACCESS_LIKELY will set the mapped page as young.
-	 * This can reduce the time that the first access to the page takes.
-	 * Yet, if set opportunistically to memory that is not used, it might
-	 * extend the time before the unused memory pages are reclaimed.
+	 * UFFDIO_COPY_MODE_ACCESS_LIKELY indicates that the memory is likely to
+	 * be accessed in the near future, in contrast to memory that is
+	 * opportunistically copied and might not be accessed. The kernel will
+	 * act accordingly, for instance by setting the access-bit in the PTE to
+	 * reduce the access time to the page.
+	 *
+	 * UFFDIO_COPY_MODE_WRITE_LIKELY indicates that the memory is likely to
+	 * be written to. The kernel will act accordingly, for instance by
+	 * setting the dirty-bit in the PTE to reduce the write time to the
+	 * page. This flag will be silently ignored if UFFDIO_COPY_MODE_WP is
+	 * set.
 	 */
-#define UFFDIO_COPY_MODE_ACCESS_LIKELY	((__u64)1<<3)
+#define UFFDIO_COPY_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_COPY_MODE_WRITE_LIKELY	((__u64)1<<3)
 };
 
 struct uffdio_zeropage {
@@ -297,9 +305,11 @@ struct uffdio_writeprotect {
 	 * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
 	 * any wait thread after the operation succeeds.
 	 *
-	 * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to mark the modified
-	 *	memory as young, which can reduce the time that the first access
-	 *	to the page takes.
+	 * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to indicate the memory
+	 *	is likely to be accessed in the near future.
+	 *
+	 * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: set the flag to indicate that the
+	 *	memory is likely to be written to in the near future.
 	 *
 	 * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
 	 * therefore DONTWAKE flag is meaningless with WP=1. Removing write
@@ -309,6 +319,7 @@ struct uffdio_writeprotect {
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
 #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
 	__u64 mode;
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2beff8a4bf7c..46814fc7762f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5962,6 +5962,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
diff --git a/mm/shmem.c b/mm/shmem.c
index 89c775275bae..7488cd186c32 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2404,6 +2404,9 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	VM_BUG_ON(PageSwapBacked(page));
 	__SetPageLocked(page);
 	__SetPageSwapBacked(page);
+
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
 	__SetPageUptodate(page);
 
 	ret = -EFAULT;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 140c8d3e946e..3172158d8faa 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,7 +70,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
 
@@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 
 	if (writable)
 		_dst_pte = pte_mkwrite(_dst_pte);
-	else
+	else {
 		/*
 		 * We need this to make sure write bit removed; as mk_pte()
 		 * could return a pte with write bit set.
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
+		/* Marking RO entries as dirty can mess with other code */
+		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
+			_dst_pte = pte_mkdirty(_dst_pte);
+	}
+
 	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
 		_dst_pte = pte_mkyoung(_dst_pte);
 
@@ -180,6 +184,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
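
As a usage sketch, a userspace monitor that write-unprotects a region it
expects to touch immediately might combine the new mode bits as follows.
This is illustrative only: struct wp_req is an abbreviated stand-in for
struct uffdio_writeprotect (the real struct embeds a struct uffdio_range),
and unprotect_hot() is a hypothetical helper, not part of any API:

```c
#include <assert.h>
#include <stdint.h>

/* uapi mode bits, values as defined by this patch */
#define UFFDIO_WRITEPROTECT_MODE_WP            ((uint64_t)1 << 0)
#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE      ((uint64_t)1 << 1)
#define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY ((uint64_t)1 << 2)
#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY  ((uint64_t)1 << 3)

/* Abbreviated stand-in for struct uffdio_writeprotect */
struct wp_req {
	uint64_t start;
	uint64_t len;
	uint64_t mode;
};

/*
 * Hypothetical helper: build a write-unprotect request (WP bit clear)
 * for a range the caller expects to read and write in the near future,
 * hinting the kernel to install young and dirty PTEs up front.
 */
static struct wp_req unprotect_hot(uint64_t start, uint64_t len)
{
	struct wp_req req = { .start = start, .len = len, .mode = 0 };

	req.mode |= UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
	req.mode |= UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
	return req;
}
```

The request would then be passed to the UFFDIO_WRITEPROTECT ioctl; with
WP=1 instead, the WRITE_LIKELY hint would be pointless, since the patch
never sets the dirty-bit together with write protection.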