From patchwork Wed Jun 22 18:50:36 2022
X-Patchwork-Submitter: Nadav Amit
X-Patchwork-Id: 12891717
From: Nadav Amit
To: linux-mm@kvack.org
Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
 Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport
Subject: [PATCH v1 3/5] userfaultfd: introduce write-likely mode for uffd operations
Date: Wed, 22 Jun 2022 11:50:36 -0700
Message-Id: <20220622185038.71740-4-namit@vmware.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220622185038.71740-1-namit@vmware.com>
References: <20220622185038.71740-1-namit@vmware.com>

From: Nadav Amit

Either always setting the dirty bit or always leaving it clear does not
seem like the best policy. Leaving the bit clear introduces overhead on
the first write access, which is required to set the bit. Setting the
bit for pages that are eventually not written can require more TLB
flushes.

Let the userfaultfd users control whether PTEs are marked as dirty or
clean. Introduce UFFDIO_[op]_MODE_WRITE_LIKELY to enable userspace to
indicate whether pages are likely to be written, and set the dirty bit
accordingly.
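As a usage sketch (not part of the patch itself), a monitor that
resolves a missing-page fault and expects the faulting thread to write
to the page right away could pass both hints with UFFDIO_COPY. The
helper name and its parameters below are illustrative;
UFFDIO_COPY_MODE_ACCESS_LIKELY comes from the previous patch in this
series, and the snippet assumes a kernel with the series applied and a
uffd already registered over the destination range:

  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/userfaultfd.h>

  /*
   * Illustrative helper (hypothetical name): resolve a missing-page
   * fault with UFFDIO_COPY while hinting that the page will be both
   * accessed and written soon, so the kernel can install a young,
   * dirty PTE and spare the extra work of setting the bits later.
   */
  static int copy_page_write_likely(int uffd, unsigned long fault_addr,
                                    void *src_page, unsigned long page_size)
  {
          struct uffdio_copy copy;

          memset(&copy, 0, sizeof(copy));
          copy.dst = fault_addr & ~(page_size - 1);  /* page-align dst */
          copy.src = (unsigned long)src_page;
          copy.len = page_size;
          copy.mode = UFFDIO_COPY_MODE_ACCESS_LIKELY |
                      UFFDIO_COPY_MODE_WRITE_LIKELY;

          /* 0 on success; copy.copy holds bytes copied or a negated error */
          return ioctl(uffd, UFFDIO_COPY, &copy);
  }

When UFFDIO_COPY_MODE_WRITE_LIKELY is not passed, the PTE is installed
clean and the dirty bit is set only when the page is actually written,
as described above.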
Cc: Mike Kravetz
Cc: Hugh Dickins
Cc: Andrew Morton
Cc: Axel Rasmussen
Cc: Peter Xu
Cc: David Hildenbrand
Cc: Mike Rapoport
Signed-off-by: Nadav Amit
---
 fs/userfaultfd.c                 | 20 ++++++++++++++++----
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 13 ++++++++++++-
 mm/hugetlb.c                     |  3 +++
 mm/shmem.c                       |  3 +++
 mm/userfaultfd.c                 | 13 ++++++++++---
 6 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index abf176bd0349..13d73e37e230 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1727,7 +1727,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
-				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
+				 UFFDIO_COPY_MODE_WRITE_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
@@ -1735,6 +1736,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
 	if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
 		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	if (uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY)
+		uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
@@ -1787,11 +1790,14 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 		goto out;
 	ret = -EINVAL;
 	if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
-				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
+				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY|
+				     UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY))
 		goto out;
 
 	if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
 		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY)
+		uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
@@ -1843,7 +1849,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP |
-			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
+			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
@@ -1855,6 +1862,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
 	if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)
 		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY)
+		uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1908,11 +1917,14 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 		goto out;
 	}
 	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE|
-				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY))
+				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY|
+				     UFFDIO_CONTINUE_MODE_WRITE_LIKELY))
 		goto out;
 
 	if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)
 		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_WRITE_LIKELY)
+		uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index af268b2c2b27..59c43ea502e7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -60,6 +60,7 @@ typedef unsigned int __bitwise uffd_flags_t;
 #define UFFD_FLAGS_NONE			((__force uffd_flags_t)0)
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
 #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
+#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index ff7150c878bb..7b6ab0b43475 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -206,7 +206,7 @@ struct uffdio_api {
 	 * write-protection mode is supported on both shmem and hugetlbfs.
 	 *
 	 * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
-	 * support the UFFDIO_*_MODE_ACCESS_LIKELY hints.
+	 * support the UFFDIO_*_MODE_[ACCESS|WRITE]_LIKELY hints.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -261,9 +261,13 @@ struct uffdio_copy {
 	 * page is likely to be access in the near future. Providing the hint
 	 * properly can improve performance.
 	 *
+	 * UFFDIO_COPY_MODE_WRITE_LIKELY provides a hint to the kernel that the
+	 * page is likely to be written in the near future. Providing the hint
+	 * properly can improve performance.
 	 */
 #define UFFDIO_COPY_MODE_WP		((__u64)1<<1)
 #define UFFDIO_COPY_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_COPY_MODE_WRITE_LIKELY	((__u64)1<<3)
 	__u64 mode;
 
 	/*
@@ -277,6 +281,7 @@ struct uffdio_zeropage {
 	struct uffdio_range range;
 #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
 #define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<1)
+#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY	((__u64)1<<2)
 	__u64 mode;
 
 	/*
@@ -300,6 +305,10 @@ struct uffdio_writeprotect {
 	 * that the page is likely to be access in the near future. Providing
 	 * the hint properly can improve performance.
 	 *
+	 * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: provides a hint to the kernel
+	 * that the page is likely to be written in the near future. Providing
+	 * the hint properly can improve performance.
+	 *
 	 * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
 	 * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
 	 * protection (WP=0) in response to a page fault wakes the faulting
@@ -308,6 +317,7 @@ struct uffdio_writeprotect {
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
 #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
 	__u64 mode;
 };
 
@@ -315,6 +325,7 @@ struct uffdio_continue {
 	struct uffdio_range range;
 #define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
 #define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY	((__u64)1<<1)
+#define UFFDIO_CONTINUE_MODE_WRITE_LIKELY	((__u64)1<<2)
 	__u64 mode;
 
 	/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2beff8a4bf7c..46814fc7762f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5962,6 +5962,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
diff --git a/mm/shmem.c b/mm/shmem.c
index 89c775275bae..7488cd186c32 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2404,6 +2404,9 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	VM_BUG_ON(PageSwapBacked(page));
 	__SetPageLocked(page);
 	__SetPageSwapBacked(page);
+
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
 	__SetPageUptodate(page);
 
 	ret = -EFAULT;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5051b9028722..6e767f1e7007 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,7 +70,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
 
@@ -83,14 +82,19 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		writable = false;
 	}
 
-	if (writable)
+	if (writable) {
 		_dst_pte = pte_mkwrite(_dst_pte);
-	else
+
+		/* Marking RO entries as dirty can mess with other code */
+		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
+			_dst_pte = pte_mkdirty(_dst_pte);
+	} else {
 		/*
 		 * We need this to make sure write bit removed; as mk_pte()
 		 * could return a pte with write bit set.
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
+	}
 
 	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
 		_dst_pte = pte_mkyoung(_dst_pte);
@@ -180,6 +184,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before