From patchwork Thu Mar 10 14:08:59 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776402
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    qemu-devel@nongnu.org
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
    Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, x86@kernel.org, H. Peter Anvin, Hugh Dickins,
    Jeff Layton, J. Bruce Fields, Andrew Morton, Mike Rapoport,
    Steven Price, Maciej S. Szmigiero, Vlastimil Babka, Vishal Annapurve,
    Yu Zhang, Chao Peng, Kirill A. Shutemov, luto@kernel.org,
    jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
    david@redhat.com
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag Date: Thu, 10 Mar 2022 22:08:59 +0800 Message-Id: <20220310140911.50924-2-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: 7AEB710001D X-Rspam-User: Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=lEzIRTv+; dmarc=pass (policy=none) header.from=intel.com; spf=none (imf14.hostedemail.com: domain of chao.p.peng@linux.intel.com has no SPF policy when checking 134.134.136.100) smtp.mailfrom=chao.p.peng@linux.intel.com X-Stat-Signature: 39c5naobdky4ejxh1bkaskdn8o44db91 X-HE-Tag: 1646921385-475621 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: "Kirill A. Shutemov" Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. Since page migration/swapping is not yet supported for such usages so these pages are currently marked as UNMOVABLE and UNEVICTABLE which makes them behave like long-term pinned pages. The flag can not coexist with MFD_ALLOW_SEALING, future sealing is also impossible for a memfd created with this flag. At this time only shmem implements this flag. Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng --- include/linux/shmem_fs.h | 7 +++++ include/uapi/linux/memfd.h | 1 + mm/memfd.c | 26 +++++++++++++++-- mm/shmem.c | 57 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 88 insertions(+), 3 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index e65b80ed09e7..2dde843f28ef 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -12,6 +12,9 @@ /* inode in-kernel data */ +/* shmem extended flags */ +#define SHM_F_INACCESSIBLE 0x0001 /* prevent ordinary MMU access (e.g. 
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Chao Peng
---
 include/linux/shmem_fs.h   |  7 +++++
 include/uapi/linux/memfd.h |  1 +
 mm/memfd.c                 | 26 +++++++++++++++--
 mm/shmem.c                 | 57 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index e65b80ed09e7..2dde843f28ef 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,9 @@
 
 /* inode in-kernel data */
 
+/* shmem extended flags */
+#define SHM_F_INACCESSIBLE 0x0001 /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
+
 struct shmem_inode_info {
        spinlock_t              lock;
        unsigned int            seals;          /* shmem seals */
@@ -24,6 +27,7 @@ struct shmem_inode_info {
        struct shared_policy    policy;         /* NUMA memory alloc policy */
        struct simple_xattrs    xattrs;         /* list of xattrs */
        atomic_t                stop_eviction;  /* hold when working on inode */
+       unsigned int            xflags;         /* shmem extended flags */
        struct inode            vfs_inode;
 };
 
@@ -61,6 +65,9 @@ extern struct file *shmem_file_setup(const char *name,
                                     loff_t size, unsigned long flags);
 extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
                                            unsigned long flags);
+extern struct file *shmem_file_setup_xflags(const char *name, loff_t size,
+                                           unsigned long flags,
+                                           unsigned int xflags);
 extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
                const char *name, loff_t size, unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC             0x0001U
 #define MFD_ALLOW_SEALING       0x0002U
 #define MFD_HUGETLB             0x0004U
+#define MFD_INACCESSIBLE        0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 9f80f162791a..74d45a26cf5d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,16 +245,20 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+                      MFD_INACCESSIBLE)
 
 SYSCALL_DEFINE2(memfd_create,
                const char __user *, uname,
                unsigned int, flags)
 {
+       struct address_space *mapping;
        unsigned int *file_seals;
+       unsigned int xflags;
        struct file *file;
        int fd, error;
        char *name;
+       gfp_t gfp;
        long len;
 
        if (!(flags & MFD_HUGETLB)) {
@@ -267,6 +271,10 @@ SYSCALL_DEFINE2(memfd_create,
                        return -EINVAL;
        }
 
+       /* Disallow sealing when MFD_INACCESSIBLE is set. */
+       if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
+               return -EINVAL;
+
        /* length includes terminating zero */
        len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
        if (len <= 0)
@@ -301,8 +309,11 @@ SYSCALL_DEFINE2(memfd_create,
                                        HUGETLB_ANONHUGE_INODE,
                                        (flags >> MFD_HUGE_SHIFT) &
                                        MFD_HUGE_MASK);
-       } else
-               file = shmem_file_setup(name, 0, VM_NORESERVE);
+       } else {
+               xflags = flags & MFD_INACCESSIBLE ? SHM_F_INACCESSIBLE : 0;
+               file = shmem_file_setup_xflags(name, 0, VM_NORESERVE, xflags);
+       }
+
        if (IS_ERR(file)) {
                error = PTR_ERR(file);
                goto err_fd;
@@ -313,6 +324,15 @@ SYSCALL_DEFINE2(memfd_create,
        if (flags & MFD_ALLOW_SEALING) {
                file_seals = memfd_file_seals_ptr(file);
                *file_seals &= ~F_SEAL_SEAL;
+       } else if (flags & MFD_INACCESSIBLE) {
+               mapping = file_inode(file)->i_mapping;
+               gfp = mapping_gfp_mask(mapping);
+               gfp &= ~__GFP_MOVABLE;
+               mapping_set_gfp_mask(mapping, gfp);
+               mapping_set_unevictable(mapping);
+
+               file_seals = memfd_file_seals_ptr(file);
+               *file_seals = F_SEAL_SEAL;
        }
 
        fd_install(fd, file);
diff --git a/mm/shmem.c b/mm/shmem.c
index a09b29ec2b45..9b31a7056009 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1084,6 +1084,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
                    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
                        return -EPERM;
 
+               if (info->xflags & SHM_F_INACCESSIBLE) {
+                       if (oldsize)
+                               return -EPERM;
+                       if (!PAGE_ALIGNED(newsize))
+                               return -EINVAL;
+               }
+
                if (newsize != oldsize) {
                        error = shmem_reacct_size(SHMEM_I(inode)->flags,
                                        oldsize, newsize);
@@ -1331,6 +1338,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
                goto redirty;
        if (!total_swap_pages)
                goto redirty;
+       if (info->xflags & SHM_F_INACCESSIBLE)
+               goto redirty;
 
        /*
         * Our capabilities prevent regular writeback or sync from ever calling
@@ -2228,6 +2237,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
        if (ret)
                return ret;
 
+       if (info->xflags & SHM_F_INACCESSIBLE)
+               return -EPERM;
+
        /* arm64 - allow memory tagging on RAM-based files */
        vma->vm_flags |= VM_MTE_ALLOWED;
 
@@ -2433,6 +2445,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
                if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
                        return -EPERM;
        }
+       if (unlikely(info->xflags & SHM_F_INACCESSIBLE))
+               return -EPERM;
 
        ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
 
@@ -2517,6 +2531,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
                end_index = i_size >> PAGE_SHIFT;
                if (index > end_index)
                        break;
+
+               /*
+                * inode_lock protects setting up seals as well as writes to
+                * i_size. Setting SHM_F_INACCESSIBLE is only allowed with
+                * i_size == 0.
+                *
+                * Check SHM_F_INACCESSIBLE after i_size. It effectively
+                * serializes read vs. setting SHM_F_INACCESSIBLE without
+                * taking inode_lock in the read path.
+                */
+               if (SHMEM_I(inode)->xflags & SHM_F_INACCESSIBLE) {
+                       error = -EPERM;
+                       break;
+               }
+
                if (index == end_index) {
                        nr = i_size & ~PAGE_MASK;
                        if (nr <= offset)
@@ -2648,6 +2677,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
                        goto out;
                }
 
+               if ((info->xflags & SHM_F_INACCESSIBLE) &&
+                   (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) {
+                       error = -EINVAL;
+                       goto out;
+               }
+
                shmem_falloc.waitq = &shmem_falloc_waitq;
                shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
                shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
@@ -4082,6 +4117,28 @@ struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned lon
        return __shmem_file_setup(shm_mnt, name, size, flags, S_PRIVATE);
 }
 
+/**
+ * shmem_file_setup_xflags - get an unlinked file living in tmpfs with
+ *     additional xflags.
+ * @name: name for dentry (to be seen in /proc/<pid>/maps
+ * @size: size to be set for the file
+ * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size
+ * @xflags: SHM_F_INACCESSIBLE prevents ordinary MMU access to the file content
+ */
+struct file *shmem_file_setup_xflags(const char *name, loff_t size,
+                                    unsigned long flags, unsigned int xflags)
+{
+       struct shmem_inode_info *info;
+       struct file *res = __shmem_file_setup(shm_mnt, name, size, flags, 0);
+
+       if (!IS_ERR(res)) {
+               info = SHMEM_I(file_inode(res));
+               info->xflags = xflags & SHM_F_INACCESSIBLE;
+       }
+       return res;
+}
+
 /**
  * shmem_file_setup - get an unlinked file living in tmpfs
  * @name: name for dentry (to be seen in /proc/<pid>/maps
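Taken together, the shmem hooks above close off every ordinary access
path. A userspace probe of the resulting semantics might look like this
(hypothetical test code, assuming a kernel with this patch applied;
fallocate() comes first because read() on a zero-size file returns 0
before reaching the new check):

        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <unistd.h>
        #include <fcntl.h>      /* fallocate() */
        #include <errno.h>
        #include <assert.h>

        /* fd is a memfd created with MFD_INACCESSIBLE */
        void probe_inaccessible(int fd)
        {
                char buf[16] = { 0 };

                /* Backing pages must be allocated explicitly, page-aligned. */
                assert(fallocate(fd, 0, 0, 4096) == 0);

                /* write() reaches shmem_write_begin(): rejected with EPERM. */
                assert(write(fd, buf, sizeof(buf)) < 0 && errno == EPERM);

                /* read() reaches shmem_file_read_iter(): rejected with EPERM. */
                assert(read(fd, buf, sizeof(buf)) < 0 && errno == EPERM);

                /* mmap() reaches shmem_mmap(): rejected with EPERM. */
                assert(mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0)
                       == MAP_FAILED && errno == EPERM);
        }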

From patchwork Thu Mar 10 14:09:00 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776403
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    qemu-devel@nongnu.org
Subject: [PATCH v5 02/13] mm: Introduce memfile_notifier
Date: Thu, 10 Mar 2022 22:09:00 +0800
Message-Id: <20220310140911.50924-3-chao.p.peng@linux.intel.com>
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>

This patch introduces the memfile_notifier facility so that existing
memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages
to a third kernel component, which gets notified when pages in the
memory file are allocated or invalidated. It will be used by KVM to
support a file descriptor as the guest memory backing store: KVM uses
this memfile_notifier interface to interact with the memory file
subsystems. In the future there may be other consumers (e.g. VFIO with
encrypted device memory).

It consists of two sets of callbacks:
  - memfile_notifier_ops: callbacks for the memory backing store to
    notify KVM when memory gets allocated/invalidated.
  - memfile_pfn_ops: callbacks for KVM to call into the memory backing
    store to request memory pages for guest private memory.

Userspace is in charge of the guest memory lifecycle: it first
allocates pages in the memory backing store and then passes the fd to
KVM, letting KVM register each memory slot with the backing store via
memfile_register_notifier.

A supported memory backing store should maintain a memfile_notifier
list and provide a routine for memfile_notifier to get the list head
address and the memfile_pfn_ops callbacks for
memfile_register_notifier. It should also call
memfile_notifier_fallocate/memfile_notifier_invalidate when the
bookmarked memory gets allocated/invalidated.
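As a sketch of the consumer side (hypothetical kernel code written
against the API introduced below; KVM's real wiring arrives later in
this series, and all demo_* names here are made up):

        #include <linux/fs.h>
        #include <linux/memfile_notifier.h>

        static void demo_invalidate(struct memfile_notifier *notifier,
                                    pgoff_t start, pgoff_t end)
        {
                /* Tear down anything derived from pages in [start, end). */
        }

        static void demo_fallocate(struct memfile_notifier *notifier,
                                   pgoff_t start, pgoff_t end)
        {
                /* Pages appeared in [start, end); map them on next access. */
        }

        static struct memfile_notifier_ops demo_ops = {
                .invalidate = demo_invalidate,
                .fallocate  = demo_fallocate,
        };

        static struct memfile_notifier demo_notifier = { .ops = &demo_ops };
        static struct memfile_pfn_ops *demo_pfn_ops;

        static int demo_attach(struct file *file)
        {
                /* On success, demo_pfn_ops can be used to look up pfns. */
                return memfile_register_notifier(file_inode(file),
                                                 &demo_notifier,
                                                 &demo_pfn_ops);
        }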
Co-developed-by: Kirill A. Shutemov
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Chao Peng
---
 include/linux/memfile_notifier.h |  64 +++++++++++++++++
 mm/Kconfig                       |   4 ++
 mm/Makefile                      |   1 +
 mm/memfile_notifier.c            | 114 +++++++++++++++++++++++++++++++
 4 files changed, 183 insertions(+)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c

diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h
new file mode 100644
index 000000000000..e8d400558adb
--- /dev/null
+++ b/include/linux/memfile_notifier.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMFILE_NOTIFIER_H
+#define _LINUX_MEMFILE_NOTIFIER_H
+
+#include
+#include
+#include
+#include
+
+struct memfile_notifier;
+
+struct memfile_notifier_ops {
+       void (*invalidate)(struct memfile_notifier *notifier,
+                          pgoff_t start, pgoff_t end);
+       void (*fallocate)(struct memfile_notifier *notifier,
+                         pgoff_t start, pgoff_t end);
+};
+
+struct memfile_pfn_ops {
+       long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
+       void (*put_unlock_pfn)(unsigned long pfn);
+};
+
+struct memfile_notifier {
+       struct list_head list;
+       struct memfile_notifier_ops *ops;
+};
+
+struct memfile_notifier_list {
+       struct list_head head;
+       spinlock_t lock;
+};
+
+struct memfile_backing_store {
+       struct list_head list;
+       struct memfile_pfn_ops pfn_ops;
+       struct memfile_notifier_list *(*get_notifier_list)(struct inode *inode);
+};
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+/* APIs for backing stores */
+static inline void memfile_notifier_list_init(struct memfile_notifier_list *list)
+{
+       INIT_LIST_HEAD(&list->head);
+       spin_lock_init(&list->lock);
+}
+
+extern void memfile_notifier_invalidate(struct memfile_notifier_list *list,
+                                       pgoff_t start, pgoff_t end);
+extern void memfile_notifier_fallocate(struct memfile_notifier_list *list,
+                                      pgoff_t start, pgoff_t end);
+extern void memfile_register_backing_store(struct memfile_backing_store *bs);
+extern void memfile_unregister_backing_store(struct memfile_backing_store *bs);
+
+/* APIs for notifier consumers */
+extern int memfile_register_notifier(struct inode *inode,
+                                    struct memfile_notifier *notifier,
+                                    struct memfile_pfn_ops **pfn_ops);
+extern void memfile_unregister_notifier(struct inode *inode,
+                                       struct memfile_notifier *notifier);
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
+#endif /* _LINUX_MEMFILE_NOTIFIER_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..7c6b1ad3dade 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -892,6 +892,10 @@ config ANON_VMA_NAME
          area from being merged with adjacent virtual memory areas due to the
          difference in their name.
 
+config MEMFILE_NOTIFIER
+       bool
+       select SRCU
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 70d4309c9ce3..f628256dce0d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -132,3 +132,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
+obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o
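Note that MEMFILE_NOTIFIER is a hidden bool, so a consumer has to turn
it on from its own Kconfig. An illustrative fragment (assumed, not part
of this patch; a later patch in this series is expected to do the
equivalent for KVM):

        config DEMO_CONSUMER
                bool "Demo memfile_notifier consumer (illustrative only)"
                select MEMFILE_NOTIFIER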
diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c
new file mode 100644
index 000000000000..a405db56fde2
--- /dev/null
+++ b/mm/memfile_notifier.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * linux/mm/memfile_notifier.c
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Chao Peng
+ */
+
+#include
+#include
+
+DEFINE_STATIC_SRCU(srcu);
+static LIST_HEAD(backing_store_list);
+
+void memfile_notifier_invalidate(struct memfile_notifier_list *list,
+                                pgoff_t start, pgoff_t end)
+{
+       struct memfile_notifier *notifier;
+       int id;
+
+       id = srcu_read_lock(&srcu);
+       list_for_each_entry_srcu(notifier, &list->head, list,
+                                srcu_read_lock_held(&srcu)) {
+               if (notifier->ops && notifier->ops->invalidate)
+                       notifier->ops->invalidate(notifier, start, end);
+       }
+       srcu_read_unlock(&srcu, id);
+}
+
+void memfile_notifier_fallocate(struct memfile_notifier_list *list,
+                               pgoff_t start, pgoff_t end)
+{
+       struct memfile_notifier *notifier;
+       int id;
+
+       id = srcu_read_lock(&srcu);
+       list_for_each_entry_srcu(notifier, &list->head, list,
+                                srcu_read_lock_held(&srcu)) {
+               if (notifier->ops && notifier->ops->fallocate)
+                       notifier->ops->fallocate(notifier, start, end);
+       }
+       srcu_read_unlock(&srcu, id);
+}
+
+void memfile_register_backing_store(struct memfile_backing_store *bs)
+{
+       BUG_ON(!bs || !bs->get_notifier_list);
+
+       list_add_tail(&bs->list, &backing_store_list);
+}
+
+void memfile_unregister_backing_store(struct memfile_backing_store *bs)
+{
+       list_del(&bs->list);
+}
+
+static int memfile_get_notifier_info(struct inode *inode,
+                                    struct memfile_notifier_list **list,
+                                    struct memfile_pfn_ops **ops)
+{
+       struct memfile_backing_store *bs, *iter;
+       struct memfile_notifier_list *tmp;
+
+       list_for_each_entry_safe(bs, iter, &backing_store_list, list) {
+               tmp = bs->get_notifier_list(inode);
+               if (tmp) {
+                       *list = tmp;
+                       if (ops)
+                               *ops = &bs->pfn_ops;
+                       return 0;
+               }
+       }
+       return -EOPNOTSUPP;
+}
+
+int memfile_register_notifier(struct inode *inode,
+                             struct memfile_notifier *notifier,
+                             struct memfile_pfn_ops **pfn_ops)
+{
+       struct memfile_notifier_list *list;
+       int ret;
+
+       if (!inode || !notifier || !pfn_ops)
+               return -EINVAL;
+
+       ret = memfile_get_notifier_info(inode, &list, pfn_ops);
+       if (ret)
+               return ret;
+
+       spin_lock(&list->lock);
+       list_add_rcu(&notifier->list, &list->head);
+       spin_unlock(&list->lock);
+
+       return 0;
+}
+EXPORT_SYMBOL_GPL(memfile_register_notifier);
+
+void memfile_unregister_notifier(struct inode *inode,
+                                struct memfile_notifier *notifier)
+{
+       struct memfile_notifier_list *list;
+
+       if (!inode || !notifier)
+               return;
+
+       BUG_ON(memfile_get_notifier_info(inode, &list, NULL));
+
+       spin_lock(&list->lock);
+       list_del_rcu(&notifier->list);
+       spin_unlock(&list->lock);
+
+       synchronize_srcu(&srcu);
+}
+EXPORT_SYMBOL_GPL(memfile_unregister_notifier);

From patchwork Thu Mar 10 14:09:01 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776404
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    qemu-devel@nongnu.org
Subject: [PATCH v5 03/13] mm/shmem: Support memfile_notifier
Date: Thu, 10 Mar 2022 22:09:01 +0800
Message-Id: <20220310140911.50924-4-chao.p.peng@linux.intel.com>
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>

From: "Kirill A. Shutemov"

Maintain a memfile_notifier list in the shmem_inode_info structure and
implement the memfile_pfn_ops callbacks defined by memfile_notifier,
exposing them to the memfile_notifier core via shmem_get_notifier_list.

We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
allocated by userspace for private memory. If no page has been
allocated at the offset, an error is returned so KVM knows that the
memory is not private memory.
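To make the contract concrete, here is a hypothetical consumer-side
lookup (sketch only; demo_pfn_ops is the pointer obtained from
memfile_register_notifier() in the sketch after patch 02's
description):

        #include <linux/memfile_notifier.h>

        extern struct memfile_pfn_ops *demo_pfn_ops;

        static int demo_map_offset(struct inode *inode, pgoff_t offset)
        {
                int order;
                long pfn;

                /*
                 * SGP_NOALLOC semantics: this fails if userspace never
                 * fallocate()d the offset, so a negative value means
                 * "not private memory".
                 */
                pfn = demo_pfn_ops->get_lock_pfn(inode, offset, &order);
                if (pfn < 0)
                        return (int)pfn;

                /* ... install pfn into the secondary MMU here ... */

                /* Marks the page dirty, unlocks and releases it. */
                demo_pfn_ops->put_unlock_pfn((unsigned long)pfn);
                return 0;
        }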
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Chao Peng
---
 include/linux/shmem_fs.h |  4 +++
 mm/shmem.c               | 76 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 2dde843f28ef..7bb16f2d2825 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include <linux/memfile_notifier.h>
 
 /* inode in-kernel data */
 
@@ -28,6 +29,9 @@ struct shmem_inode_info {
        struct simple_xattrs    xattrs;         /* list of xattrs */
        atomic_t                stop_eviction;  /* hold when working on inode */
        unsigned int            xflags;         /* shmem extended flags */
+#ifdef CONFIG_MEMFILE_NOTIFIER
+       struct memfile_notifier_list memfile_notifiers;
+#endif
        struct inode            vfs_inode;
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 9b31a7056009..7b43e274c9a2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
        return page ? page_folio(page) : NULL;
 }
 
+static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFILE_NOTIFIER
+       struct shmem_inode_info *info = SHMEM_I(inode);
+
+       memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
+#endif
+}
+
+static void notify_invalidate_page(struct inode *inode, struct folio *folio,
+                                  pgoff_t start, pgoff_t end)
+{
+#ifdef CONFIG_MEMFILE_NOTIFIER
+       struct shmem_inode_info *info = SHMEM_I(inode);
+
+       start = max(start, folio->index);
+       end = min(end, folio->index + folio_nr_pages(folio));
+
+       memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
+#endif
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -946,6 +968,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
                        }
                        index += folio_nr_pages(folio) - 1;
 
+                       notify_invalidate_page(inode, folio, start, end);
+
                        if (!unfalloc || !folio_test_uptodate(folio))
                                truncate_inode_folio(mapping, folio);
                        folio_unlock(folio);
@@ -1019,6 +1043,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
                                        index--;
                                        break;
                                }
+
+                               notify_invalidate_page(inode, folio, start, end);
+
                                VM_BUG_ON_FOLIO(folio_test_writeback(folio),
                                                folio);
                                truncate_inode_folio(mapping, folio);
@@ -2279,6 +2306,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
                info->flags = flags & VM_NORESERVE;
                INIT_LIST_HEAD(&info->shrinklist);
                INIT_LIST_HEAD(&info->swaplist);
+#ifdef CONFIG_MEMFILE_NOTIFIER
+               memfile_notifier_list_init(&info->memfile_notifiers);
+#endif
                simple_xattrs_init(&info->xattrs);
                cache_no_acl(inode);
                mapping_set_large_folios(inode->i_mapping);
@@ -2802,6 +2832,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
        if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
                i_size_write(inode, offset + len);
        inode->i_ctime = current_time(inode);
+       notify_fallocate(inode, start, end);
 undone:
        spin_lock(&inode->i_lock);
        inode->i_private = NULL;
@@ -3909,6 +3940,47 @@ static struct file_system_type shmem_fs_type = {
        .fs_flags       = FS_USERNS_MOUNT,
 };
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
+{
+       struct page *page;
+       int ret;
+
+       ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC);
+       if (ret)
+               return ret;
+
+       *order = thp_order(compound_head(page));
+
+       return page_to_pfn(page);
+}
+
+static void shmem_put_unlock_pfn(unsigned long pfn)
+{
+       struct page *page = pfn_to_page(pfn);
+
+       VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+       set_page_dirty(page);
+       unlock_page(page);
+       put_page(page);
+}
+
+static struct memfile_notifier_list *shmem_get_notifier_list(struct inode *inode)
+{
+       if (!shmem_mapping(inode->i_mapping))
+               return NULL;
+
+       return &SHMEM_I(inode)->memfile_notifiers;
+}
+
+static struct memfile_backing_store shmem_backing_store = {
+       .pfn_ops.get_lock_pfn = shmem_get_lock_pfn,
+       .pfn_ops.put_unlock_pfn = shmem_put_unlock_pfn,
+       .get_notifier_list = shmem_get_notifier_list,
+};
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 int __init shmem_init(void)
 {
        int error;
@@ -3934,6 +4006,10 @@ int __init shmem_init(void)
        else
                shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
 #endif
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+       memfile_register_backing_store(&shmem_backing_store);
+#endif
        return 0;
 
 out1:

From patchwork Thu Mar 10 14:09:02 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776405
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    qemu-devel@nongnu.org
Subject: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
Date: Thu, 10 Mar 2022 22:09:02 +0800
Message-Id: <20220310140911.50924-5-chao.p.peng@linux.intel.com>
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>

Since page migration/swapping is not supported yet, MFD_INACCESSIBLE
memory behaves like long-term pinned pages and should therefore be
accounted to mm->pinned_vm and restricted by RLIMIT_MEMLOCK.
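The userspace-visible consequence, as an illustrative sketch (assuming
this patch applied; back_guest_ram() is a hypothetical VMM helper):

        #define _GNU_SOURCE
        #include <fcntl.h>      /* fallocate() */
        #include <errno.h>
        #include <stdio.h>

        /* fd is an MFD_INACCESSIBLE memfd */
        int back_guest_ram(int fd, off_t bytes)
        {
                if (fallocate(fd, 0, 0, bytes) == 0)
                        return 0;

                /*
                 * For an unprivileged caller, ENOMEM here can mean the
                 * pinned pages would exceed RLIMIT_MEMLOCK; raise the
                 * limit or grant CAP_IPC_LOCK.
                 */
                if (errno == ENOMEM)
                        fprintf(stderr, "fallocate: over RLIMIT_MEMLOCK?\n");
                return -1;
        }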
Signed-off-by: Chao Peng
---
 mm/shmem.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7b43e274c9a2..ae46fb96494b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
 static void notify_invalidate_page(struct inode *inode, struct folio *folio,
                                   pgoff_t start, pgoff_t end)
 {
-#ifdef CONFIG_MEMFILE_NOTIFIER
        struct shmem_inode_info *info = SHMEM_I(inode);
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
        start = max(start, folio->index);
        end = min(end, folio->index + folio_nr_pages(folio));
 
        memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
 #endif
+
+       if (info->xflags & SHM_F_INACCESSIBLE)
+               atomic64_sub(end - start, &current->mm->pinned_vm);
 }
 
 /*
@@ -2680,6 +2683,20 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
        return offset;
 }
 
+static bool memlock_limited(unsigned long npages)
+{
+       unsigned long lock_limit;
+       unsigned long pinned;
+
+       lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+       pinned = atomic64_add_return(npages, &current->mm->pinned_vm);
+       if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
+               atomic64_sub(npages, &current->mm->pinned_vm);
+               return true;
+       }
+       return false;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
                                                         loff_t len)
 {
@@ -2753,6 +2770,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
                goto out;
        }
 
+       if ((info->xflags & SHM_F_INACCESSIBLE) &&
+           memlock_limited(end - start)) {
+               error = -ENOMEM;
+               goto out;
+       }
+
        shmem_falloc.waitq = NULL;
        shmem_falloc.start = start;
        shmem_falloc.next  = start;

From patchwork Thu Mar 10 14:09:03 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776406
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    qemu-devel@nongnu.org
Subject: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory
Date: Thu, 10 Mar 2022 22:09:03 +0800
Message-Id: <20220310140911.50924-6-chao.p.peng@linux.intel.com>
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>

Extend the memslot definition to provide fd-based private memory
support by adding two new fields (private_fd/private_offset). A memslot
can then maintain both shared and private memory in a single slot:
shared pages are provided by the existing userspace_addr (hva) field,
and private pages are provided through the new
private_fd/private_offset fields.

Since there is no 'hva' concept for private memory, we cannot rely on
get_user_pages() to get a pfn; instead we use the newly added
memfile_notifier to do the same job.

This new extension is indicated by a new flag, KVM_MEM_PRIVATE.
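For illustration, how a VMM might populate the extended struct
(hypothetical userspace sketch; it assumes uapi headers from this
series, and set_private_slot() is a made-up helper):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        /* vm_fd: KVM VM fd; memfd: MFD_INACCESSIBLE backing fd */
        int set_private_slot(int vm_fd, int memfd, __u64 gpa, __u64 size,
                             __u64 hva)
        {
                struct kvm_userspace_memory_region_ext region = {
                        .region = {
                                .slot            = 0,
                                .flags           = KVM_MEM_PRIVATE,
                                .guest_phys_addr = gpa,
                                .memory_size     = size,
                                /* still used for the shared pages */
                                .userspace_addr  = hva,
                        },
                        .private_fd     = memfd,
                        .private_offset = 0,
                };

                return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
        }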
Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++-------
 include/linux/kvm_host.h       |  7 +++++++
 include/uapi/linux/kvm.h       |  8 ++++++++
 3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3acbf4d263a5..f76ac598606c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
        __u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+       struct kvm_userspace_memory_region region;
+       __u64 private_offset;
+       __u32 private_fd;
+       __u32 padding[5];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
   #define KVM_MEM_READONLY     (1UL << 1)
+  #define KVM_MEM_PRIVATE              (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1353,12 +1361,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields, plus additional fields for some specific features; see the description
+of the flags field below. It is recommended to use
+kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports the following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot. See the KVM_GET_DIRTY_LOG ioctl to know how to use
+  it.
+
+- KVM_MEM_READONLY can be set, if the KVM_CAP_READONLY_MEM capability allows
+  it, to make a new slot read-only. In this case, writes to this memory will be
+  posted to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate that a new slot has private memory
+  backed by a file descriptor (fd) whose content is invisible to userspace. In
+  this case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  the guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns with multiple memslots. Failure
+  to do so may result in undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9536ffa0473b..3be8116079d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -563,8 +563,15 @@ struct kvm_memory_slot {
        u32 flags;
        short id;
        u16 as_id;
+       struct file *private_file;
+       loff_t private_offset;
 };
 
+static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+{
+       return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
        return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 91a6fe4e02c0..a523d834efc8 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,13 @@ struct kvm_userspace_memory_region {
        __u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+struct kvm_userspace_memory_region_ext {
+       struct kvm_userspace_memory_region region;
+       __u64 private_offset;
+       __u32 private_fd;
+       __u32 padding[5];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -110,6 +117,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
 #define KVM_MEM_READONLY       (1UL << 1)
+#define KVM_MEM_PRIVATE                (1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {

From patchwork Thu Mar 10 14:09:04 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776407
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    qemu-devel@nongnu.org
d="scan'208";a="254994218" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:24 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655000" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:16 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext Date: Thu, 10 Mar 2022 22:09:04 +0800 Message-Id: <20220310140911.50924-7-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: D168F10000E X-Stat-Signature: 46hxqgjpepqfspz7mtauaeqa1kdhustq Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=NpAwKIFp; spf=none (imf05.hostedemail.com: domain of chao.p.peng@linux.intel.com has no SPF policy when checking 134.134.136.24) smtp.mailfrom=chao.p.peng@linux.intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspam-User: X-HE-Tag: 1646921425-571052 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Use the new extended memslot structure kvm_userspace_memory_region_ext. The extended part (private_fd/ private_offset) will be copied from userspace only when KVM_MEM_PRIVATE is set. Internally old kvm_userspace_memory_region will still be used for places where the extended fields are not needed. 
Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 arch/x86/kvm/x86.c       | 12 ++++++------
 include/linux/kvm_host.h |  4 ++--
 virt/kvm/kvm_main.c      | 30 ++++++++++++++++++++----------
 3 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8c06b8204fca..1d9dbef67715 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11757,13 +11757,13 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
        }
 
        for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-               struct kvm_userspace_memory_region m;
+               struct kvm_userspace_memory_region_ext m;
 
-               m.slot = id | (i << 16);
-               m.flags = 0;
-               m.guest_phys_addr = gpa;
-               m.userspace_addr = hva;
-               m.memory_size = size;
+               m.region.slot = id | (i << 16);
+               m.region.flags = 0;
+               m.region.guest_phys_addr = gpa;
+               m.region.userspace_addr = hva;
+               m.region.memory_size = size;
                r = __kvm_set_memory_region(kvm, &m);
                if (r < 0)
                        return ERR_PTR_USR(r);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3be8116079d4..c92c70174248 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1082,9 +1082,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
-                         const struct kvm_userspace_memory_region *mem);
+                         const struct kvm_userspace_memory_region_ext *region_ext);
 int __kvm_set_memory_region(struct kvm *kvm,
-                           const struct kvm_userspace_memory_region *mem);
+                           const struct kvm_userspace_memory_region_ext *region_ext);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 69c318fdff61..d11a2628b548 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1809,8 +1809,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-                           const struct kvm_userspace_memory_region *mem)
+                           const struct kvm_userspace_memory_region_ext *region_ext)
 {
+       const struct kvm_userspace_memory_region *mem = &region_ext->region;
        struct kvm_memory_slot *old, *new;
        struct kvm_memslots *slots;
        enum kvm_mr_change change;
@@ -1913,24 +1914,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
-                         const struct kvm_userspace_memory_region *mem)
+                         const struct kvm_userspace_memory_region_ext *region_ext)
 {
        int r;
 
        mutex_lock(&kvm->slots_lock);
-       r = __kvm_set_memory_region(kvm, mem);
+       r = __kvm_set_memory_region(kvm, region_ext);
        mutex_unlock(&kvm->slots_lock);
        return r;
 }
 EXPORT_SYMBOL_GPL(kvm_set_memory_region);
 
 static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
-                                         struct kvm_userspace_memory_region *mem)
+                                         struct kvm_userspace_memory_region_ext *region_ext)
 {
-       if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
+       if ((u16)region_ext->region.slot >= KVM_USER_MEM_SLOTS)
                return -EINVAL;
 
-       return kvm_set_memory_region(kvm, mem);
+       return kvm_set_memory_region(kvm, region_ext);
 }
 
 #ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
@@ -4476,14 +4477,23 @@ static long kvm_vm_ioctl(struct file *filp,
                break;
        }
        case KVM_SET_USER_MEMORY_REGION: {
-               struct kvm_userspace_memory_region kvm_userspace_mem;
+               struct kvm_userspace_memory_region_ext region_ext;
 
                r = -EFAULT;
-               if (copy_from_user(&kvm_userspace_mem, argp,
-                                  sizeof(kvm_userspace_mem)))
+               if (copy_from_user(&region_ext, argp,
+                                  sizeof(struct kvm_userspace_memory_region)))
                        goto out;
+               if (region_ext.region.flags & KVM_MEM_PRIVATE) {
+                       int offset = offsetof(
+                               struct kvm_userspace_memory_region_ext,
+                               private_offset);
+                       if (copy_from_user(&region_ext.private_offset,
+                                          argp + offset,
+                                          sizeof(region_ext) - offset))
+                               goto out;
+               }
 
-               r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
+               r = kvm_vm_ioctl_set_memory_region(kvm, &region_ext);
                break;
        }
        case KVM_GET_DIRTY_LOG: {

From patchwork Thu Mar 10 14:09:05 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776408
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    qemu-devel@nongnu.org
From patchwork Thu Mar 10 14:09:05 2022
From: Chao Peng
Subject: [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit
Date: Thu, 10 Mar 2022 22:09:05 +0800
Message-Id: <20220310140911.50924-8-chao.p.peng@linux.intel.com>

This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error occurred in KVM at the guest memory range
[gpa, gpa+size). The 'flags' field carries additional information to help
userspace handle the error. Currently bit 0 is defined as 'private memory':
when set, the error was caused by a private memory access; when clear, by a
shared memory access.

After private memory is enabled, this new exit will be used by KVM to exit
to userspace for shared memory <-> private memory conversion in memory
encryption usage. In such usage there are typically two kinds of conversion:
- explicit conversion: happens when the guest explicitly calls into KVM to
  map a range (as private or shared); KVM then exits to userspace to do the
  map/unmap operation.
- implicit conversion: happens in the KVM page fault handler.
  * if the fault is due to a private memory access, it causes a userspace
    exit with a shared->private conversion request when the page has not
    been allocated in the private memory backend.
  * if the fault is due to a shared memory access, it causes a userspace
    exit with a private->shared conversion request when the page has
    already been allocated in the private memory backend.

Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  9 +++++++++
 2 files changed, 31 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f76ac598606c..bad550c2212b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6216,6 +6216,28 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_ERROR */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+If exit reason is KVM_EXIT_MEMORY_ERROR, the VCPU has encountered a memory
+error which is not handled by the KVM kernel module, and userspace may choose
+to handle it. The 'flags' field indicates the memory properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - when set, the memory error was caused by a
+   private memory access; when clear, by a shared memory access.
+
+'gpa' and 'size' indicate the memory range at which the error occurred.
+Userspace may handle the error and return to KVM to retry the previous
+memory access.
+
 ::
 
 		/* Fix the size of the union. */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a523d834efc8..9ad0c8aa0263 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -278,6 +278,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_MEMORY_ERROR     36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -495,6 +496,14 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_MEMORY_ERROR */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
+			__u32 flags;
+			__u32 padding;
+			__u64 gpa;
+			__u64 size;
+		} memory;
 		/* Fix the size of the union.
 		 */
 		char padding[256];
 	};
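To see how the exit is meant to be consumed, here is a minimal VMM-side
sketch. KVM_EXIT_MEMORY_ERROR, KVM_MEMORY_EXIT_FLAG_PRIVATE and the 'memory'
member are as defined by this patch; handle_conversion() is a hypothetical
helper that adjusts the private fd (e.g. with fallocate()) for the reported
range:

    /* Returns true if the exit was handled and KVM_RUN can be retried. */
    static bool vmm_handle_memory_error(struct kvm_run *run)
    {
        bool to_private;

        if (run->exit_reason != KVM_EXIT_MEMORY_ERROR)
            return false;

        /* Flag set: the fault was a private access, so the page must be
         * converted shared -> private; flag clear: the other way round. */
        to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

        return handle_conversion(run->memory.gpa, run->memory.size,
                                 to_private) == 0;
    }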
From patchwork Thu Mar 10 14:09:06 2022
From: Chao Peng
Subject: [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages
Date: Thu, 10 Mar 2022 22:09:06 +0800
Message-Id: <20220310140911.50924-9-chao.p.peng@linux.intel.com>

Private pages are not mmapped into userspace, so KVM cannot rely on
get_user_pages() to obtain the pfn. Instead, we add a memfile_pfn_ops
pointer, pfn_ops, to each private memslot and use it to obtain the pfn
for a gfn. To do that, KVM converts the gfn to an offset into the fd and
then calls the get_lock_pfn callback. Once KVM completes its job, it
calls put_unlock_pfn to unlock the pfn. Note the pfn (page) is locked
between get_lock_pfn/put_unlock_pfn to ensure it remains valid while KVM
uses it to establish the mapping in the secondary MMU page table.

The pfn_ops is initialized via memfile_register_notifier from the memory
backing store that provides the private_fd.

Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 arch/x86/kvm/Kconfig     |  1 +
 include/linux/kvm_host.h | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..ca7b2a6a452a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,7 @@ config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select MEMFILE_NOTIFIER
 	help
 	  Support hosting fully virtualized guest machines using
 	  hardware virtualization extensions.
 	  You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c92c70174248..6e1d770d6bf8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@
 #include
 #include
+#include <linux/memfile_notifier.h>
 
 #ifndef KVM_MAX_VCPU_IDS
 #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -565,6 +566,7 @@ struct kvm_memory_slot {
 	u16 as_id;
 	struct file *private_file;
 	loff_t private_offset;
+	struct memfile_pfn_ops *pfn_ops;
 };
 
 static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
@@ -915,6 +917,7 @@ static inline void kvm_irqfd_exit(void)
 {
 }
 #endif
+
 int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		  struct module *module);
 void kvm_exit(void);
@@ -2217,4 +2220,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 /* Max number of entries allowed for each kvm dirty ring */
 #define KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
+				       int *order)
+{
+	pgoff_t index = gfn - slot->base_gfn +
+			(slot->private_offset >> PAGE_SHIFT);
+
+	return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file),
+					   index, order);
+}
+
+static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
+				       kvm_pfn_t pfn)
+{
+	slot->pfn_ops->put_unlock_pfn(pfn);
+}
+
+#else
+static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
+				       int *order)
+{
+	return -1;
+}
+
+static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
+				       kvm_pfn_t pfn)
+{
+}
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 #endif
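The intended calling pattern for this pair of helpers is sketched below;
install_private_spte() is a hypothetical stand-in for the MMU code that
actually consumes the pfn (the real caller arrives with the page-fault patch
later in this series):

    int order;
    long pfn = kvm_memfile_get_pfn(slot, gfn, &order);

    if (pfn >= 0) {
        /* The page is locked by the backing store, so the pfn stays
         * valid while we install it in the secondary MMU. */
        install_private_spte(vcpu, gfn, (kvm_pfn_t)pfn, order);
        kvm_memfile_put_pfn(slot, (kvm_pfn_t)pfn);
    } else {
        /* Not allocated in the backing store: the gfn is currently
         * shared, or a conversion exit to userspace is needed. */
    }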
From patchwork Thu Mar 10 14:09:07 2022
From: Chao Peng
Subject: [PATCH v5 09/13] KVM: Handle page fault for private memory
Date: Thu, 10 Mar 2022 22:09:07 +0800
Message-Id: <20220310140911.50924-10-chao.p.peng@linux.intel.com>

When a page fault happens for a memslot with KVM_MEM_PRIVATE, we use
kvm_memfile_get_pfn(), which further calls into the memfile_pfn_ops
callbacks defined for each memslot, to request the pfn from the memory
backing store. One assumption is that private pages are persistent and
pre-allocated in the private memory fd (backing store), so KVM uses this
information as an indicator of whether a page is private or shared (i.e.
the private fd is the final source of truth as to whether or not a GPA
is private).

Depending on whether the access is private or shared, we go down
different paths:
 - For private access, KVM checks if the page is already allocated in
   the memory backing store: if yes, KVM establishes the mapping;
   otherwise it exits to userspace to convert the shared page to a
   private one.
 - For shared access, KVM also checks if the page is already allocated
   in the memory backing store: if yes, it exits to userspace to convert
   the private page to a shared one; otherwise the page is treated as
   traditional hva-based shared memory, and KVM lets the existing code
   obtain a pfn with get_user_pages() and establish the mapping.
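In tabular form, the four cases described above are:

    Access type | Page in private fd? | Action
    ------------+---------------------+-----------------------------------------
    private     | yes                 | map the page as private
    private     | no                  | exit to userspace: shared -> private
    shared      | yes                 | exit to userspace: private -> shared
    shared      | no                  | map via get_user_pages() (hva-based path)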
Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 arch/x86/kvm/mmu/mmu.c         | 73 ++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
 2 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3b8da8b0745e..f04c823ea09a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2844,6 +2844,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
+	if (kvm_slot_is_private(slot))
+		return max_level;
+
 	host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
 	return min(host_level, max_level);
 }
@@ -3890,7 +3893,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }
 
-static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
+static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	/*
+	 * At this time private gfn has not been supported yet. Other patch
+	 * that enables it should change this.
+	 */
+	return false;
+}
+
+static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+				    struct kvm_page_fault *fault,
+				    bool *is_private_pfn, int *r)
+{
+	int order;
+	unsigned int flags = 0;
+	struct kvm_memory_slot *slot = fault->slot;
+	long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order);
+
+	if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
+		if (pfn < 0)
+			flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
+		else {
+			fault->pfn = pfn;
+			if (slot->flags & KVM_MEM_READONLY)
+				fault->map_writable = false;
+			else
+				fault->map_writable = true;
+
+			if (order == 0)
+				fault->max_level = PG_LEVEL_4K;
+			*is_private_pfn = true;
+			*r = RET_PF_FIXED;
+			return true;
+		}
+	} else {
+		if (pfn < 0)
+			return false;
+
+		kvm_memfile_put_pfn(slot, pfn);
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
+	vcpu->run->memory.flags = flags;
+	vcpu->run->memory.padding = 0;
+	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+	vcpu->run->memory.size = PAGE_SIZE;
+	fault->pfn = -1;
+	*r = -1;
+	return true;
+}
+
+static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+			    bool *is_private_pfn, int *r)
 {
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
@@ -3924,6 +3979,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		}
 	}
 
+	if (kvm_slot_is_private(slot) &&
+	    kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r))
+		return *r == RET_PF_FIXED ? false : true;
+
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
 					  fault->write, &fault->map_writable,
@@ -3984,6 +4043,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu);
 
 	unsigned long mmu_seq;
+	bool is_private_pfn = false;
 	int r;
 
 	fault->gfn = fault->addr >> PAGE_SHIFT;
@@ -4003,7 +4063,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (kvm_faultin_pfn(vcpu, fault, &r))
+	if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
 		return r;
 
 	if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
@@ -4016,7 +4076,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	else
 		write_lock(&vcpu->kvm->mmu_lock);
 
-	if (is_page_fault_stale(vcpu, fault, mmu_seq))
+	if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
 		goto out_unlock;
 
 	r = make_mmu_pages_available(vcpu);
@@ -4033,7 +4093,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+
+	if (is_private_pfn)
+		kvm_memfile_put_pfn(fault->slot, fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
+
 	return r;
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 252c77805eb9..6a5736699c0a 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -825,6 +825,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	int r;
 	unsigned long mmu_seq;
 	bool is_self_change_mapping;
+	bool is_private_pfn = false;
+
 
 	pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
 	WARN_ON_ONCE(fault->is_tdp);
@@ -873,7 +875,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (kvm_faultin_pfn(vcpu, fault, &r))
+	if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r))
 		return r;
 
 	if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
@@ -901,7 +903,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = RET_PF_RETRY;
 	write_lock(&vcpu->kvm->mmu_lock);
 
-	if (is_page_fault_stale(vcpu, fault, mmu_seq))
+	if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq))
 		goto out_unlock;
 
 	r = make_mmu_pages_available(vcpu);
@@ -911,7 +913,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
+	if (is_private_pfn)
+		kvm_memfile_put_pfn(fault->slot, fault->pfn);
+	else
+		kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
From patchwork Thu Mar 10 14:09:08 2022
From: Chao Peng
Subject: [PATCH v5 10/13] KVM: Register private memslot to memory backing store
Date: Thu, 10 Mar 2022 22:09:08 +0800
Message-Id: <20220310140911.50924-11-chao.p.peng@linux.intel.com>

Add a 'notifier' field to the memslot to make it a memfile_notifier node,
and register it with the memory backing store via
memfile_register_notifier() when the memslot gets created. When the
memslot is deleted, do the reverse with memfile_unregister_notifier().
Note that each KVM memslot can be registered with a different memory
backing store (or the same backing store but at a different offset)
independently.

Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 75 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6e1d770d6bf8..9b175aeca63f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -567,6 +567,7 @@ struct kvm_memory_slot {
 	struct file *private_file;
 	loff_t private_offset;
 	struct memfile_pfn_ops *pfn_ops;
+	struct memfile_notifier notifier;
 };
 
 static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d11a2628b548..67349421eae3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -840,6 +840,37 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
+{
+	return memfile_register_notifier(file_inode(slot->private_file),
+					 &slot->notifier,
+					 &slot->pfn_ops);
+}
+
+static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
+{
+	if (slot->private_file) {
+		memfile_unregister_notifier(file_inode(slot->private_file),
+					    &slot->notifier);
+		fput(slot->private_file);
+		slot->private_file = NULL;
+	}
+}
+
+#else /* !CONFIG_MEMFILE_NOTIFIER */
+
+static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot)
+{
+}
+
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
@@ -884,6 +915,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
 /* This does not remove the slot from struct kvm_memslots data structures */
 static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	if (slot->flags & KVM_MEM_PRIVATE)
+		kvm_memfile_unregister(slot);
+
 	kvm_destroy_dirty_bitmap(slot);
 
 	kvm_arch_free_memslot(kvm, slot);
@@ -1738,6 +1772,12 @@ static int kvm_set_memslot(struct kvm *kvm,
 		kvm_invalidate_memslot(kvm, old, invalid_slot);
 	}
 
+	if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) {
+		r = kvm_memfile_register(new);
+		if (r)
+			return r;
+	}
+
 	r = kvm_prepare_memory_region(kvm, old, new, change);
 	if (r) {
 		/*
@@ -1752,6 +1792,10 @@ static int kvm_set_memslot(struct kvm *kvm,
 	} else {
 		mutex_unlock(&kvm->slots_arch_lock);
 	}
+
+	if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE)
+		kvm_memfile_unregister(new);
+
 	return r;
 }
 
@@ -1817,6 +1861,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	enum kvm_mr_change change;
 	unsigned long npages;
 	gfn_t base_gfn;
+	struct file *file = NULL;
 	int as_id, id;
 	int r;
 
@@ -1890,14 +1935,24 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return 0;
 	}
 
+	if (mem->flags & KVM_MEM_PRIVATE) {
+		file = fdget(region_ext->private_fd).file;
+		if (!file)
+			return -EINVAL;
+	}
+
 	if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
-	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages))
-		return -EEXIST;
+	    kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) {
+		r = -EEXIST;
+		goto out;
+	}
 
 	/* Allocate a slot that will persist in the memslot. */
 	new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
-	if (!new)
-		return -ENOMEM;
+	if (!new) {
+		r = -ENOMEM;
+		goto out;
+	}
 
 	new->as_id = as_id;
 	new->id = id;
@@ -1905,10 +1960,18 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->npages = npages;
 	new->flags = mem->flags;
 	new->userspace_addr = mem->userspace_addr;
+	new->private_file = file;
+	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
+			      region_ext->private_offset : 0;
 
 	r = kvm_set_memslot(kvm, old, new, change);
-	if (r)
-		kfree(new);
+	if (!r)
+		return r;
+
+	kfree(new);
+out:
+	if (file)
+		fput(file);
 	return r;
 }
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
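Summarizing the lifecycle that the two helpers above implement (call flow as
in this patch, not a new API):

    KVM_SET_USER_MEMORY_REGION (KVM_MEM_PRIVATE, change == KVM_MR_CREATE)
        kvm_set_memslot()
            kvm_memfile_register(new)       registers slot->notifier on
                                            file_inode(private_file) and
                                            populates slot->pfn_ops
            kvm_prepare_memory_region()     on failure the slot is
                                            unregistered again
    slot deletion
        kvm_free_memslot()
            kvm_memfile_unregister(slot)    unregisters the notifier and
                                            fput()s the private fd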
From patchwork Thu Mar 10 14:09:09 2022
From: Chao Peng
Subject: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
Date: Thu, 10 Mar 2022 22:09:09 +0800
Message-Id: <20220310140911.50924-12-chao.p.peng@linux.intel.com>

KVM gets notified when memory pages change in the memory backing store.
When userspace allocates the memory with fallocate() or frees it with
fallocate(FALLOC_FL_PUNCH_HOLE), the backing store calls into the KVM
fallocate/invalidate callbacks respectively. To ensure KVM never maps
both the private and shared variants of a GPA into the guest, the
fallocate callback should zap the existing shared mapping and the
invalidate callback should zap the existing private mapping.

In the callbacks, KVM first converts the offset range into a gfn_range
and then calls the existing kvm_unmap_gfn_range(), which will zap the
shared or private mapping. Both callbacks pass in a memslot reference,
but we need 'kvm', so add a reference to it in the memslot structure.
Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 include/linux/kvm_host.h |  3 ++-
 virt/kvm/kvm_main.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9b175aeca63f..186b9b981a65 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -236,7 +236,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
@@ -568,6 +568,7 @@ struct kvm_memory_slot {
 	loff_t private_offset;
 	struct memfile_pfn_ops *pfn_ops;
 	struct memfile_notifier notifier;
+	struct kvm *kvm;
 };
 
 static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 67349421eae3..52319f49d58a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
 #ifdef CONFIG_MEMFILE_NOTIFIER
+static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
+					 pgoff_t start, pgoff_t end)
+{
+	int idx;
+	struct kvm_memory_slot *slot = container_of(notifier,
+						    struct kvm_memory_slot,
+						    notifier);
+	struct kvm_gfn_range gfn_range = {
+		.slot		= slot,
+		.start		= start - (slot->private_offset >> PAGE_SHIFT),
+		.end		= end - (slot->private_offset >> PAGE_SHIFT),
+		.may_block	= true,
+	};
+	struct kvm *kvm = slot->kvm;
+
+	gfn_range.start = max(gfn_range.start, slot->base_gfn);
+	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
+
+	if (gfn_range.start >= gfn_range.end)
+		return;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	KVM_MMU_LOCK(kvm);
+	kvm_unmap_gfn_range(kvm, &gfn_range);
+	kvm_flush_remote_tlbs(kvm);
+	KVM_MMU_UNLOCK(kvm);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
+	.invalidate = kvm_memfile_notifier_handler,
+	.fallocate = kvm_memfile_notifier_handler,
+};
+
 static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
 {
+	slot->notifier.ops = &kvm_memfile_notifier_ops;
 	return memfile_register_notifier(file_inode(slot->private_file),
 					 &slot->notifier,
 					 &slot->pfn_ops);
@@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->private_file = file;
 	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
 			      region_ext->private_offset : 0;
+	new->kvm = kvm;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (!r)
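For orientation, the userspace actions that trigger these callbacks are plain
fallocate() calls on the private fd (a sketch; 'memfd' is an fd created with
MFD_INACCESSIBLE, and 'offset'/'len' describe the converted range):

    /* Allocate backing pages: the store invokes the ->fallocate callback
     * and KVM zaps any existing shared mapping for the range. */
    if (fallocate(memfd, 0, offset, len) < 0)
        err(1, "fallocate");

    /* Free the pages again: the store invokes ->invalidate and KVM zaps
     * the existing private mapping. */
    if (fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) < 0)
        err(1, "fallocate(PUNCH_HOLE)");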
From patchwork Thu Mar 10 14:09:10 2022
From: Chao Peng
Subject: [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE
Date: Thu, 10 Mar 2022 22:09:10 +0800
Message-Id: <20220310140911.50924-13-chao.p.peng@linux.intel.com>

KVM_MEM_PRIVATE is not exposed by default, but architecture code can
turn it on by implementing kvm_arch_private_memory_supported().

Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 186b9b981a65..0150e952a131 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1432,6 +1432,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_private_memory_supported(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 52319f49d58a..df5311755a40 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1485,10 +1485,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
+{
+	return false;
+}
+
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_userspace_memory_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (kvm_arch_private_memory_supported(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1900,7 +1909,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -1913,10 +1922,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
 		return -EINVAL;
-	/* We can read the guest memory with __xxx_user() later on. */
 	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
-	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
-	    !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
+		return -EINVAL;
+	/* We can read the guest memory with __xxx_user() later on.
+	 */
+	if (!(mem->flags & KVM_MEM_PRIVATE) &&
+	    !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
@@ -1957,6 +1968,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
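An architecture would opt in by providing a non-weak override of this hook.
Nothing in this series does so yet; the sketch below is purely hypothetical,
including the predicate it calls:

    /* Hypothetical arch override (not part of this series). */
    bool kvm_arch_private_memory_supported(struct kvm *kvm)
    {
        /* e.g. allow private memory only for encrypted VM types */
        return kvm_is_protected_vm(kvm);   /* hypothetical predicate */
    }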
From patchwork Thu Mar 10 14:09:11 2022
From: Chao Peng
Subject: [PATCH v5 13/13] memfd_create.2: Describe MFD_INACCESSIBLE flag
Date: Thu, 10 Mar 2022 22:09:11 +0800
Message-Id: <20220310140911.50924-14-chao.p.peng@linux.intel.com>

Signed-off-by: Chao Peng
---
 man2/memfd_create.2 | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
index 89e9c4136..2698222ae 100644
--- a/man2/memfd_create.2
+++ b/man2/memfd_create.2
@@ -101,6 +101,19 @@ meaning that no other seals can be set on the file.
 .\" FIXME Why is the MFD_ALLOW_SEALING behavior not simply the default?
 .\" Is it worth adding some text explaining this?
 .TP
+.BR MFD_INACCESSIBLE
+Disallow userspace access through ordinary MMU accesses via
+.BR read (2),
+.BR write (2)
+and
+.BR mmap (2).
+The file size cannot be changed once initialized.
+This flag cannot coexist with
+.B MFD_ALLOW_SEALING
+and when this flag is set, the initial set of seals will be
+.B F_SEAL_SEAL,
+meaning that no other seals can be set on the file.
+.TP
 .BR MFD_HUGETLB " (since Linux 4.14)"
 .\" commit 749df87bd7bee5a79cef073f5d032ddb2b211de8
 The anonymous file will be created in the hugetlbfs filesystem using
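A minimal usage sketch for the new flag follows. MFD_INACCESSIBLE is
introduced by this series and is not in released kernel headers, so the
fallback definition below mirrors the series' uapi patch and should be
treated as an assumption:

    #define _GNU_SOURCE
    #include <err.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MFD_INACCESSIBLE
    #define MFD_INACCESSIBLE 0x0008U   /* assumed value, per this series */
    #endif

    int main(void)
    {
        int fd = memfd_create("guest-mem", MFD_INACCESSIBLE);
        if (fd < 0)
            err(1, "memfd_create");

        /* Set the size once; it cannot be changed afterwards. */
        if (ftruncate(fd, 2 * 1024 * 1024) < 0)
            err(1, "ftruncate");

        /* read(2)/write(2)/mmap(2) on fd now fail; instead the fd is
         * handed to KVM as the private_fd of a KVM_MEM_PRIVATE slot. */
        return 0;
    }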