From patchwork Mon Jul 27 16:29:29 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mike Rapoport X-Patchwork-Id: 11687175 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5A18E13B6 for ; Mon, 27 Jul 2020 16:29:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 425E720FC3 for ; Mon, 27 Jul 2020 16:29:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867398; bh=hdVX0IsUo7eSkaqkoaF98FlrWWgQtbf1bZ3UXG9vp9E=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=cMo6XFQV7oPbHh1/A/I1QjCMK3oUMvDqe67CzGi+fT3M5Iqaen4TUctjKVgxORD1y sFsdgzxKA9lLAQt6RY6vzFe4DpZ3wj3Kzi8E8BRMCEZmU/QAJ02p1fJ8hZymF7u2Ci OP+eISUKpCfOdS00c70OKvsXrTYMk1y20pzsK7cA= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732075AbgG0Q34 (ORCPT ); Mon, 27 Jul 2020 12:29:56 -0400 Received: from mail.kernel.org ([198.145.29.99]:58476 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730778AbgG0Q34 (ORCPT ); Mon, 27 Jul 2020 12:29:56 -0400 Received: from aquarius.haifa.ibm.com (nesher1.haifa.il.ibm.com [195.110.40.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 60A6A20775; Mon, 27 Jul 2020 16:29:47 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867395; bh=hdVX0IsUo7eSkaqkoaF98FlrWWgQtbf1bZ3UXG9vp9E=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=tpKi6YXufrADChYkIl1EiNp/etiAQqY8b6izlKQYhbgervIIKuPo/MhJD6AQKJRoV 0Ewn8WOwQZMfRXHtZHxCOV7Yg2QQlENg/hqmpF47QkCvnMZgby/XZSduYQFgHeQHEu IRrJOSO5v+O0uFgkTUYV/hLN6+Qu/tTEDQVXuTSg= From: Mike Rapoport To: linux-kernel@vger.kernel.org Cc: Alexander 
Viro , Andrew Morton , Andy Lutomirski , Arnd Bergmann , Borislav Petkov , Catalin Marinas , Christopher Lameter , Dan Williams , Dave Hansen , Elena Reshetova , "H. Peter Anvin" , Idan Yaniv , Ingo Molnar , James Bottomley , "Kirill A. Shutemov" , Matthew Wilcox , Mike Rapoport , Mike Rapoport , Michael Kerrisk , Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Thomas Gleixner , Tycho Andersen , Will Deacon , linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org Subject: [PATCH v2 1/7] mm: add definition of PMD_PAGE_ORDER Date: Mon, 27 Jul 2020 19:29:29 +0300 Message-Id: <20200727162935.31714-2-rppt@kernel.org> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20200727162935.31714-1-rppt@kernel.org> References: <20200727162935.31714-1-rppt@kernel.org> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Mike Rapoport The definition of PMD_PAGE_ORDER, denoting the order (log2 of the number of base pages) of a second-level leaf page, is already used by DAX under the name PMD_ORDER and may be handy in other cases as well. Several architectures already define PMD_ORDER as the size of the second-level page table, so to avoid a conflict with these definitions use the name PMD_PAGE_ORDER and update DAX accordingly. 
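As a rough illustration of the arithmetic (a sketch only: the shift values below are assumptions matching x86-64 with 4 KiB base pages, other configurations differ, and pmd_nr_base_pages() is a hypothetical helper that is not part of this patch):

```c
/*
 * Assumed values for x86-64 with 4 KiB base pages and 2 MiB PMD-level
 * pages. In the kernel these come from the architecture headers.
 */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21

/* Same expression as the definition added by this patch. */
#define PMD_PAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)

/* Hypothetical helper: number of base pages covered by one PMD-level page. */
static unsigned long pmd_nr_base_pages(void)
{
	return 1UL << PMD_PAGE_ORDER;
}
```

With these assumed shifts the order is 9, so one PMD-level page covers 512 base pages.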
Signed-off-by: Mike Rapoport --- fs/dax.c | 10 +++++----- include/linux/pgtable.h | 3 +++ 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 11b16729b86f..b91d8c8dda45 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -50,7 +50,7 @@ static inline unsigned int pe_order(enum page_entry_size pe_size) #define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT) /* The order of a PMD entry */ -#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) +#define PMD_PAGE_ORDER (PMD_SHIFT - PAGE_SHIFT) static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES]; @@ -98,7 +98,7 @@ static bool dax_is_locked(void *entry) static unsigned int dax_entry_order(void *entry) { if (xa_to_value(entry) & DAX_PMD) - return PMD_ORDER; + return PMD_PAGE_ORDER; return 0; } @@ -1456,7 +1456,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, { struct vm_area_struct *vma = vmf->vma; struct address_space *mapping = vma->vm_file->f_mapping; - XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER); + XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_PAGE_ORDER); unsigned long pmd_addr = vmf->address & PMD_MASK; bool write = vmf->flags & FAULT_FLAG_WRITE; bool sync; @@ -1515,7 +1515,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, * entry is already in the array, for instance), it will return * VM_FAULT_FALLBACK. 
*/ - entry = grab_mapping_entry(&xas, mapping, PMD_ORDER); + entry = grab_mapping_entry(&xas, mapping, PMD_PAGE_ORDER); if (xa_is_internal(entry)) { result = xa_to_internal(entry); goto fallback; @@ -1681,7 +1681,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) if (order == 0) ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn); #ifdef CONFIG_FS_DAX_PMD - else if (order == PMD_ORDER) + else if (order == PMD_PAGE_ORDER) ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE); #endif else diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 56c1e8eb7bb0..79f8443609e7 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -28,6 +28,9 @@ #define USER_PGTABLES_CEILING 0UL #endif +/* Order of a second level leaf page, i.e. log2 of the number of base pages in it */ +#define PMD_PAGE_ORDER (PMD_SHIFT - PAGE_SHIFT) + /* * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD] * From patchwork Mon Jul 27 16:29:30 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mike Rapoport X-Patchwork-Id: 11687183 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 14633138C for ; Mon, 27 Jul 2020 16:30:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EF2672083E for ; Mon, 27 Jul 2020 16:30:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867409; bh=i3ck5owJRi0EzoQDSEZTjtoEmjw9+HiuwoyKwgJ9tdg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=n12sTpC8mD2Q+bqwK782BIoEWEzNAQsEqB4TdPTl7Mxha+KgjXltaMXed7CG70/h/ 0K6k+ivlgSKutB3No2Mu9RjwghUS2DdDTsywrOyYSneNhE9u2T5AN9h+C76TMQBawo w1MQpvyMs790yO+adMrrcAlCIInQzwLBsiBl53QQ= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730939AbgG0QaF (ORCPT ); 
Mon, 27 Jul 2020 12:30:05 -0400 Received: from mail.kernel.org ([198.145.29.99]:58690 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730778AbgG0QaE (ORCPT ); Mon, 27 Jul 2020 12:30:04 -0400 Received: from aquarius.haifa.ibm.com (nesher1.haifa.il.ibm.com [195.110.40.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id E548F2075A; Mon, 27 Jul 2020 16:29:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867403; bh=i3ck5owJRi0EzoQDSEZTjtoEmjw9+HiuwoyKwgJ9tdg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=jYQXxt3mDojgCrkM9qOVf+6yiwFkDDov4EflZwIMMkgnm/BN9oWlS00HvA/SdRmtX bN3VmIfmqkFJTIFb3IIwfAQ5GcNm2s76hAWkkN4yuKapACBTO2PWtn1Hhe1s8bH+TZ 02+m3Goh4AYw7uT/WY+dAJ7S9qOSzd6Le8sgdhRs= From: Mike Rapoport To: linux-kernel@vger.kernel.org Cc: Alexander Viro , Andrew Morton , Andy Lutomirski , Arnd Bergmann , Borislav Petkov , Catalin Marinas , Christopher Lameter , Dan Williams , Dave Hansen , Elena Reshetova , "H. Peter Anvin" , Idan Yaniv , Ingo Molnar , James Bottomley , "Kirill A. 
Shutemov" , Matthew Wilcox , Mike Rapoport , Mike Rapoport , Michael Kerrisk , Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Thomas Gleixner , Tycho Andersen , Will Deacon , linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org Subject: [PATCH v2 2/7] mmap: make mlock_future_check() global Date: Mon, 27 Jul 2020 19:29:30 +0300 Message-Id: <20200727162935.31714-3-rppt@kernel.org> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20200727162935.31714-1-rppt@kernel.org> References: <20200727162935.31714-1-rppt@kernel.org> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Mike Rapoport It will be used by the upcoming secret memory implementation. Signed-off-by: Mike Rapoport --- mm/internal.h | 3 +++ mm/mmap.c | 5 ++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 9886db20d94f..af0a92f8f6bc 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -349,6 +349,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) extern void mlock_vma_page(struct page *page); extern unsigned int munlock_vma_page(struct page *page); +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len); + /* * Clear the page's PageMlocked(). 
This can be useful in a situation where * we want to unconditionally remove a page from the pagecache -- e.g., diff --git a/mm/mmap.c b/mm/mmap.c index 8c7ca737a19b..ee92b7b4b185 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1310,9 +1310,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint) return hint; } -static inline int mlock_future_check(struct mm_struct *mm, - unsigned long flags, - unsigned long len) +int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len) { unsigned long locked, lock_limit; From patchwork Mon Jul 27 16:29:31 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mike Rapoport X-Patchwork-Id: 11687187 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5F0306C1 for ; Mon, 27 Jul 2020 16:30:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 396C42078A for ; Mon, 27 Jul 2020 16:30:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867418; bh=nwPr1m6QwakaD96Vicd58BMyOzSimKClw0mUV6BuDKU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=ecEhuLHQ7U5S8N1pypTZlMSFtRMGJdWqUKDvvxhFhu0Ri0Y3Gw1BZSXj59T9m4qAh 33gNBkWd2XtN8hYWf9VKc8ToQ2U56UuzzYOn4YGbQ2mJD/xTWn/CRaAlC4qM275IUC N+vniW/x2d8sGAie3HlJ2zfhsJPp2wgI3bcyIaRM= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731223AbgG0QaO (ORCPT ); Mon, 27 Jul 2020 12:30:14 -0400 Received: from mail.kernel.org ([198.145.29.99]:58878 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730759AbgG0QaO (ORCPT ); Mon, 27 Jul 2020 12:30:14 -0400 Received: from aquarius.haifa.ibm.com (nesher1.haifa.il.ibm.com [195.110.40.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA256 (128/128 bits)) (No client 
certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 7A1FA20719; Mon, 27 Jul 2020 16:30:04 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867412; bh=nwPr1m6QwakaD96Vicd58BMyOzSimKClw0mUV6BuDKU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=J6NnfAGgxq3XeXeTz9P9N9swbf0NhSdVT+EQ7g7i3FWsrmPpjLjhCnAc7U81DhJFh 3EPPAkR74gNItRp66nDuluA6QBm9s0vk0ZgVRJDaqYXViOY5KLMPHswRoneGmQWGVL 7C+xLocl7uFkWKwY+YehCsQAUm+Lp5YzbEmSwKvc= From: Mike Rapoport To: linux-kernel@vger.kernel.org Cc: Alexander Viro , Andrew Morton , Andy Lutomirski , Arnd Bergmann , Borislav Petkov , Catalin Marinas , Christopher Lameter , Dan Williams , Dave Hansen , Elena Reshetova , "H. Peter Anvin" , Idan Yaniv , Ingo Molnar , James Bottomley , "Kirill A. Shutemov" , Matthew Wilcox , Mike Rapoport , Mike Rapoport , Michael Kerrisk , Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Thomas Gleixner , Tycho Andersen , Will Deacon , linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org Subject: [PATCH v2 3/7] mm: introduce memfd_secret system call to create "secret" memory areas Date: Mon, 27 Jul 2020 19:29:31 +0300 Message-Id: <20200727162935.31714-4-rppt@kernel.org> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20200727162935.31714-1-rppt@kernel.org> References: <20200727162935.31714-1-rppt@kernel.org> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Mike Rapoport Introduce the "memfd_secret" system call, which creates memory areas that are visible only in the context of the owning process and are mapped neither into other processes nor into the kernel page tables. 
The user creates a file descriptor with the memfd_secret() system call; the flags passed to it select the protection mode of the memory associated with that file descriptor. Currently there are two protection modes: * exclusive - the memory area is unmapped from the kernel direct map and is present only in the page tables of the owning mm. * uncached - the memory area is present only in the page tables of the owning mm and is mapped there as uncached. For instance, the following example creates an uncached mapping (error handling is omitted): fd = memfd_secret(SECRETMEM_UNCACHED); ftruncate(fd, MAP_SIZE); ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); Signed-off-by: Mike Rapoport --- include/uapi/linux/magic.h | 1 + include/uapi/linux/secretmem.h | 9 ++ kernel/sys_ni.c | 2 + mm/Kconfig | 4 + mm/Makefile | 1 + mm/secretmem.c | 266 +++++++++++++++++++++++++++++++++ 6 files changed, 283 insertions(+) create mode 100644 include/uapi/linux/secretmem.h create mode 100644 mm/secretmem.c diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index f3956fc11de6..35687dcb1a42 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -97,5 +97,6 @@ #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ #define Z3FOLD_MAGIC 0x33 #define PPC_CMM_MAGIC 0xc7571590 +#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ #endif /* __LINUX_MAGIC_H__ */ diff --git a/include/uapi/linux/secretmem.h b/include/uapi/linux/secretmem.h new file mode 100644 index 000000000000..cef7a59f7492 --- /dev/null +++ b/include/uapi/linux/secretmem.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_SECRETMEM_H +#define _UAPI_LINUX_SECRETMEM_H + +/* secretmem operation modes */ +#define SECRETMEM_EXCLUSIVE 0x1 +#define SECRETMEM_UNCACHED 0x2 + +#endif /* _UAPI_LINUX_SECRETMEM_H */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 
3b69a560a7ac..fd40e1c083e5 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -349,6 +349,8 @@ COND_SYSCALL(pkey_mprotect); COND_SYSCALL(pkey_alloc); COND_SYSCALL(pkey_free); +/* memfd_secret */ +COND_SYSCALL(memfd_secret); /* * Architecture specific weak syscall entries. diff --git a/mm/Kconfig b/mm/Kconfig index f2104cc0d35c..8378175e72a4 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -872,4 +872,8 @@ config ARCH_HAS_HUGEPD config MAPPING_DIRTY_HELPERS bool +config SECRETMEM + def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED + select GENERIC_ALLOCATOR + endmenu diff --git a/mm/Makefile b/mm/Makefile index 6e9d46b2efc9..c2aa7a393b73 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o obj-$(CONFIG_PTDUMP_CORE) += ptdump.o obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o +obj-$(CONFIG_SECRETMEM) += secretmem.o diff --git a/mm/secretmem.c b/mm/secretmem.c new file mode 100644 index 000000000000..9d29f3e1c49d --- /dev/null +++ b/mm/secretmem.c @@ -0,0 +1,266 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include + +#include "internal.h" + +#undef pr_fmt +#define pr_fmt(fmt) "secretmem: " fmt + +#define SECRETMEM_MODE_MASK (SECRETMEM_EXCLUSIVE | SECRETMEM_UNCACHED) +#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK + +struct secretmem_ctx { + unsigned int mode; +}; + +static struct page *secretmem_alloc_page(gfp_t gfp) +{ + /* + * FIXME: use a cache of large pages to reduce the direct map + * fragmentation + */ + return alloc_page(gfp); +} + +static vm_fault_t secretmem_fault(struct vm_fault *vmf) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + struct inode *inode = file_inode(vmf->vma->vm_file); + pgoff_t offset = vmf->pgoff; + unsigned long addr; + struct page *page; + int ret = 0; + + if (((loff_t)vmf->pgoff << 
PAGE_SHIFT) >= i_size_read(inode)) + return vmf_error(-EINVAL); + + page = find_get_entry(mapping, offset); + if (!page) { + page = secretmem_alloc_page(vmf->gfp_mask); + if (!page) + return vmf_error(-ENOMEM); + + ret = add_to_page_cache(page, mapping, offset, vmf->gfp_mask); + if (unlikely(ret)) + goto err_put_page; + + ret = set_direct_map_invalid_noflush(page); + if (ret) + goto err_del_page_cache; + + addr = (unsigned long)page_address(page); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + + __SetPageUptodate(page); + + ret = VM_FAULT_LOCKED; + } + + vmf->page = page; + return ret; + +err_del_page_cache: + delete_from_page_cache(page); +err_put_page: + put_page(page); + return vmf_error(ret); +} + +static const struct vm_operations_struct secretmem_vm_ops = { + .fault = secretmem_fault, +}; + +static int secretmem_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct secretmem_ctx *ctx = file->private_data; + unsigned long mode = ctx->mode; + unsigned long len = vma->vm_end - vma->vm_start; + + if (!mode) + return -EINVAL; + + if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0) + return -EINVAL; + + if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len)) + return -EAGAIN; + + switch (mode) { + case SECRETMEM_UNCACHED: + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + fallthrough; + case SECRETMEM_EXCLUSIVE: + vma->vm_ops = &secretmem_vm_ops; + break; + default: + return -EINVAL; + } + + vma->vm_flags |= VM_LOCKED; + + return 0; +} + +const struct file_operations secretmem_fops = { + .mmap = secretmem_mmap, +}; + +static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode) +{ + return false; +} + +static int secretmem_migratepage(struct address_space *mapping, + struct page *newpage, struct page *page, + enum migrate_mode mode) +{ + return -EBUSY; +} + +static void secretmem_freepage(struct page *page) +{ + set_direct_map_default_noflush(page); +} + +static const struct address_space_operations 
secretmem_aops = { + .freepage = secretmem_freepage, + .migratepage = secretmem_migratepage, + .isolate_page = secretmem_isolate_page, +}; + +static struct vfsmount *secretmem_mnt; + +static struct file *secretmem_file_create(unsigned long flags) +{ + struct file *file = ERR_PTR(-ENOMEM); + struct secretmem_ctx *ctx; + struct inode *inode; + + inode = alloc_anon_inode(secretmem_mnt->mnt_sb); + if (IS_ERR(inode)) + return ERR_CAST(inode); + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + goto err_free_inode; + + file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem", + O_RDWR, &secretmem_fops); + if (IS_ERR(file)) + goto err_free_ctx; + + mapping_set_unevictable(inode->i_mapping); + + inode->i_mapping->private_data = ctx; + inode->i_mapping->a_ops = &secretmem_aops; + + /* pretend we are a normal file with zero size */ + inode->i_mode |= S_IFREG; + inode->i_size = 0; + + file->private_data = ctx; + + ctx->mode = flags & SECRETMEM_MODE_MASK; + + return file; + +err_free_ctx: + kfree(ctx); +err_free_inode: + iput(inode); + return file; +} + +SYSCALL_DEFINE1(memfd_secret, unsigned long, flags) +{ + struct file *file; + unsigned int mode; + int fd, err; + + /* make sure local flags do not conflict with global fcntl.h */ + BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC); + + if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC)) + return -EINVAL; + + /* modes are mutually exclusive, only one mode bit should be set */ + mode = flags & SECRETMEM_FLAGS_MASK; + if (ffs(mode) != fls(mode)) + return -EINVAL; + + fd = get_unused_fd_flags(flags & O_CLOEXEC); + if (fd < 0) + return fd; + + file = secretmem_file_create(flags); + if (IS_ERR(file)) { + err = PTR_ERR(file); + goto err_put_fd; + } + + file->f_flags |= O_LARGEFILE; + + fd_install(fd, file); + return fd; + +err_put_fd: + put_unused_fd(fd); + return err; +} + +static void secretmem_evict_inode(struct inode *inode) +{ + struct secretmem_ctx *ctx = inode->i_private; + + truncate_inode_pages_final(&inode->i_data); + 
clear_inode(inode); + kfree(ctx); +} + +static const struct super_operations secretmem_super_ops = { + .evict_inode = secretmem_evict_inode, +}; + +static int secretmem_init_fs_context(struct fs_context *fc) +{ + struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC); + + if (!ctx) + return -ENOMEM; + ctx->ops = &secretmem_super_ops; + + return 0; +} + +static struct file_system_type secretmem_fs = { + .name = "secretmem", + .init_fs_context = secretmem_init_fs_context, + .kill_sb = kill_anon_super, +}; + +static int secretmem_init(void) +{ + int ret = 0; + + secretmem_mnt = kern_mount(&secretmem_fs); + if (IS_ERR(secretmem_mnt)) + ret = PTR_ERR(secretmem_mnt); + + return ret; +} +fs_initcall(secretmem_init); From patchwork Mon Jul 27 16:29:32 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mike Rapoport X-Patchwork-Id: 11687193 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 740B6138C for ; Mon, 27 Jul 2020 16:30:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 564252074F for ; Mon, 27 Jul 2020 16:30:25 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867425; bh=u1eiiMaArTRuCUqI0im6HZbRNLX7K+2+DDG+1VgTAkU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=ga5BXtH8U8vbFZ/Mk5TxVOxEt7uCdym9vZveP1ykvSAEtL8FPma2xyQ82DRTjzr6n 8TCB7YsuSCM0WsMBtMCCvYKaTPXbkngnTRtH150URksqoPGHoRoVomqe3pJXauV2fQ htjpkdJMbquS4B7svvEI0a16qweuoZ8KZjqMDgYY= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732111AbgG0QaW (ORCPT ); Mon, 27 Jul 2020 12:30:22 -0400 Received: from mail.kernel.org ([198.145.29.99]:59096 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730408AbgG0QaW (ORCPT ); Mon, 27 Jul 2020 
12:30:22 -0400 Received: from aquarius.haifa.ibm.com (nesher1.haifa.il.ibm.com [195.110.40.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 4617A206E7; Mon, 27 Jul 2020 16:30:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867421; bh=u1eiiMaArTRuCUqI0im6HZbRNLX7K+2+DDG+1VgTAkU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=0nX8W/cGw6detApkc5fvadBST2If0vYuGRcybJhAGjeKqSinDtAx5mDEi73QrkvBJ nJuyAhZt5SM9o8c2RSHn0oUWm+alokGOxAPsme+U5UbI0xxg5xe9xCpWRKY3xQMv9Y X2U7Pk31z9WUlWh4jcGDiSgMmzjO2nR0w9HR9SDk= From: Mike Rapoport To: linux-kernel@vger.kernel.org Cc: Alexander Viro , Andrew Morton , Andy Lutomirski , Arnd Bergmann , Borislav Petkov , Catalin Marinas , Christopher Lameter , Dan Williams , Dave Hansen , Elena Reshetova , "H. Peter Anvin" , Idan Yaniv , Ingo Molnar , James Bottomley , "Kirill A. Shutemov" , Matthew Wilcox , Mike Rapoport , Mike Rapoport , Michael Kerrisk , Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Thomas Gleixner , Tycho Andersen , Will Deacon , linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org, Palmer Dabbelt Subject: [PATCH v2 4/7] arch, mm: wire up memfd_secret system call where relevant Date: Mon, 27 Jul 2020 19:29:32 +0300 Message-Id: <20200727162935.31714-5-rppt@kernel.org> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20200727162935.31714-1-rppt@kernel.org> References: <20200727162935.31714-1-rppt@kernel.org> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Mike Rapoport Wire up the memfd_secret system call on architectures that define ARCH_HAS_SET_DIRECT_MAP, namely arm64, RISC-V and x86. 
Signed-off-by: Mike Rapoport Acked-by: Palmer Dabbelt Acked-by: Arnd Bergmann --- arch/arm64/include/asm/unistd32.h | 2 ++ arch/arm64/include/uapi/asm/unistd.h | 1 + arch/riscv/include/asm/unistd.h | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 7 ++++++- 7 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 6d95d0c8bf2f..a379ba31f7c4 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -885,6 +885,8 @@ __SYSCALL(__NR_openat2, sys_openat2) __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd) #define __NR_faccessat2 439 __SYSCALL(__NR_faccessat2, sys_faccessat2) +#define __NR_memfd_secret 440 +__SYSCALL(__NR_memfd_secret, sys_memfd_secret) /* * Please add new compat syscalls above this comment and update diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h index f83a70e07df8..ce2ee8f1e361 100644 --- a/arch/arm64/include/uapi/asm/unistd.h +++ b/arch/arm64/include/uapi/asm/unistd.h @@ -20,5 +20,6 @@ #define __ARCH_WANT_SET_GET_RLIMIT #define __ARCH_WANT_TIME32_SYSCALLS #define __ARCH_WANT_SYS_CLONE3 +#define __ARCH_WANT_MEMFD_SECRET #include diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h index 977ee6181dab..6c316093a1e5 100644 --- a/arch/riscv/include/asm/unistd.h +++ b/arch/riscv/include/asm/unistd.h @@ -9,6 +9,7 @@ */ #define __ARCH_WANT_SYS_CLONE +#define __ARCH_WANT_MEMFD_SECRET #include diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index d8f8a1a69ed1..6f8b5978053b 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -443,3 +443,4 @@ 437 i386 openat2 sys_openat2 438 i386 pidfd_getfd sys_pidfd_getfd 439 i386 faccessat2 sys_faccessat2 +440 i386 memfd_secret sys_memfd_secret diff 
--git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 78847b32e137..7d3775d1c3d7 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -360,6 +360,7 @@ 437 common openat2 sys_openat2 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 +440 common memfd_secret sys_memfd_secret # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index b951a87da987..e4d7b30867c6 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1005,6 +1005,7 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig, siginfo_t __user *info, unsigned int flags); asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags); +asmlinkage long sys_memfd_secret(unsigned long flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index f4a01305d9a6..7b288347c5a9 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -858,8 +858,13 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd) #define __NR_faccessat2 439 __SYSCALL(__NR_faccessat2, sys_faccessat2) +#ifdef __ARCH_WANT_MEMFD_SECRET +#define __NR_memfd_secret 440 +__SYSCALL(__NR_memfd_secret, sys_memfd_secret) +#endif + #undef __NR_syscalls -#define __NR_syscalls 440 +#define __NR_syscalls 441 /* * 32 bit systems traditionally used different From patchwork Mon Jul 27 16:29:33 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mike Rapoport X-Patchwork-Id: 11687203 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 50BA413B6 for ; Mon, 27 Jul 2020 16:30:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by 
mail.kernel.org (Postfix) with ESMTP id 33B122074F for ; Mon, 27 Jul 2020 16:30:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867436; bh=w7MT//2ewKymg3G1SsBcSskJNwyMSzR/MTcAsHwJqGg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=WD1BGAq5T+Fe9WiKaIaMhBPMflAO/pnf6XAZUFZ7YBRUa+Z1Cz+pS8mkzN6vqfTVG FgjJdgfOcm9gFs40Hy81x0pW+qkTOptm40sn2Q2gEd1YuDevIK+KGTrg9zl8OVDjzx XHQ5iwEXrBblBpmO+FLtxEhvkfpnrhuYe5Xtltc4= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732140AbgG0Qab (ORCPT ); Mon, 27 Jul 2020 12:30:31 -0400 Received: from mail.kernel.org ([198.145.29.99]:59302 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730408AbgG0Qab (ORCPT ); Mon, 27 Jul 2020 12:30:31 -0400 Received: from aquarius.haifa.ibm.com (nesher1.haifa.il.ibm.com [195.110.40.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 0D4AB2078A; Mon, 27 Jul 2020 16:30:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1595867430; bh=w7MT//2ewKymg3G1SsBcSskJNwyMSzR/MTcAsHwJqGg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=EjHfqr+5h0X9yX2o/fKrEaH/dzMW6Fsp+d3tzNdvc/YFjhNrDQ8sWncQo5oob0yOv xeyqbsw4f1b+853VGI/GgLR7E1Y5d7PCuCUk42B7oglZ4/DUH0ImAk+4rePHEtf/nI hjOYFHWcvYFIP2lqZR8ayeMXex0VHqkpy120MqCU= From: Mike Rapoport To: linux-kernel@vger.kernel.org Cc: Alexander Viro , Andrew Morton , Andy Lutomirski , Arnd Bergmann , Borislav Petkov , Catalin Marinas , Christopher Lameter , Dan Williams , Dave Hansen , Elena Reshetova , "H. Peter Anvin" , Idan Yaniv , Ingo Molnar , James Bottomley , "Kirill A. 
Shutemov" , Matthew Wilcox , Mike Rapoport , Mike Rapoport , Michael Kerrisk , Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Thomas Gleixner , Tycho Andersen , Will Deacon , linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org Subject: [PATCH v2 5/7] mm: secretmem: use PMD-size pages to amortize direct map fragmentation Date: Mon, 27 Jul 2020 19:29:33 +0300 Message-Id: <20200727162935.31714-6-rppt@kernel.org> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20200727162935.31714-1-rppt@kernel.org> References: <20200727162935.31714-1-rppt@kernel.org> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Mike Rapoport Removing a PAGE_SIZE page from the direct map every time such a page is allocated for a secret memory mapping causes severe fragmentation of the direct map. This fragmentation can be reduced by using PMD-size pages as a pool from which the small pages for secret memory mappings are carved. Add a gen_pool per secretmem inode and lazily populate this pool with PMD-size pages. 
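The pooling idea can be modelled in user space roughly as follows. This is only a sketch under stated assumptions, not the code in this patch: struct chunk_pool and pool_alloc_page() are hypothetical names, malloc() stands in for alloc_pages(), a bump pointer stands in for gen_pool, the sizes assume 4 KiB base pages with 2 MiB PMD-level pages, and nothing is freed or removed from any direct map.

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE	4096UL			/* assumed base page size */
#define PMD_SIZE	(512UL * PAGE_SIZE)	/* assumed PMD-level page size */

/* Hypothetical user-space model of a lazily populated per-file pool. */
struct chunk_pool {
	char *chunk;		/* current PMD-size chunk, NULL until first use */
	size_t used;		/* bytes already handed out from the chunk */
	unsigned long refills;	/* how many PMD-size chunks were obtained */
};

/*
 * Hand out one PAGE_SIZE piece. A new PMD-size chunk is obtained only when
 * the pool is empty or exhausted, so the expensive step happens once per
 * 512 small allocations instead of on every allocation. Exhausted chunks
 * are deliberately not tracked or freed in this model.
 */
static void *pool_alloc_page(struct chunk_pool *pool)
{
	void *page;

	if (!pool->chunk || pool->used == PMD_SIZE) {
		char *chunk = malloc(PMD_SIZE);

		if (!chunk)
			return NULL;
		pool->chunk = chunk;
		pool->used = 0;
		pool->refills++;
	}

	page = pool->chunk + pool->used;
	pool->used += PAGE_SIZE;
	return page;
}
```

In this model a zero-initialized struct chunk_pool serves 512 calls to pool_alloc_page() from a single chunk, and the 513th call triggers the second refill.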
Signed-off-by: Mike Rapoport
---
 mm/secretmem.c | 107 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 88 insertions(+), 19 deletions(-)

diff --git a/mm/secretmem.c b/mm/secretmem.c
index 9d29f3e1c49d..da609701e10e 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -6,6 +6,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -25,24 +26,66 @@
 #define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK

 struct secretmem_ctx {
+	struct gen_pool *pool;
 	unsigned int mode;
 };

-static struct page *secretmem_alloc_page(gfp_t gfp)
+static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
 {
-	/*
-	 * FIXME: use a cache of large pages to reduce the direct map
-	 * fragmentation
-	 */
-	return alloc_page(gfp);
+	unsigned long nr_pages = (1 << PMD_PAGE_ORDER);
+	struct gen_pool *pool = ctx->pool;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	page = alloc_pages(gfp, PMD_PAGE_ORDER);
+	if (!page)
+		return -ENOMEM;
+
+	addr = (unsigned long)page_address(page);
+	split_page(page, PMD_PAGE_ORDER);
+
+	err = gen_pool_add(pool, addr, PMD_SIZE, NUMA_NO_NODE);
+	if (err) {
+		__free_pages(page, PMD_PAGE_ORDER);
+		return err;
+	}
+
+	__kernel_map_pages(page, nr_pages, 0);
+
+	return 0;
+}
+
+static struct page *secretmem_alloc_page(struct secretmem_ctx *ctx,
+					 gfp_t gfp)
+{
+	struct gen_pool *pool = ctx->pool;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (gen_pool_avail(pool) < PAGE_SIZE) {
+		err = secretmem_pool_increase(ctx, gfp);
+		if (err)
+			return NULL;
+	}
+
+	addr = gen_pool_alloc(pool, PAGE_SIZE);
+	if (!addr)
+		return NULL;
+
+	page = virt_to_page(addr);
+	get_page(page);
+
+	return page;
 }

 static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 {
+	struct secretmem_ctx *ctx = vmf->vma->vm_file->private_data;
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	pgoff_t offset = vmf->pgoff;
-	unsigned long addr;
 	struct page *page;
 	int ret = 0;

@@ -51,7 +94,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)

 	page = find_get_entry(mapping, offset);
 	if (!page) {
-		page = secretmem_alloc_page(vmf->gfp_mask);
+		page = secretmem_alloc_page(ctx, vmf->gfp_mask);
 		if (!page)
 			return vmf_error(-ENOMEM);

@@ -59,14 +102,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 		if (unlikely(ret))
 			goto err_put_page;

-		ret = set_direct_map_invalid_noflush(page);
-		if (ret)
-			goto err_del_page_cache;
-
-		addr = (unsigned long)page_address(page);
-		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
 		__SetPageUptodate(page);
+		set_page_private(page, (unsigned long)ctx);

 		ret = VM_FAULT_LOCKED;
 	}
@@ -74,8 +111,6 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 	vmf->page = page;
 	return ret;

-err_del_page_cache:
-	delete_from_page_cache(page);
 err_put_page:
 	put_page(page);
 	return vmf_error(ret);
@@ -134,7 +169,11 @@ static int secretmem_migratepage(struct address_space *mapping,

 static void secretmem_freepage(struct page *page)
 {
-	set_direct_map_default_noflush(page);
+	unsigned long addr = (unsigned long)page_address(page);
+	struct secretmem_ctx *ctx = (struct secretmem_ctx *)page_private(page);
+	struct gen_pool *pool = ctx->pool;
+
+	gen_pool_free(pool, addr, PAGE_SIZE);
 }

 static const struct address_space_operations secretmem_aops = {
@@ -159,13 +198,18 @@ static struct file *secretmem_file_create(unsigned long flags)
 	if (!ctx)
 		goto err_free_inode;

+	ctx->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
+	if (!ctx->pool)
+		goto err_free_ctx;
+
 	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
 				 O_RDWR, &secretmem_fops);
 	if (IS_ERR(file))
-		goto err_free_ctx;
+		goto err_free_pool;

 	mapping_set_unevictable(inode->i_mapping);

+	inode->i_private = ctx;
 	inode->i_mapping->private_data = ctx;
 	inode->i_mapping->a_ops = &secretmem_aops;

@@ -179,6 +223,8 @@ static struct file *secretmem_file_create(unsigned long flags)

 	return file;

+err_free_pool:
+	gen_pool_destroy(ctx->pool);
 err_free_ctx:
 	kfree(ctx);
 err_free_inode:
@@ -223,11 +269,34 @@ SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
 	return err;
 }

+static void secretmem_cleanup_chunk(struct gen_pool *pool,
+				    struct gen_pool_chunk *chunk, void *data)
+{
+	unsigned long start = chunk->start_addr;
+	unsigned long end = chunk->end_addr;
+	unsigned long nr_pages, addr;
+
+	nr_pages = (end - start + 1) / PAGE_SIZE;
+	__kernel_map_pages(virt_to_page(start), nr_pages, 1);
+
+	for (addr = start; addr < end; addr += PAGE_SIZE)
+		put_page(virt_to_page(addr));
+}
+
+static void secretmem_cleanup_pool(struct secretmem_ctx *ctx)
+{
+	struct gen_pool *pool = ctx->pool;
+
+	gen_pool_for_each_chunk(pool, secretmem_cleanup_chunk, ctx);
+	gen_pool_destroy(pool);
+}
+
 static void secretmem_evict_inode(struct inode *inode)
 {
 	struct secretmem_ctx *ctx = inode->i_private;

 	truncate_inode_pages_final(&inode->i_data);
+	secretmem_cleanup_pool(ctx);
 	clear_inode(inode);
 	kfree(ctx);
 }

From patchwork Mon Jul 27 16:29:34 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mike Rapoport
X-Patchwork-Id: 11687209
Return-Path:
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8897B13B6
	for ; Mon, 27 Jul 2020 16:30:44 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 6B0762083E
	for ; Mon, 27 Jul 2020 16:30:44 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=default; t=1595867444;
	bh=EtHjzgniWiPXpM15YNvo+5Gcb8cwPdBCUgOLFSAcRWs=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From;
	b=aNz1kuGY8MJrMNA2uEl34vuYMzos0u2XjC2bw5P4N2PG7xXEKSHUZAMvvUcvyeYlY
	 2J24ZnIBxVE4qJqZJT/cw0HmWVJqtn8rsE4UsDCxJHSHpoe8rr6ZAhOMYeaGzEUX98
	 rOQWtjUGU/MbREEcsQK9zWcd0ufAZOK529pxUMLg=
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1732195AbgG0Qak (ORCPT );
	Mon, 27 Jul 2020 12:30:40 -0400
Received: from mail.kernel.org ([198.145.29.99]:59508 "EHLO mail.kernel.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1732177AbgG0Qaj (ORCPT );
	Mon, 27 Jul 2020 12:30:39 -0400
Received: from aquarius.haifa.ibm.com (nesher1.haifa.il.ibm.com [195.110.40.7])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA256 (128/128 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPSA id 9245C20719;
	Mon, 27 Jul 2020 16:30:30 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=default; t=1595867438;
	bh=EtHjzgniWiPXpM15YNvo+5Gcb8cwPdBCUgOLFSAcRWs=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=2qowF2k3E6IzoSvyVCDxDhXiZ9mC/bdHex8EHHDxmfMbqtHJf8l1p2YMm3plWQvJO
	 PbvvxhREz9YtWMAp/pOSwI4fwwdSpRuXAkQTD1cwAj3N7QDNCCpICI/kO6Bg3kY1eC
	 KeJMAWyZ+Bm/2JItsoEJ60TG2Pw21K7f/I6RhTlU=
From: Mike Rapoport
To: linux-kernel@vger.kernel.org
Cc: Alexander Viro , Andrew Morton , Andy Lutomirski , Arnd Bergmann ,
	Borislav Petkov , Catalin Marinas , Christopher Lameter ,
	Dan Williams , Dave Hansen , Elena Reshetova , "H. Peter Anvin" ,
	Idan Yaniv , Ingo Molnar , James Bottomley , "Kirill A. Shutemov" ,
	Matthew Wilcox , Mike Rapoport , Mike Rapoport , Michael Kerrisk ,
	Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Thomas Gleixner ,
	Tycho Andersen , Will Deacon ,
	linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-nvdimm@lists.01.org,
	linux-riscv@lists.infradead.org, x86@kernel.org
Subject: [PATCH v2 6/7] mm: secretmem: add ability to reserve memory at boot
Date: Mon, 27 Jul 2020 19:29:34 +0300
Message-Id: <20200727162935.31714-7-rppt@kernel.org>
X-Mailer: git-send-email 2.26.2
In-Reply-To: <20200727162935.31714-1-rppt@kernel.org>
References: <20200727162935.31714-1-rppt@kernel.org>
MIME-Version: 1.0
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Mike Rapoport

Taking pages out from the direct map and bringing them back may create
undesired fragmentation and usage of the smaller pages in the direct
mapping of the physical memory.

This can be avoided if a significantly large area of the physical memory
would be reserved for secretmem purposes at boot time.

Add ability to reserve physical memory for secretmem at boot time using
"secretmem" kernel parameter and then use that reserved memory as a global
pool for secret memory needs.
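
As a rough userspace illustration of how the "secretmem" parameter value
is interpreted (not part of this patch): the size string is parsed with
an optional K/M/G suffix, and the reservation is PUD-aligned once twice
the requested size exceeds PUD_SIZE, PMD-aligned otherwise. The sim_*
names and the 2 MiB / 1 GiB sizes are assumptions for the example, and
sim_memparse() is a simplified stand-in for the kernel's memparse():

```c
#include <stdlib.h>

/* Assumed sizes for illustration: 2 MiB PMD and 1 GiB PUD mappings. */
#define SIM_PMD_SIZE (2ULL << 20)
#define SIM_PUD_SIZE (1ULL << 30)

/* Simplified stand-in for memparse(): a number with an optional K/M/G suffix. */
unsigned long long sim_memparse(const char *str)
{
	char *end;
	unsigned long long val = strtoull(str, &end, 0);

	switch (*end) {
	case 'G':
	case 'g':
		val <<= 10;
		/* fall through */
	case 'M':
	case 'm':
		val <<= 10;
		/* fall through */
	case 'K':
	case 'k':
		val <<= 10;
		break;
	default:
		break;
	}

	return val;
}

/* Alignment rule used when reserving: PUD-aligned if 2 * size > PUD_SIZE. */
unsigned long long sim_reserve_align(unsigned long long size)
{
	return (size * 2 > SIM_PUD_SIZE) ? SIM_PUD_SIZE : SIM_PMD_SIZE;
}
```

Under these assumptions "secretmem=64M" yields a PMD-aligned 64 MiB
reservation, while "secretmem=1G" yields a PUD-aligned one.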
Signed-off-by: Mike Rapoport
---
 mm/secretmem.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 126 insertions(+), 8 deletions(-)

diff --git a/mm/secretmem.c b/mm/secretmem.c
index da609701e10e..35616e3982a4 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -8,6 +8,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -30,6 +31,39 @@ struct secretmem_ctx {
 	unsigned int mode;
 };

+struct secretmem_pool {
+	struct gen_pool *pool;
+	unsigned long reserved_size;
+	void *reserved;
+};
+
+static struct secretmem_pool secretmem_pool;
+
+static struct page *secretmem_alloc_huge_page(gfp_t gfp)
+{
+	struct gen_pool *pool = secretmem_pool.pool;
+	unsigned long addr = 0;
+	struct page *page = NULL;
+
+	if (pool) {
+		if (gen_pool_avail(pool) < PMD_SIZE)
+			return NULL;
+
+		addr = gen_pool_alloc(pool, PMD_SIZE);
+		if (!addr)
+			return NULL;
+
+		page = virt_to_page(addr);
+	} else {
+		page = alloc_pages(gfp, PMD_PAGE_ORDER);
+
+		if (page)
+			split_page(page, PMD_PAGE_ORDER);
+	}
+
+	return page;
+}
+
 static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
 {
 	unsigned long nr_pages = (1 << PMD_PAGE_ORDER);
@@ -38,12 +72,11 @@ static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
 	struct page *page;
 	int err;

-	page = alloc_pages(gfp, PMD_PAGE_ORDER);
+	page = secretmem_alloc_huge_page(gfp);
 	if (!page)
 		return -ENOMEM;

 	addr = (unsigned long)page_address(page);
-	split_page(page, PMD_PAGE_ORDER);

 	err = gen_pool_add(pool, addr, PMD_SIZE, NUMA_NO_NODE);
 	if (err) {
@@ -269,11 +302,13 @@ SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
 	return err;
 }

-static void secretmem_cleanup_chunk(struct gen_pool *pool,
-				    struct gen_pool_chunk *chunk, void *data)
+static void secretmem_recycle_range(unsigned long start, unsigned long end)
+{
+	gen_pool_free(secretmem_pool.pool, start, PMD_SIZE);
+}
+
+static void secretmem_release_range(unsigned long start, unsigned long end)
 {
-	unsigned long start = chunk->start_addr;
-	unsigned long end = chunk->end_addr;
 	unsigned long nr_pages, addr;

 	nr_pages = (end - start + 1) / PAGE_SIZE;
@@ -283,6 +318,18 @@ static void secretmem_cleanup_chunk(struct gen_pool *pool,
 		put_page(virt_to_page(addr));
 }

+static void secretmem_cleanup_chunk(struct gen_pool *pool,
+				    struct gen_pool_chunk *chunk, void *data)
+{
+	unsigned long start = chunk->start_addr;
+	unsigned long end = chunk->end_addr;
+
+	if (secretmem_pool.pool)
+		secretmem_recycle_range(start, end);
+	else
+		secretmem_release_range(start, end);
+}
+
 static void secretmem_cleanup_pool(struct secretmem_ctx *ctx)
 {
 	struct gen_pool *pool = ctx->pool;
@@ -322,14 +369,85 @@ static struct file_system_type secretmem_fs = {
 	.kill_sb	= kill_anon_super,
 };

+static int secretmem_reserved_mem_init(void)
+{
+	struct gen_pool *pool;
+	struct page *page;
+	void *addr;
+	int err;
+
+	if (!secretmem_pool.reserved)
+		return 0;
+
+	pool = gen_pool_create(PMD_SHIFT, NUMA_NO_NODE);
+	if (!pool)
+		return -ENOMEM;
+
+	err = gen_pool_add(pool, (unsigned long)secretmem_pool.reserved,
+			   secretmem_pool.reserved_size, NUMA_NO_NODE);
+	if (err)
+		goto err_destroy_pool;
+
+	for (addr = secretmem_pool.reserved;
+	     addr < secretmem_pool.reserved + secretmem_pool.reserved_size;
+	     addr += PAGE_SIZE) {
+		page = virt_to_page(addr);
+		__ClearPageReserved(page);
+		set_page_count(page, 1);
+	}
+
+	secretmem_pool.pool = pool;
+	page = virt_to_page(secretmem_pool.reserved);
+	__kernel_map_pages(page, secretmem_pool.reserved_size / PAGE_SIZE, 0);
+	return 0;
+
+err_destroy_pool:
+	gen_pool_destroy(pool);
+	return err;
+}
+
 static int secretmem_init(void)
 {
-	int ret = 0;
+	int ret;
+
+	ret = secretmem_reserved_mem_init();
+	if (ret)
+		return ret;

 	secretmem_mnt = kern_mount(&secretmem_fs);
-	if (IS_ERR(secretmem_mnt))
+	if (IS_ERR(secretmem_mnt)) {
+		gen_pool_destroy(secretmem_pool.pool);
 		ret = PTR_ERR(secretmem_mnt);
+	}

 	return ret;
 }
 fs_initcall(secretmem_init);
+
+static int __init secretmem_setup(char *str)
+{
+	phys_addr_t align = PMD_SIZE;
+	unsigned long reserved_size;
+	void *reserved;
+
+	reserved_size = memparse(str, NULL);
+	if (!reserved_size)
+		return 0;
+
+	if (reserved_size * 2 > PUD_SIZE)
+		align = PUD_SIZE;
+
+	reserved = memblock_alloc(reserved_size, align);
+	if (!reserved) {
+		pr_err("failed to reserve %lu bytes\n", reserved_size);
+		return 0;
+	}
+
+	secretmem_pool.reserved_size = reserved_size;
+	secretmem_pool.reserved = reserved;
+
+	pr_info("reserved %luM\n", reserved_size >> 20);
+
+	return 1;
+}
+__setup("secretmem=", secretmem_setup);

From patchwork Mon Jul 27 16:29:35 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mike Rapoport
X-Patchwork-Id: 11687213
Return-Path:
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E20D5138C
	for ; Mon, 27 Jul 2020 16:30:51 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id CA3F0207BB
	for ; Mon, 27 Jul 2020 16:30:51 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=default; t=1595867451;
	bh=/Qw/NeLTpR/6aCJ04huD3Mp72u1Bfe6uWLUqSL0Zn2g=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From;
	b=a93ZidwD7GytuXcLRBf0t23QcyFLlc0A0njPmpLsH7a8yHAnKIMmDDfVv83whx/MU
	 60G+LO9dL0YS1fyRZ0hFx1mZWfucN+IDsXtY6SV/JqvQV8XmTEDMNBejou4KHRYaAq
	 B2wTD3sjw1PQLej2A+XN8o+xL8AiCB6hNm/0727s=
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1732229AbgG0Qas (ORCPT );
	Mon, 27 Jul 2020 12:30:48 -0400
Received: from mail.kernel.org ([198.145.29.99]:59694 "EHLO mail.kernel.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1729573AbgG0Qar (ORCPT );
	Mon, 27 Jul 2020 12:30:47 -0400
Received: from aquarius.haifa.ibm.com (nesher1.haifa.il.ibm.com [195.110.40.7])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-SHA256 (128/128 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPSA id 286AF2074F;
	Mon, 27 Jul 2020 16:30:38 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=default; t=1595867447;
	bh=/Qw/NeLTpR/6aCJ04huD3Mp72u1Bfe6uWLUqSL0Zn2g=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=KtM0Agb5HnPUkY+CbqIdR37o127nhQKSy5Dq+iFtbhAZ5rXOrCF/TCi/y9dVBPJgG
	 ESLcg0dYR67EEDk5MwOpPeS2zHrUXad+3m3RPRP3kklfQeu9q9pdGF6IS3dzaGLPnI
	 xswjDrT4XhuX8HarB2l6OUZFzMfxbAPY98Qdg+hQ=
From: Mike Rapoport
To: linux-kernel@vger.kernel.org
Cc: Alexander Viro , Andrew Morton , Andy Lutomirski , Arnd Bergmann ,
	Borislav Petkov , Catalin Marinas , Christopher Lameter ,
	Dan Williams , Dave Hansen , Elena Reshetova , "H. Peter Anvin" ,
	Idan Yaniv , Ingo Molnar , James Bottomley , "Kirill A. Shutemov" ,
	Matthew Wilcox , Mike Rapoport , Mike Rapoport , Michael Kerrisk ,
	Palmer Dabbelt , Paul Walmsley , Peter Zijlstra , Thomas Gleixner ,
	Tycho Andersen , Will Deacon ,
	linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-nvdimm@lists.01.org,
	linux-riscv@lists.infradead.org, x86@kernel.org
Subject: [PATCH v2 7/7] mm: secretmem: add ability to reserve memory at boot
Date: Mon, 27 Jul 2020 19:29:35 +0300
Message-Id: <20200727162935.31714-8-rppt@kernel.org>
X-Mailer: git-send-email 2.26.2
In-Reply-To: <20200727162935.31714-1-rppt@kernel.org>
References: <20200727162935.31714-1-rppt@kernel.org>
MIME-Version: 1.0
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Mike Rapoport

Taking pages out from the direct map and bringing them back may create
undesired fragmentation and usage of the smaller pages in the direct
mapping of the physical memory.

This can be avoided if a significantly large area of the physical memory
would be reserved for secretmem purposes at boot time.

Add ability to reserve physical memory for secretmem at boot time using
"secretmem" kernel parameter and then use that reserved memory as a global
pool for secret memory needs.

Signed-off-by: Mike Rapoport
---
 Documentation/admin-guide/kernel-parameters.txt | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..6f3c2f28160f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4548,6 +4548,10 @@
 			Format: integer between 0 and 10
 			Default is 0.

+	secretmem=n[KMG]
+			[KNL,BOOT] Reserve specified amount of memory to
+			back mappings of secret memory.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.