[v19,5/8] mm: introduce memfd_secret system call to create "secret" memory areas

From: Mike Rapoport <rppt@linux.ibm.com>

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.

The secretmem feature is off by default and the user must explicitly enable
it at the boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise(). File
descriptor approach allows explict and controlled sharing of the memory
areas, it allows to seal the operations. Besides, file descriptor based
memory paves the way for VMMs to remove the secret memory range from the
userpace hipervisor process, for instance QEMU. Andy Lutomirski says:

  "Getting fd-backed memory into a guest will take some possibly major work
   in the kernel, but getting vma-backed memory into a guest without
   mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extention to
memfd_create() because it's purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to the
memory. Nowadays a new system call cost is negligible while it is way
simpler for userspace to deal with a clear-cut system calls than with a
multiplexer or an overloaded syscall. Moreover, the initial implementation
of memfd_secret() is completely distinct from memfd_create() so there is no
much sense in overloading memfd_create() to begin with. If there will be a
need for code sharing between these implementation it can be easily
achieved without a need to adjust user visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there will be a use case that will require exposing secretmem to the
kernel it will be an opt-in request in the system call flags so that user
would have to decide what data can be exposed to the kernel.

Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which affects
the system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked independently from
mlock(), an attempt to mlock()/munlock() secretmem range would fail and
mlockall()/munlockall() will ignore secretmem mappings.

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space. With default limits, there is no excessive use of secretmem
and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in
the future this should be addressed to allow balanced use of large amounts
of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 drivers/char/mem.c         |   4 +
 include/linux/secretmem.h  |  48 ++++++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   4 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  12 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 239 +++++++++++++++++++++++++++++++++++++
 9 files changed, 313 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

Message ID	20210513184734.29317-6-rppt@kernel.org (mailing list archive)
State	Superseded
Headers	show Return-Path: <SRS0=975I=KI=lists.01.org=linux-nvdimm-bounces@kernel.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E2753C433B4 for <linux-nvdimm@archiver.kernel.org>; Thu, 13 May 2021 18:49:00 +0000 (UTC) Received: from ml01.01.org (ml01.01.org [198.145.21.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 9FF2D613DE for <linux-nvdimm@archiver.kernel.org>; Thu, 13 May 2021 18:49:00 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9FF2D613DE Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-nvdimm-bounces@lists.01.org Received: from ml01.vlan13.01.org (localhost [IPv6:::1]) by ml01.01.org (Postfix) with ESMTP id 7B3F0100EAB78; Thu, 13 May 2021 11:49:00 -0700 (PDT) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=198.145.29.99; helo=mail.kernel.org; envelope-from=rppt@kernel.org; receiver=<UNKNOWN> Received: from mail.kernel.org (unknown [198.145.29.99]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 58FFF100EAB72 for <linux-nvdimm@lists.01.org>; Thu, 13 May 2021 11:48:58 -0700 (PDT) Received: by mail.kernel.org (Postfix) with ESMTPSA id 2003461404; Thu, 13 May 2021 18:48:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1620931728; bh=VtJAYfyrc3gxC/XvOCaw01neY2bi3oqkVkh87QwQCkY=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=vM6iH+Gl39bM4/FopxJ84lTTSyi2wAjOhgInqaoCIpLZ9+5/7yJ8FkKeQeWoJKMiT tf1c94BKgmx4Xs3r90wqw25CxsH2B7+VsBMqFr0LVC2Y4rXIrEvy1VQ05zAhOet/c7 2eZeoTa8gWbqBHDrWZAsRmgk6vR+eHOymx2CW71jYl7XmIGBPnDFcOXEimTXAQv2Z/ zArzIyBzaJ+ZVRHHvKZrmzTUrbtYhZ++20IUf8kSqVW2qg+mVVoC5CyhGNy/wPux6u +GsoJifvOPJuA5N7Ub7u333FACy4oARfbTLqUDLNAwO98LvzMcYkgne7TxG/JTOtZU yJvNs23pwaGtA== From: Mike Rapoport <rppt@kernel.org> To: Andrew Morton <akpm@linux-foundation.org> Subject: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas Date: Thu, 13 May 2021 21:47:31 +0300 Message-Id: <20210513184734.29317-6-rppt@kernel.org> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210513184734.29317-1-rppt@kernel.org> References: <20210513184734.29317-1-rppt@kernel.org> MIME-Version: 1.0 Message-ID-Hash: CCR46FDN6U67HF5RHTHK4KD36FXRG7T2 X-Message-ID-Hash: CCR46FDN6U67HF5RHTHK4KD36FXRG7T2 X-MailFrom: rppt@kernel.org X-Mailman-Rule-Hits: nonmember-moderation X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation CC: Alexander Viro <viro@zeniv.linux.org.uk>, Andy Lutomirski <luto@kernel.org>, Arnd Bergmann <arnd@arndb.de>, Borislav Petkov <bp@alien8.de>, Catalin Marinas <catalin.marinas@arm.com>, Christopher Lameter <cl@linux.com>, Dave Hansen <dave.hansen@linux.intel.com>, David Hildenbrand <david@redhat.com>, Elena Reshetova <elena.reshetova@intel.com>, "H. Peter Anvin" <hpa@zytor.com>, Hagen Paul Pfeifer <hagen@jauu.net>, Ingo Molnar <mingo@redhat.com>, James Bottomley <jejb@linux.ibm.com>, Kees Cook <keescook@chromium.org>, "Kirill A. Shutemov" <kirill@shutemov.name>, Matthew Wilcox <willy@infradead.org>, Matthew Garrett <mjg59@srcf.ucam.org>, Mark Rutland <mark.rutland@arm.com>, Michal Hocko <mhocko@suse.com>, Mike Rapoport <rppt@linux.ibm.com>, Mike Rapoport <rppt@kernel.org>, Michael Kerrisk <mtk.manpages@gmail.com>, Palmer Dabbelt <palmer@dabbelt.com>, Palmer Dabbelt <palmerdabbelt@google.com>, Paul Walmsley <paul.walmsley@sifive.com>, Peter Zijlstra <peterz@infradead.org>, "Rafael J. Wysocki" <rjw@rjwysocki.net>, Rick Edgecombe <rick.p.edgecombe@intel.com>, Roman Gushchin <guro@fb.com>, Shakeel Butt <shakeelb@google.com>, Shuah Khan <shuah@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Tycho Andersen <tycho@tycho.ws>, Will Deacon <will@kernel.org>, Yury Norov <yury.norov@gmail.com>, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org X-Mailman-Version: 3.1.1 Precedence: list List-Id: "Linux-nvdimm developer list." <linux-nvdimm.lists.01.org> Archived-At: <https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/CCR46FDN6U67HF5RHTHK4KD36FXRG7T2/> List-Archive: <https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/> List-Help: <mailto:linux-nvdimm-request@lists.01.org?subject=help> List-Post: <mailto:linux-nvdimm@lists.01.org> List-Subscribe: <mailto:linux-nvdimm-join@lists.01.org> List-Unsubscribe: <mailto:linux-nvdimm-leave@lists.01.org> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit
Series	mm: introduce memfd_secret system call to create "secret" memory areas \| expand [v19,0/8] mm: introduce memfd_secret system call to create "secret" memory areas [v19,1/8] mmap: make mlock_future_check() global [v19,2/8] riscv/Kconfig: make direct map manipulation options depend on MMU [v19,3/8] set_memory: allow set_direct_map__noflush() for multiple pages [v19,4/8] set_memory: allow querying whether set_direct_map_() is actually enabled [v19,5/8] mm: introduce memfd_secret system call to create "secret" memory areas [v19,6/8] PM: hibernate: disable when there are active secretmem users [v19,7/8] arch, mm: wire up memfd_secret system call where relevant [v19,8/8] secretmem: test: add basic selftest for memfd_secret(2)

[v19,5/8] mm: introduce memfd_secret system call to create "secret" memory areas

Commit Message

Comments

Patch