[v17,07/10] mm: introduce memfd_secret system call to create "secret" memory areas

From: Mike Rapoport <rppt@linux.ibm.com>

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.

The secretmem feature is off by default and the user must explicitly enable
it at the boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the owning mm.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
paves the way for VMMs to remove the secret memory range from the process;
there may be situations where sharing is useful and file descriptor based
approach allows to seal the operations.

As secret memory implementation is not an extension of tmpfs or hugetlbfs,
usage of a dedicated system call rather than hooking new functionality into
memfd_create(2) emphasises that memfd_secret(2) has different semantics and
allows better upwards compatibility.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there will be a use case that will require exposing secretmem to the
kernel it will be an opt-in request in the system call flags so that user
would have to decide what data can be exposed to the kernel.

Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which affects
the system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

The secretmem mappings are locked in memory so they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
mlock() secretmem range would fail and mlockall() will ignore secretmem
mappings.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
---
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 246 +++++++++++++++++++++++++++++++++++++
 8 files changed, 289 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

Message ID	20210208084920.2884-8-rppt@kernel.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.6 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E38C5C433E6 for <linux-fsdevel@archiver.kernel.org>; Mon, 8 Feb 2021 08:55:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8C8E664E83 for <linux-fsdevel@archiver.kernel.org>; Mon, 8 Feb 2021 08:55:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230328AbhBHIzE (ORCPT <rfc822;linux-fsdevel@archiver.kernel.org>); Mon, 8 Feb 2021 03:55:04 -0500 Received: from mail.kernel.org ([198.145.29.99]:43390 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230326AbhBHIwm (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>); Mon, 8 Feb 2021 03:52:42 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 620D564E9C; Mon, 8 Feb 2021 08:50:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1612774259; bh=9vRyfrkC+zPVa5AhBC70BL0UKpbaJeLEZmTkfIa5kr0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=hNifm1AR6qbMpEfZqFo0SILgFLpLufWzFoh8gk+WJifHRd0UfdukmanGz7sC3U5E1 fr8jcdLjACqBjvaIyq38O9PQqzu61FKmclwX3CndBQml+MuQp+vBxw6x19J7lf/8ot Fcvoc3RYhGjLUOmwdqDG5gAGwaUnFh52PCjIowerVU85VAgfjkkbeoTr7jlU0iWjsV u58RPyGp7Ihcm7ya+s9wc8YD++KVhRRNaov9/H1RjgKguysX+XVT2lrwBUCXhqS8oG LEawgG6K0fOKd8R/UgBZAuM6XwKyEY7swRjXchgvux/VOa7YAMGIJba9+0lX+smjbv TI61i1JIgPe0w== From: Mike Rapoport <rppt@kernel.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Andy Lutomirski <luto@kernel.org>, Arnd Bergmann <arnd@arndb.de>, Borislav Petkov <bp@alien8.de>, Catalin Marinas <catalin.marinas@arm.com>, Christopher Lameter <cl@linux.com>, Dan Williams <dan.j.williams@intel.com>, Dave Hansen <dave.hansen@linux.intel.com>, David Hildenbrand <david@redhat.com>, Elena Reshetova <elena.reshetova@intel.com>, "H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>, James Bottomley <jejb@linux.ibm.com>, "Kirill A. Shutemov" <kirill@shutemov.name>, Matthew Wilcox <willy@infradead.org>, Mark Rutland <mark.rutland@arm.com>, Michal Hocko <mhocko@suse.com>, Mike Rapoport <rppt@linux.ibm.com>, Mike Rapoport <rppt@kernel.org>, Michael Kerrisk <mtk.manpages@gmail.com>, Palmer Dabbelt <palmer@dabbelt.com>, Paul Walmsley <paul.walmsley@sifive.com>, Peter Zijlstra <peterz@infradead.org>, Rick Edgecombe <rick.p.edgecombe@intel.com>, Roman Gushchin <guro@fb.com>, Shakeel Butt <shakeelb@google.com>, Shuah Khan <shuah@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Tycho Andersen <tycho@tycho.ws>, Will Deacon <will@kernel.org>, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org, Hagen Paul Pfeifer <hagen@jauu.net>, Palmer Dabbelt <palmerdabbelt@google.com> Subject: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas Date: Mon, 8 Feb 2021 10:49:17 +0200 Message-Id: <20210208084920.2884-8-rppt@kernel.org> X-Mailer: git-send-email 2.28.0 In-Reply-To: <20210208084920.2884-1-rppt@kernel.org> References: <20210208084920.2884-1-rppt@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-fsdevel.vger.kernel.org> X-Mailing-List: linux-fsdevel@vger.kernel.org
Series	mm: introduce memfd_secret system call to create "secret" memory areas \| expand [v17,00/10] mm: introduce memfd_secret system call to create "secret" memory areas [v17,01/10] mm: add definition of PMD_PAGE_ORDER [v17,02/10] mmap: make mlock_future_check() global [v17,03/10] riscv/Kconfig: make direct map manipulation options depend on MMU [v17,04/10] set_memory: allow set_direct_map__noflush() for multiple pages [v17,05/10] set_memory: allow querying whether set_direct_map_() is actually enabled [v17,06/10] arm64: kfence: fix header inclusion [v17,07/10] mm: introduce memfd_secret system call to create "secret" memory areas [v17,08/10] PM: hibernate: disable when there are active secretmem users [v17,09/10] arch, mm: wire up memfd_secret system call where relevant [v17,10/10] secretmem: test: add basic selftest for memfd_secret(2)

[v17,07/10] mm: introduce memfd_secret system call to create "secret" memory areas

Commit Message

Comments

Patch