Message ID | 20200720092435.17469-4-rppt@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | mm: introduce secretmemfd system call to create "secret" memory areas |

On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> Introduce "secretmemfd" system call with the ability to create memory areas
> visible only in the context of the owning process and not mapped not only
> to other processes but in the kernel page tables as well.
>
> The user will create a file descriptor using the secretmemfd system call
> where flags supplied as a parameter to this system call will define the
> desired protection mode for the memory associated with that file
> descriptor. Currently there are two protection modes:
>
> * exclusive - the memory area is unmapped from the kernel direct map and it
>   is present only in the page tables of the owning mm.
> * uncached - the memory area is present only in the page tables of the
>   owning mm and it is mapped there as uncached.
>
> For instance, the following example will create an uncached mapping (error
> handling is omitted):
>
>         fd = secretmemfd(SECRETMEM_UNCACHED);
>         ftruncate(fd, MAP_SIZE);
>         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
>                    fd, 0);
>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>

I wonder if this should be more closely related to dmabuf file
descriptors, which are already used for a similar purpose: sharing
access to secret memory areas that are not visible to the OS but can be
shared with hardware through device drivers that can import a dmabuf
file descriptor.

      Arnd

On Mon, Jul 20, 2020 at 01:30:13PM +0200, Arnd Bergmann wrote:
> On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > Introduce "secretmemfd" system call with the ability to create memory areas
> > visible only in the context of the owning process and not mapped not only
> > to other processes but in the kernel page tables as well.
> >
> > The user will create a file descriptor using the secretmemfd system call
> > where flags supplied as a parameter to this system call will define the
> > desired protection mode for the memory associated with that file
> > descriptor. Currently there are two protection modes:
> >
> > * exclusive - the memory area is unmapped from the kernel direct map and it
> >   is present only in the page tables of the owning mm.
> > * uncached - the memory area is present only in the page tables of the
> >   owning mm and it is mapped there as uncached.
> >
> > For instance, the following example will create an uncached mapping (error
> > handling is omitted):
> >
> >         fd = secretmemfd(SECRETMEM_UNCACHED);
> >         ftruncate(fd, MAP_SIZE);
> >         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
> >                    fd, 0);
> >
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>
> I wonder if this should be more closely related to dmabuf file
> descriptors, which
> are already used for a similar purpose: sharing access to secret memory areas
> that are not visible to the OS but can be shared with hardware through device
> drivers that can import a dmabuf file descriptor.

TBH, I didn't think about dmabuf, but my understanding is that in this
case memory areas are not visible to the OS because they are in device
memory rather than normal RAM, and when a dmabuf is backed by normal
RAM, the memory is visible to the OS.

Did I miss anything?

>       Arnd

On Mon, Jul 20, 2020 at 4:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> On Mon, Jul 20, 2020 at 01:30:13PM +0200, Arnd Bergmann wrote:
> > On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > Introduce "secretmemfd" system call with the ability to create memory areas
> > > visible only in the context of the owning process and not mapped not only
> > > to other processes but in the kernel page tables as well.
> > >
> > > The user will create a file descriptor using the secretmemfd system call
> > > where flags supplied as a parameter to this system call will define the
> > > desired protection mode for the memory associated with that file
> > > descriptor. Currently there are two protection modes:
> > >
> > > * exclusive - the memory area is unmapped from the kernel direct map and it
> > >   is present only in the page tables of the owning mm.
> > > * uncached - the memory area is present only in the page tables of the
> > >   owning mm and it is mapped there as uncached.
> > >
> > > For instance, the following example will create an uncached mapping (error
> > > handling is omitted):
> > >
> > >         fd = secretmemfd(SECRETMEM_UNCACHED);
> > >         ftruncate(fd, MAP_SIZE);
> > >         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
> > >                    fd, 0);
> > >
> > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> >
> > I wonder if this should be more closely related to dmabuf file
> > descriptors, which
> > are already used for a similar purpose: sharing access to secret memory areas
> > that are not visible to the OS but can be shared with hardware through device
> > drivers that can import a dmabuf file descriptor.
>
> TBH, I didn't think about dmabuf, but my undestanding is that is this
> case memory areas are not visible to the OS because they are on device
> memory rather than normal RAM and when dmabuf is backed by the normal
> RAM, the memory is visible to the OS.

No, dmabuf is normally about normal RAM that is shared between multiple
devices, the idea is that you can have one driver allocate a buffer in RAM
and export it to user space through a file descriptor. The application can then
go and mmap() it or pass it into one or more other drivers.

This can be used e.g. for sharing a buffer between a video codec and the
gpu, or between a crypto engine and another device that accesses
unencrypted data while software can only observe the encrypted version.

      Arnd

On Mon, 2020-07-20 at 13:30 +0200, Arnd Bergmann wrote:
> On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org>
> wrote:
> >
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > Introduce "secretmemfd" system call with the ability to create
> > memory areas visible only in the context of the owning process and
> > not mapped not only to other processes but in the kernel page
> > tables as well.
> >
> > The user will create a file descriptor using the secretmemfd system
> > call where flags supplied as a parameter to this system call will
> > define the desired protection mode for the memory associated with
> > that file descriptor. Currently there are two protection modes:
> >
> > * exclusive - the memory area is unmapped from the kernel direct
> > map and it
> >   is present only in the page tables of the owning mm.
> > * uncached - the memory area is present only in the page tables of
> > the
> >   owning mm and it is mapped there as uncached.
> >
> > For instance, the following example will create an uncached mapping
> > (error handling is omitted):
> >
> >         fd = secretmemfd(SECRETMEM_UNCACHED);
> >         ftruncate(fd, MAP_SIZE);
> >         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
> > MAP_SHARED,
> >                    fd, 0);
> >
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>
> I wonder if this should be more closely related to dmabuf file
> descriptors, which are already used for a similar purpose: sharing
> access to secret memory areas that are not visible to the OS but can
> be shared with hardware through device drivers that can import a
> dmabuf file descriptor.

I'll assume you mean the dmabuf userspace API? Because the kernel API
is completely device exchange specific and wholly inappropriate for
this use case.

The user space API of dmabuf uses a pseudo-filesystem. So you mount
the dmabuf file type (and by "you" I mean root because an ordinary user
doesn't have sufficient privilege). This is basically because every
dmabuf is usable by any user who has permissions. This really isn't
the initial interface we want for secret memory because secret regions
are supposed to be per process and not shared (at least we don't want
other tenants to see who's using what).

Once you have the fd, you can seek to find the size, mmap, poll and
ioctl it. The ioctls are all to do with memory synchronization (as
you'd expect from a device backed region) and the mmap is handled by
the dma_buf_ops, which is device specific. Sizing is missing because
that's reported by the device not settable by the user.

What we want is the ability to get an fd, set the properties and the
size and mmap it. This is pretty much a 100% overlap with the memfd
API and not much overlap with the dmabuf one, which is why I don't
think the interface is very well suited.

James

On Mon, Jul 20, 2020 at 04:34:12PM +0200, Arnd Bergmann wrote:
> On Mon, Jul 20, 2020 at 4:21 PM Mike Rapoport <rppt@kernel.org> wrote:
> > On Mon, Jul 20, 2020 at 01:30:13PM +0200, Arnd Bergmann wrote:
> > > On Mon, Jul 20, 2020 at 11:25 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > >
> > > > Introduce "secretmemfd" system call with the ability to create memory areas
> > > > visible only in the context of the owning process and not mapped not only
> > > > to other processes but in the kernel page tables as well.
> > > >
> > > > The user will create a file descriptor using the secretmemfd system call
> > > > where flags supplied as a parameter to this system call will define the
> > > > desired protection mode for the memory associated with that file
> > > > descriptor. Currently there are two protection modes:
> > > >
> > > > * exclusive - the memory area is unmapped from the kernel direct map and it
> > > >   is present only in the page tables of the owning mm.
> > > > * uncached - the memory area is present only in the page tables of the
> > > >   owning mm and it is mapped there as uncached.
> > > >
> > > > For instance, the following example will create an uncached mapping (error
> > > > handling is omitted):
> > > >
> > > >         fd = secretmemfd(SECRETMEM_UNCACHED);
> > > >         ftruncate(fd, MAP_SIZE);
> > > >         ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
> > > >                    fd, 0);
> > > >
> > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > I wonder if this should be more closely related to dmabuf file
> > > descriptors, which
> > > are already used for a similar purpose: sharing access to secret memory areas
> > > that are not visible to the OS but can be shared with hardware through device
> > > drivers that can import a dmabuf file descriptor.
> >
> > TBH, I didn't think about dmabuf, but my undestanding is that is this
> > case memory areas are not visible to the OS because they are on device
> > memory rather than normal RAM and when dmabuf is backed by the normal
> > RAM, the memory is visible to the OS.
>
> No, dmabuf is normally about normal RAM that is shared between multiple
> devices, the idea is that you can have one driver allocate a buffer in RAM
> and export it to user space through a file descriptor. The application can then
> go and mmap() it or pass it into one or more other drivers.
>
> This can be used e.g. for sharing a buffer between a video codec and the
> gpu, or between a crypto engine and another device that accesses
> unencrypted data while software can only observe the encrypted version.

For our use case, sharing is optional on the one hand, and there are no
devices involved on the other.

As James pointed out, there is no match for the userspace API, and if a
use case emerges that requires integration of secretmem with dma-buf,
we'll deal with it then.

>       Arnd

On Mon, Jul 20, 2020 at 5:52 PM James Bottomley <jejb@linux.ibm.com> wrote:
> On Mon, 2020-07-20 at 13:30 +0200, Arnd Bergmann wrote:
>
> I'll assume you mean the dmabuf userspace API? Because the kernel API
> is completely device exchange specific and wholly inappropriate for
> this use case.
>
> The user space API of dmabuf uses a pseudo-filesystem. So you mount
> the dmabuf file type (and by "you" I mean root because an ordinary user
> doesn't have sufficient privilege). This is basically because every
> dmabuf is usable by any user who has permissions. This really isn't
> the initial interface we want for secret memory because secret regions
> are supposed to be per process and not shared (at least we don't want
> other tenants to see who's using what).
>
> Once you have the fd, you can seek to find the size, mmap, poll and
> ioctl it. The ioctls are all to do with memory synchronization (as
> you'd expect from a device backed region) and the mmap is handled by
> the dma_buf_ops, which is device specific. Sizing is missing because
> that's reported by the device not settable by the user.

I was mainly talking about the in-kernel interface that is used for
sharing a buffer with hardware. Aside from the limited ioctls, anything
in the kernel can decide on how it wants to export a dma_buf by calling
dma_buf_export()/dma_buf_fd(), which is roughly what the new syscall
does as well. Using dma_buf vs the proposed implementation for this is
not a big difference in complexity.

The one thing that a dma_buf does is that it allows devices to do DMA
on it. This is either something that can turn out to be useful later,
or it is not. From the description, it sounded like the sharing might
be useful, since we already have known use cases in which "secret" data
is exchanged with a trusted execution environment using the dma-buf
interface.

If there is no way the data stored in this new secret memory area would
relate to secret data in a TEE or some other hardware device, then I
agree that dma-buf has no value.

> What we want is the ability to get an fd, set the properties and the
> size and mmap it. This is pretty much a 100% overlap with the memfd
> API and not much overlap with the dmabuf one, which is why I don't
> think the interface is very well suited.

Does that mean you are suggesting to use additional flags on
memfd_create() instead of a new system call?

      Arnd

On Mon, 2020-07-20 at 20:08 +0200, Arnd Bergmann wrote:
> On Mon, Jul 20, 2020 at 5:52 PM James Bottomley <jejb@linux.ibm.com>
> wrote:
> > On Mon, 2020-07-20 at 13:30 +0200, Arnd Bergmann wrote:
> >
> > I'll assume you mean the dmabuf userspace API? Because the kernel
> > API is completely device exchange specific and wholly inappropriate
> > for this use case.
> >
> > The user space API of dmabuf uses a pseudo-filesystem. So you
> > mount the dmabuf file type (and by "you" I mean root because an
> > ordinary user doesn't have sufficient privilege). This is
> > basically because every dmabuf is usable by any user who has
> > permissions. This really isn't the initial interface we want for
> > secret memory because secret regions are supposed to be per process
> > and not shared (at least we don't want other tenants to see who's
> > using what).
> >
> > Once you have the fd, you can seek to find the size, mmap, poll and
> > ioctl it. The ioctls are all to do with memory synchronization (as
> > you'd expect from a device backed region) and the mmap is handled
> > by the dma_buf_ops, which is device specific. Sizing is missing
> > because that's reported by the device not settable by the user.
>
> I was mainly talking about the in-kernel interface that is used for
> sharing a buffer with hardware. Aside from the limited ioctls,
> anything in the kernel can decide on how it wants to export a dma_buf
> by calling dma_buf_export()/dma_buf_fd(), which is roughly what the
> new syscall does as well. Using dma_buf vs the proposed
> implementation for this is not a big difference in complexity.

I have thought about it, but haven't got much further: We can't couple
to SGX without a huge break in the current simple userspace API (it
becomes complex because you'd have to enter the enclave each time you
want to use the memory, or put the whole process in the enclave, which
is a bit of a nightmare for simplicity), and we could only couple it to
SEV if the memory encryption engine would respond to PCID as well as
ASID, which it doesn't.

> The one thing that a dma_buf does is that it allows devices to
> do DMA on it. This is either something that can turn out to be
> useful later, or it is not. From the description, it sounded like
> the sharing might be useful, since we already have known use
> cases in which "secret" data is exchanged with a trusted execution
> environment using the dma-buf interface.

The current use case for private keys is that you take an encrypted
file (which would be the DMA coupled part) and you decrypt the contents
into the secret memory. There might possibly be a DMA component later
where a HSM like device DMAs a key directly into your secret memory to
avoid exposure, but I wouldn't anticipate any need for anything beyond
the usual page cache API for that case (effectively this would behave
like an ordinary page cache page except that only the current process
would be able to touch the contents).

> If there is no way the data stored in this new secret memory area
> would relate to secret data in a TEE or some other hardware
> device, then I agree that dma-buf has no value.

Never say never, but current TEE designs tend to require full
confidentiality for the entire execution. What we're probing is
whether we can improve security by doing an API that requires less than
full confidentiality for the process. I think if the API proves useful
then we will get HW support for it, but it likely won't be in the
current TEE of today form.

> > What we want is the ability to get an fd, set the properties and
> > the size and mmap it. This is pretty much a 100% overlap with the
> > memfd API and not much overlap with the dmabuf one, which is why I
> > don't think the interface is very well suited.
>
> Does that mean you are suggesting to use additional flags on
> memfd_create() instead of a new system call?

Well, that was what the previous patch did. I'm agnostic on the
mechanism for obtaining the fd: new syscall as this patch does or
extension to memfd like the old one did. All I was saying is that once
you have the fd, the API you use on it is the same as the memfd API.

James
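
[Editorial sketch, not part of the thread.] James's point that the fd is driven "the same as the memfd API" can be illustrated with the existing memfd_create(2) call. This is only a sketch of the get-an-fd, size-it, mmap-it flow: memfd_create() stands in for the proposed secretmemfd() call, so the mapping here is ordinary memory and has none of the secrecy properties discussed above. The function name memfd_style_demo and the buffer contents are illustrative assumptions, and MAP_SIZE is chosen arbitrarily as in the patch description.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE 4096

/*
 * Walk through the memfd-style sequence: obtain an fd, set its size,
 * map it, touch it. Returns 0 on success, -1 on any failure.
 * memfd_create() is a stand-in for the proposed syscall.
 */
static int memfd_style_demo(void)
{
	static const char msg[] = "scratch data";
	int ret = -1;
	char *ptr;
	int fd;

	/* Step 1: get an fd (the proposed syscall would take mode flags). */
	fd = memfd_create("demo", MFD_CLOEXEC);
	if (fd < 0) {
		perror("memfd_create");
		return -1;
	}

	/* Step 2: the size is set by the user, not reported by a device. */
	if (ftruncate(fd, MAP_SIZE) < 0) {
		perror("ftruncate");
		goto out_close;
	}

	/* Step 3: map it into the calling process. */
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		perror("mmap");
		goto out_close;
	}

	/* Step 4: use the mapping like any other memory. */
	memcpy(ptr, msg, sizeof(msg));
	if (memcmp(ptr, msg, sizeof(msg)) == 0)
		ret = 0;

	munmap(ptr, MAP_SIZE);
out_close:
	close(fd);
	return ret;
}
```

Calling memfd_style_demo() from main() and checking for a zero return is enough to exercise the sequence; with the proposed interface only step 1 would differ.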

On Mon, Jul 20, 2020 at 9:16 PM James Bottomley <jejb@linux.ibm.com> wrote:
> On Mon, 2020-07-20 at 20:08 +0200, Arnd Bergmann wrote:
> > On Mon, Jul 20, 2020 at 5:52 PM James Bottomley <jejb@linux.ibm.com>
> >
> > If there is no way the data stored in this new secret memory area
> > would relate to secret data in a TEE or some other hardware
> > device, then I agree that dma-buf has no value.
>
> Never say never, but current TEE designs tend to require full
> confidentiality for the entire execution. What we're probing is
> whether we can improve security by doing an API that requires less than
> full confidentiality for the process. I think if the API proves useful
> then we will get HW support for it, but it likely won't be in the
> current TEE of today form.

As I understand it, you normally have two kinds of buffers for the TEE:
one that may be allocated by Linux but is owned by the TEE itself and
not accessible by any process, and one that is used for communication
between the TEE and a user process.

The sharing would clearly work only for the second type: data that a
process wants to share with the TEE but as little else as possible.

A hypothetical example might be a process that passes encrypted data to
the TEE (which holds the key) for decryption, receives decrypted data
and then consumes that data in its own address space. An electronic
voting system (I know, evil example) might receive encrypted ballots
and sum them up this way without itself having the secret key or
anything else being able to observe intermediate results.

> > > What we want is the ability to get an fd, set the properties and
> > > the size and mmap it. This is pretty much a 100% overlap with the
> > > memfd API and not much overlap with the dmabuf one, which is why I
> > > don't think the interface is very well suited.
> >
> > Does that mean you are suggesting to use additional flags on
> > memfd_create() instead of a new system call?
>
> Well, that was what the previous patch did. I'm agnostic on the
> mechanism for obtaining the fd: new syscall as this patch does or
> extension to memfd like the old one did. All I was saying is that once
> you have the fd, the API you use on it is the same as the memfd API.

Ok.

I suppose we could even retrofit dma-buf underneath the secretmemfd
implementation if it ends up being useful later on.

      Arnd

Hi Mike,

On Mon, 20 Jul 2020 at 11:26, Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> Introduce "secretmemfd" system call with the ability to create memory areas
> visible only in the context of the owning process and not mapped not only
> to other processes but in the kernel page tables as well.
>
> The user will create a file descriptor using the secretmemfd system call

Without wanting to start a bikeshed discussion, the more common
convention in recently added system calls is to use an underscore in
names that consist of multiple clearly distinct words. See many
examples in https://man7.org/linux/man-pages/man2/syscalls.2.html.

Thus, I'd suggest at least secret_memfd().

Also, I wonder whether memfd_secret() might not be even better.
There's plenty of precedent for the naming style where related APIs
share a common prefix [1].

Thanks,

Michael

[1] Some examples:

    epoll_create(2)
    epoll_create1(2)
    epoll_ctl(2)
    epoll_pwait(2)
    epoll_wait(2)

    mq_getsetattr(2)
    mq_notify(2)
    mq_open(2)
    mq_timedreceive(2)
    mq_timedsend(2)
    mq_unlink(2)

    sched_get_affinity(2)
    sched_get_priority_max(2)
    sched_get_priority_min(2)
    sched_getaffinity(2)
    sched_getattr(2)
    sched_getparam(2)
    sched_getscheduler(2)
    sched_rr_get_interval(2)
    sched_set_affinity(2)
    sched_setaffinity(2)
    sched_setattr(2)
    sched_setparam(2)
    sched_setscheduler(2)
    sched_yield(2)

    timer_create(2)
    timer_delete(2)
    timer_getoverrun(2)
    timer_gettime(2)
    timer_settime(2)

    timerfd_create(2)
    timerfd_gettime(2)
    timerfd_settime(2)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/secretmem.h b/include/uapi/linux/secretmem.h
new file mode 100644
index 000000000000..cef7a59f7492
--- /dev/null
+++ b/include/uapi/linux/secretmem.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SECRERTMEM_H
+#define _UAPI_LINUX_SECRERTMEM_H
+
+/* secretmem operation modes */
+#define SECRETMEM_EXCLUSIVE	0x1
+#define SECRETMEM_UNCACHED	0x2
+
+#endif /* _UAPI_LINUX_SECRERTMEM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index f2104cc0d35c..c5aa948214f9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,8 @@ config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP
+	select GENERIC_ALLOCATOR
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 6e9d46b2efc9..c2aa7a393b73 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..2f65219baf80
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,263 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/secretmem.h>
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+#define SECRETMEM_MODE_MASK	(SECRETMEM_EXCLUSIVE | SECRETMEM_UNCACHED)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+struct secretmem_ctx {
+	unsigned int mode;
+};
+
+static struct page *secretmem_alloc_page(gfp_t gfp)
+{
+	/*
+	 * FIXME: use a cache of large pages to reduce the direct map
+	 * fragmentation
+	 */
+	return alloc_page(gfp);
+}
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	unsigned long addr;
+	struct page *page;
+	int ret = 0;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+	page = find_get_entry(mapping, offset);
+	if (!page) {
+		page = secretmem_alloc_page(vmf->gfp_mask);
+		if (!page)
+			return vmf_error(-ENOMEM);
+
+		ret = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
+		if (unlikely(ret))
+			goto err_put_page;
+
+		ret = set_direct_map_invalid_noflush(page);
+		if (ret)
+			goto err_del_page_cache;
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+		__SetPageUptodate(page);
+
+		ret = VM_FAULT_LOCKED;
+	}
+
+	vmf->page = page;
+	return ret;
+
+err_del_page_cache:
+	delete_from_page_cache(page);
+err_put_page:
+	put_page(page);
+	return vmf_error(ret);
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct secretmem_ctx *ctx = file->private_data;
+	unsigned long mode = ctx->mode;
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if (!mode)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	switch (mode) {
+	case SECRETMEM_UNCACHED:
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		fallthrough;
+	case SECRETMEM_EXCLUSIVE:
+		vma->vm_ops = &secretmem_vm_ops;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	vma->vm_flags |= VM_LOCKED;
+
+	return 0;
+}
+
+const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct secretmem_ctx *ctx;
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto err_free_inode;
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_ctx;
+
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->private_data = ctx;
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	file->private_data = ctx;
+
+	ctx->mode = flags & SECRETMEM_MODE_MASK;
+
+	return file;
+
+err_free_ctx:
+	kfree(ctx);
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(secretmemfd, unsigned long, flags)
+{
+	struct file *file;
+	unsigned int mode;
+	int fd, err;
+
+	/* make sure local flags do not confict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	/* modes are mutually exclusive, only one mode bit should be set */
+	mode = flags & SECRETMEM_FLAGS_MASK;
+	if (ffs(mode) != fls(mode))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static void secretmem_evict_inode(struct inode *inode)
+{
+	struct secretmem_ctx *ctx = inode->i_private;
+
+	truncate_inode_pages_final(&inode->i_data);
+	clear_inode(inode);
+	kfree(ctx);
+}
+
+static const struct super_operations secretmem_super_ops = {
+	.evict_inode = secretmem_evict_inode,
+};
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
+
+	if (!ctx)
+		return -ENOMEM;
+	ctx->ops = &secretmem_super_ops;
+
+	return 0;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
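
[Editorial sketch, not part of the patch.] The flag validation in SYSCALL_DEFINE1(secretmemfd, ...) relies on ffs(mode) == fls(mode) holding only when at most one bit is set. A userspace approximation of that check might look like the following. The helpers lowest_bit() and highest_bit() are hypothetical stand-ins for the kernel's ffs() and fls(), built on the GCC/Clang builtins __builtin_ctz and __builtin_clz, and the SECRETMEM_* values are copied from the proposed uapi header rather than included from it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>

/* Values copied from the proposed include/uapi/linux/secretmem.h. */
#define SECRETMEM_EXCLUSIVE	0x1
#define SECRETMEM_UNCACHED	0x2
#define SECRETMEM_FLAGS_MASK	(SECRETMEM_EXCLUSIVE | SECRETMEM_UNCACHED)

/* 1-based index of the lowest set bit, 0 if no bit is set (like ffs()). */
static int lowest_bit(unsigned int x)
{
	return x ? __builtin_ctz(x) + 1 : 0;
}

/* 1-based index of the highest set bit, 0 if no bit is set (like fls()). */
static int highest_bit(unsigned int x)
{
	return x ? (int)(sizeof(x) * 8) - __builtin_clz(x) : 0;
}

/* Mirrors the two -EINVAL checks at the top of the proposed syscall. */
static bool secretmem_flags_valid(unsigned long flags)
{
	unsigned int mode;

	/* Reject any bit that is neither a mode bit nor O_CLOEXEC. */
	if (flags & ~(unsigned long)(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
		return false;

	/* Lowest and highest set bit coincide only for zero or one bit. */
	mode = flags & SECRETMEM_FLAGS_MASK;
	return lowest_bit(mode) == highest_bit(mode);
}
```

Note that, read this way, a flags value with no mode bit set passes the syscall-level check, since both helpers return 0; in the patch as posted, such a descriptor is rejected later, when secretmem_mmap() returns -EINVAL for a zero mode.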