Message ID | 1572171452-7958-2-git-send-email-rppt@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings |

On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
>
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.

I'm probably blind, but I don't see where you manipulate the direct map...

On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > the owning process and can be used by applications to store secret
> > information that will not be visible not only to other processes but to the
> > kernel as well.
> >
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
>
> I probably blind, but I don't see where you manipulate direct map...

__get_user_pages() calls __set_page_user_exclusive() which in turn calls set_direct_map_invalid_noflush() that makes the page not present.

> --
> Kirill A. Shutemov

On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > the owning process and can be used by applications to store secret
> > > information that will not be visible not only to other processes but to the
> > > kernel as well.
> > >
> > > The pages in these mappings are removed from the kernel direct map and
> > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > the pages are mapped back into the direct map.
> >
> > I probably blind, but I don't see where you manipulate direct map...
>
> __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> set_direct_map_invalid_noflush() that makes the page not present.

Ah. Okay.

I think active use of this feature will lead to performance degradation of the system over time.

Setting a single 4k page non-present in the direct mapping will require splitting the 2M or 1G page we usually map the direct mapping with. And it's a one-way road. We don't have any mechanism to map the memory with a huge page again after the application has freed the page.

It might be okay if all these pages cluster together, but I don't think we have a way to achieve it easily.
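
To put rough numbers on the splitting cost described above, here is a minimal sketch. It assumes x86-64 paging where each page-table level holds 512 entries, so one 1G mapping splits into 512 2M entries and one 2M mapping splits into 512 4K entries. The helper name `leaf_entries_after_split` is hypothetical and exists only for this illustration; it is not kernel code.

```c
#include <assert.h>

/* Assumption: x86-64, 512 entries per page-table level. */
#define ENTRIES_PER_TABLE 512UL

/*
 * Hypothetical helper: number of leaf page-table entries needed to cover
 * one 1G region of the direct map after 4K holes have been punched into
 * `regions_2m_hit` distinct 2M sub-regions.
 */
unsigned long leaf_entries_after_split(unsigned long regions_2m_hit)
{
    assert(regions_2m_hit <= ENTRIES_PER_TABLE);

    if (regions_2m_hit == 0)
        return 1; /* still one 1G leaf entry */

    /* Untouched 2M regions keep a single 2M leaf entry each. */
    unsigned long intact_2m = ENTRIES_PER_TABLE - regions_2m_hit;

    /* Every touched 2M region is shattered into 512 4K entries. */
    return intact_2m + regions_2m_hit * ENTRIES_PER_TABLE;
}
```

Under these assumptions, carving a single 4K page out of a 1G mapping turns one leaf entry into 1023 (511 remaining 2M entries plus 512 4K entries), and nothing in the scheme under discussion merges them back.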

On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:
> I think active use of this feature will lead to performance degradation of
> the system with time.
>
> Setting a single 4k page non-present in the direct mapping will require
> splitting 2M or 1G page we usually map direct mapping with. And it's one
> way road. We don't have any mechanism to map the memory with huge page
> again after the application has freed the page.

Right, we recently had a 'bug' where ftrace triggered something like this and Facebook ran into it as a performance regression. So yes, this is a real concern.
On 27.10.19 11:17, Mike Rapoport wrote: > From: Mike Rapoport <rppt@linux.ibm.com> > > The mappings created with MAP_EXCLUSIVE are visible only in the context of > the owning process and can be used by applications to store secret > information that will not be visible not only to other processes but to the > kernel as well. > > The pages in these mappings are removed from the kernel direct map and > marked with PG_user_exclusive flag. When the exclusive area is unmapped, > the pages are mapped back into the direct map. > > The MAP_EXCLUSIVE flag implies MAP_POPULATE and MAP_LOCKED. > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > --- > arch/x86/mm/fault.c | 14 ++++++++++ > fs/proc/task_mmu.c | 1 + > include/linux/mm.h | 9 +++++++ > include/linux/page-flags.h | 7 +++++ > include/linux/page_excl.h | 49 ++++++++++++++++++++++++++++++++++ > include/trace/events/mmflags.h | 9 ++++++- > include/uapi/asm-generic/mman-common.h | 1 + > kernel/fork.c | 3 ++- > mm/Kconfig | 3 +++ > mm/gup.c | 8 ++++++ > mm/memory.c | 3 +++ > mm/mmap.c | 16 +++++++++++ > mm/page_alloc.c | 5 ++++ > 13 files changed, 126 insertions(+), 2 deletions(-) > create mode 100644 include/linux/page_excl.h > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 9ceacd1..8f73a75 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -17,6 +17,7 @@ > #include <linux/context_tracking.h> /* exception_enter(), ... */ > #include <linux/uaccess.h> /* faulthandler_disabled() */ > #include <linux/efi.h> /* efi_recover_from_page_fault()*/ > +#include <linux/page_excl.h> /* page_is_user_exclusive() */ > #include <linux/mm_types.h> > > #include <asm/cpufeature.h> /* boot_cpu_has, ... 
*/ > @@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address) > return address >= TASK_SIZE_MAX; > } > > +static bool fault_in_user_exclusive_page(unsigned long address) > +{ > + struct page *page = virt_to_page(address); > + > + return page_is_user_exclusive(page); > +} > + > /* > * Called for all faults where 'address' is part of the kernel address > * space. Might get called for faults that originate from *code* that > @@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, > if (spurious_kernel_fault(hw_error_code, address)) > return; > > + /* FIXME: warn and handle gracefully */ > + if (unlikely(fault_in_user_exclusive_page(address))) { > + pr_err("page fault in user exclusive page at %lx", address); > + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address); > + } > + > /* kprobes don't want to hook the spurious faults: */ > if (kprobe_page_fault(regs, X86_TRAP_PF)) > return; > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 9442631..99e14d1 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -655,6 +655,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) > #ifdef CONFIG_X86_INTEL_MPX > [ilog2(VM_MPX)] = "mp", > #endif > + [ilog2(VM_EXCLUSIVE)] = "xl", > [ilog2(VM_LOCKED)] = "lo", > [ilog2(VM_IO)] = "io", > [ilog2(VM_SEQ_READ)] = "sr", > diff --git a/include/linux/mm.h b/include/linux/mm.h > index cc29227..9c43375 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp); > #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ > #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ > #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ > +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ > #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) > #define VM_HIGH_ARCH_1 
BIT(VM_HIGH_ARCH_BIT_1) > #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) > #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) > #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) > +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) > #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ > > #ifdef CONFIG_ARCH_HAS_PKEYS > @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp); > # define VM_MPX VM_NONE > #endif > > +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS > +# define VM_EXCLUSIVE VM_HIGH_ARCH_5 > +#else > +# define VM_EXCLUSIVE VM_NONE > +#endif > + > #ifndef VM_GROWSUP > # define VM_GROWSUP VM_NONE > #endif > @@ -2594,6 +2602,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, > #define FOLL_ANON 0x8000 /* don't do file mappings */ > #define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */ > #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ > +#define FOLL_EXCLUSIVE 0x40000 /* mapping is exclusive to owning mm */ > > /* > * NOTE on FOLL_LONGTERM: > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index f91cb88..32d0aee 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -131,6 +131,9 @@ enum pageflags { > PG_young, > PG_idle, > #endif > +#if defined(CONFIG_EXCLUSIVE_USER_PAGES) > + PG_user_exclusive, > +#endif Last time I tried to introduce a new page flag I learned that this is very much frowned upon. Best you can usually do is reuse another flag - if valid in that context.

On 10/27/19 3:17 AM, Mike Rapoport wrote:
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.

This looks fun. It's certainly simple.

But, the description is not really calling out the pros and cons very well. I'm also not sure that folks will use an interface like this that requires up-front, special code to do an allocation instead of something like madvise(). That's why protection keys ended up the way they did: if you do this as a mmap() replacement, you need to modify all *allocators* to be enabled for this. If you do it mprotect()-style, you can apply it to existing allocations.

Some other random thoughts:

* The page flag is probably not a good idea. It would probably be better to set _PAGE_SPECIAL on the PTE and force get_user_pages() into the slow path.
* This really stops being "normal" memory. You can't do futexes on it, can't splice it. Probably need a more fleshed-out list of incompatible features.
* As Kirill noted, each 4k page ends up with a potential 1GB "blast radius" of demoted pages in the direct map. Not cool. This is probably a non-starter as it stands.
* The global TLB flushes are going to eat you alive. They probably border on a DoS on larger systems.
* Do we really want this user interface to dictate the kernel implementation? In other words, do we really want MAP_EXCLUSIVE, or do we want MAP_SECRET? One tells the kernel what to *do*, the other tells the kernel what the memory *IS*.
* There's a lot of other stuff going on in this area: XPFO, SEV, MKTME, Persistent Memory, where the kernel direct map is a liability in some way. We probably need some kind of overall, architected solution rather than five or ten things all poking at the direct map.

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> On 10/27/19 3:17 AM, Mike Rapoport wrote:
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
>
> This looks fun. It's certainly simple.
>
> But, the description is not really calling out the pros and cons very
> well. I'm also not sure that folks will use an interface like this that
> requires up-front, special code to do an allocation instead of something
> like madvise(). That's why protection keys ended up the way it did: if
> you do this as a mmap() replacement, you need to modify all *allocators*
> to be enabled for this. If you do it with mprotect()-style, you can
> apply it to existing allocations.
>
> Some other random thoughts:
>
> * The page flag is probably not a good idea. It would be probably
>   better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>   into the slow path.
> * This really stops being "normal" memory. You can't do futexes on it,
>   cant splice it. Probably need a more fleshed-out list of
>   incompatible features.
> * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>   radius" of demoted pages in the direct map. Not cool. This is
>   probably a non-starter as it stands.
> * The global TLB flushes are going to eat you alive. They probably
>   border on a DoS on larger systems.
> * Do we really want this user interface to dictate the kernel
>   implementation? In other words, do we really want MAP_EXCLUSIVE,
>   or do we want MAP_SECRET? One tells the kernel what do *do*, the
>   other tells the kernel what the memory *IS*.

If we go that route, maybe MAP_USER_SECRET so that there's wiggle room in the event that there are different secret keepers that require different implementations in the kernel? E.g. MAP_GUEST_SECRET for a KVM guest to take the userspace VMM (Qemu) out of the TCB, i.e. the mapping would be accessible by the kernel (or just KVM?) and the KVM guest, but not userspace.

> * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>   Persistent Memory, where the kernel direct map is a liability in some
>   way. We probably need some kind of overall, architected solution
>   rather than five or ten things all poking at the direct map.
>

On Sun, Oct 27, 2019 at 3:17 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
>
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.
>
> The MAP_EXCLUSIVE flag implies MAP_POPULATE and MAP_LOCKED.
>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/x86/mm/fault.c                    | 14 ++++++++++
>  fs/proc/task_mmu.c                     |  1 +
>  include/linux/mm.h                     |  9 +++++++
>  include/linux/page-flags.h             |  7 +++++
>  include/linux/page_excl.h              | 49 ++++++++++++++++++++++++++++++++++
>  include/trace/events/mmflags.h         |  9 ++++++-
>  include/uapi/asm-generic/mman-common.h |  1 +
>  kernel/fork.c                          |  3 ++-
>  mm/Kconfig                             |  3 +++
>  mm/gup.c                               |  8 ++++++
>  mm/memory.c                            |  3 +++
>  mm/mmap.c                              | 16 +++++++++++
>  mm/page_alloc.c                        |  5 ++++
>  13 files changed, 126 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/page_excl.h
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ceacd1..8f73a75 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -17,6 +17,7 @@
>  #include <linux/context_tracking.h> /* exception_enter(), ... */
>  #include <linux/uaccess.h> /* faulthandler_disabled() */
>  #include <linux/efi.h> /* efi_recover_from_page_fault()*/
> +#include <linux/page_excl.h> /* page_is_user_exclusive() */
>  #include <linux/mm_types.h>
>
>  #include <asm/cpufeature.h> /* boot_cpu_has, ... */
> @@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address)
>  	return address >= TASK_SIZE_MAX;
>  }
>
> +static bool fault_in_user_exclusive_page(unsigned long address)
> +{
> +	struct page *page = virt_to_page(address);
> +
> +	return page_is_user_exclusive(page);
> +}
> +
>  /*
>   * Called for all faults where 'address' is part of the kernel address
>   * space. Might get called for faults that originate from *code* that
> @@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>  	if (spurious_kernel_fault(hw_error_code, address))
>  		return;
>
> +	/* FIXME: warn and handle gracefully */
> +	if (unlikely(fault_in_user_exclusive_page(address))) {
> +		pr_err("page fault in user exclusive page at %lx", address);
> +		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
> +	}

Sending a signal here is not a reasonable thing to do in response to an unexpected kernel fault. You need to OOPS. Printing a nice message would be nice.

--Andy

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> Some other random thoughts:
>
> * The page flag is probably not a good idea. It would be probably
>   better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>   into the slow path.
> * This really stops being "normal" memory. You can't do futexes on it,
>   cant splice it. Probably need a more fleshed-out list of
>   incompatible features.
> * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>   radius" of demoted pages in the direct map. Not cool. This is
>   probably a non-starter as it stands.
> * The global TLB flushes are going to eat you alive. They probably
>   border on a DoS on larger systems.
> * Do we really want this user interface to dictate the kernel
>   implementation? In other words, do we really want MAP_EXCLUSIVE,
>   or do we want MAP_SECRET? One tells the kernel what do *do*, the
>   other tells the kernel what the memory *IS*.
> * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>   Persistent Memory, where the kernel direct map is a liability in some
>   way. We probably need some kind of overall, architected solution
>   rather than five or ten things all poking at the direct map.

Another random set of thoughts:

- Should devices be permitted to DMA to/from MAP_SECRET pages?
- How about GUP?
- Can I ptrace my way into another process's secret pages?
- What if I splice() the page into a pipe?

On Mon, 2019-10-28 at 14:55 +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:
>
> > I think active use of this feature will lead to performance degradation of
> > the system with time.
> >
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
>
> Right, we recently had a 'bug' where ftrace triggered something like
> this and facebook ran into it as a performance regression. So yes, this
> is a real concern.

Don't e/cBPF filters also break the direct map down to 4k pages when calling set_memory_ro() on the filter for 64-bit x86 and arm?

I've been wondering if the page allocator should make some effort to find an already broken-down page for anything that is known in advance to have its direct map permissions changed (or if it already groups them somehow). I also wonder why any potential slowdown from 4k pages in the direct map hasn't been noticed for apps that do a lot of insertions and removals of BPF filters, if this is indeed the case.

On Mon, Oct 28, 2019 at 07:59:25PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2019-10-28 at 14:55 +0100, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote:
> >
> > > I think active use of this feature will lead to performance degradation of
> > > the system with time.
> > >
> > > Setting a single 4k page non-present in the direct mapping will require
> > > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > > way road. We don't have any mechanism to map the memory with huge page
> > > again after the application has freed the page.
> >
> > Right, we recently had a 'bug' where ftrace triggered something like
> > this and facebook ran into it as a performance regression. So yes, this
> > is a real concern.
>
> Don't e/cBPF filters also break the direct map down to 4k pages when calling
> set_memory_ro() on the filter for 64 bit x86 and arm?
>
> I've been wondering if the page allocator should make some effort to find a
> broken down page for anything that can be known will have direct map permissions
> changed (or if it already groups them somehow). But also, why any potential
> slowdown of 4k pages on the direct map hasn't been noticed for apps that do a
> lot of insertions and removals of BPF filters, if this is indeed the case.

That should be limited to the module range. Random data maps could shatter the world.
On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote: > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote: > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote: > > > > From: Mike Rapoport <rppt@linux.ibm.com> > > > > > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of > > > > the owning process and can be used by applications to store secret > > > > information that will not be visible not only to other processes but to the > > > > kernel as well. > > > > > > > > The pages in these mappings are removed from the kernel direct map and > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped, > > > > the pages are mapped back into the direct map. > > > > > > I probably blind, but I don't see where you manipulate direct map... > > > > __get_user_pages() calls __set_page_user_exclusive() which in turn calls > > set_direct_map_invalid_noflush() that makes the page not present. > > Ah. okay. > > I think active use of this feature will lead to performance degradation of > the system with time. > > Setting a single 4k page non-present in the direct mapping will require > splitting 2M or 1G page we usually map direct mapping with. And it's one > way road. We don't have any mechanism to map the memory with huge page > again after the application has freed the page. > > It might be okay if all these pages cluster together, but I don't think we > have a way to achieve it easily. Still, it would be worth exploring what that would look like if not for MAP_EXCLUSIVE then set_mce_nospec() that wants to punch out poison pages from the direct map. In the case of pmem, where those pages are able to be repaired, it would be nice to also repair the mapping granularity of the direct map.

On Mon, Oct 28, 2019 at 10:43:51PM -0700, Dan Williams wrote:
> On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > >
> > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > > the owning process and can be used by applications to store secret
> > > > > information that will not be visible not only to other processes but to the
> > > > > kernel as well.
> > > > >
> > > > > The pages in these mappings are removed from the kernel direct map and
> > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > > the pages are mapped back into the direct map.
> > > >
> > > > I probably blind, but I don't see where you manipulate direct map...
> > >
> > > __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> > > set_direct_map_invalid_noflush() that makes the page not present.
> >
> > Ah. okay.
> >
> > I think active use of this feature will lead to performance degradation of
> > the system with time.
> >
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> >
> > It might be okay if all these pages cluster together, but I don't think we
> > have a way to achieve it easily.
>
> Still, it would be worth exploring what that would look like if not
> for MAP_EXCLUSIVE then set_mce_nospec() that wants to punch out poison
> pages from the direct map. In the case of pmem, where those pages are
> able to be repaired, it would be nice to also repair the mapping
> granularity of the direct map.

The solution has to consist of two parts: finding a range to collapse and actually collapsing the range into a huge page.

Finding the collapsible range will likely require background scanning of the direct mapping, as we do for THP with khugepaged. It should not be too hard, but it will likely require long and tedious tuning to be effective without being too disturbing for the system.

Alternatively, after any change to the direct mapping, we can initiate a check of whether the range is collapsible, up to 1G around the changed 4k. It might be more taxing than scanning if the direct mapping changes often.

Collapsing itself appears to be simple: re-check if the range is collapsible under the lock, replace the page table with the huge page and flush the TLB.

But some CPUs don't like to have two TLB entries for the same memory with different sizes at the same time. See for instance AMD erratum 383.

Getting it right would require making the range not present, flushing the TLB and only then installing the huge page. That's what we do for userspace.

It will not fly for the direct mapping. There is no reasonable way to exclude other CPUs from accessing the range while it's not present (call stop_machine()? :P). Moreover, the range may contain the code that is doing the collapse or data required for it...

BTW, it looks like the current __split_large_page() in pageattr.c is susceptible to the erratum. Maybe we can get away with the easy way...
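
The "finding a range to collapse" half of the problem above can be sketched as a check over one page table's worth of entries. This is a minimal user-space model, assuming 512 4K entries per 2M range. `struct sim_pte` and `range_is_collapsible` are hypothetical names, not the kernel's page-table types, and the sketch deliberately leaves out locking and the TLB and erratum concerns raised in the email.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 512 4K entries make up one 2M range. */
#define PTES_PER_PMD 512

/* Hypothetical, simplified stand-in for a 4K page-table entry. */
struct sim_pte {
    uint64_t pfn;     /* physical frame number the entry maps */
    bool     present; /* false once the page was punched out */
    uint32_t prot;    /* protection bits, opaque in this model */
};

/*
 * A 2M range can be mapped by a single huge entry again only if every
 * 4K entry is present, the frames are physically contiguous starting at
 * a 2M-aligned frame, and all entries share the same protection bits.
 */
bool range_is_collapsible(const struct sim_pte *ptes)
{
    assert(ptes != NULL);

    if (!ptes[0].present || (ptes[0].pfn % PTES_PER_PMD) != 0)
        return false;

    for (size_t i = 1; i < PTES_PER_PMD; i++) {
        if (!ptes[i].present)
            return false;
        if (ptes[i].pfn != ptes[0].pfn + i)
            return false;
        if (ptes[i].prot != ptes[0].prot)
            return false;
    }
    return true;
}
```

In this model a single non-present or differently protected 4K entry is enough to keep the whole 2M range split, which is why clustering such pages matters.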

On Mon, 28 Oct 2019, Kirill A. Shutemov wrote:
> Setting a single 4k page non-present in the direct mapping will require
> splitting 2M or 1G page we usually map direct mapping with. And it's one
> way road. We don't have any mechanism to map the memory with huge page
> again after the application has freed the page.
>
> It might be okay if all these pages cluster together, but I don't think we
> have a way to achieve it easily.

Set aside a special physical memory range for this and migrate the page to that physical memory range when MAP_EXCLUSIVE is specified?

Maybe some processors also have hardware ranges that offer additional protection for stuff like that?

On Tue, Oct 29, 2019 at 07:08:42AM +0000, Christopher Lameter wrote:
> On Mon, 28 Oct 2019, Kirill A. Shutemov wrote:
>
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting 2M or 1G page we usually map direct mapping with. And it's one
> > way road. We don't have any mechanism to map the memory with huge page
> > again after the application has freed the page.
> >
> > It might be okay if all these pages cluster together, but I don't think we
> > have a way to achieve it easily.
>
> Set aside a special physical memory range for this and migrate the
> page to that physical memory range when MAP_EXCLUSIVE is specified?

I talked with Thomas yesterday and he suggested something similar:

When the MAP_EXCLUSIVE request comes for the first time, we allocate a huge page for it and then use this page as a pool of 4K pages for subsequent requests. Once this huge page is full we allocate a new one and append it to the pool. When all the 4K pages that comprise the huge page are freed, the huge page is collapsed.

And then on top of this we can look into compaction of the direct map.

Of course, this would only work if the easy way of collapsing direct map pages that Kirill mentioned in the other mail works.

> Maybe some processors also have hardware ranges that offer additional
> protection for stuff like that?
>
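
The pooling idea above could look roughly like the following user-space model. It is only a sketch under stated assumptions: a 2M huge page is modelled as 512 slots tracked by a bitmap, `calloc()` and `free()` stand in for allocating a huge page and for collapsing and returning it, and all type and function names (`excl_pool`, `pool_alloc`, `pool_free`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOTS_PER_HUGE 512 /* assumption: 4K pages in one 2M huge page */

/* Model of one huge page that is handed out in 4K slots. */
struct huge_chunk {
    uint64_t used[SLOTS_PER_HUGE / 64]; /* one bit per 4K slot */
    unsigned int nr_used;
    struct huge_chunk *next;
};

struct excl_pool {
    struct huge_chunk *head;
    unsigned int nr_chunks;
};

struct excl_slot {
    struct huge_chunk *chunk; /* NULL if allocation failed */
    int index;
};

/* Claim the first free slot of a chunk that is known to have one. */
static int chunk_take_slot(struct huge_chunk *c)
{
    for (int i = 0; i < SLOTS_PER_HUGE; i++) {
        uint64_t bit = UINT64_C(1) << (i % 64);
        if (!(c->used[i / 64] & bit)) {
            c->used[i / 64] |= bit;
            c->nr_used++;
            return i;
        }
    }
    return -1;
}

/* Hand out a 4K slot, adding a new huge chunk only when all are full. */
struct excl_slot pool_alloc(struct excl_pool *p)
{
    struct excl_slot s = { NULL, -1 };

    for (struct huge_chunk *c = p->head; c; c = c->next) {
        if (c->nr_used < SLOTS_PER_HUGE) {
            s.chunk = c;
            s.index = chunk_take_slot(c);
            return s;
        }
    }

    struct huge_chunk *c = calloc(1, sizeof(*c)); /* "allocate a huge page" */
    if (!c)
        return s;
    c->next = p->head;
    p->head = c;
    p->nr_chunks++;
    s.chunk = c;
    s.index = chunk_take_slot(c);
    return s;
}

/* Release a slot; returns true if its chunk became empty and was dropped. */
bool pool_free(struct excl_pool *p, struct excl_slot s)
{
    uint64_t bit = UINT64_C(1) << (s.index % 64);

    assert(s.chunk && (s.chunk->used[s.index / 64] & bit));
    s.chunk->used[s.index / 64] &= ~bit;
    if (--s.chunk->nr_used > 0)
        return false;

    /* Last slot freed: unlink the chunk, i.e. "collapse" the huge page. */
    struct huge_chunk **pp = &p->head;
    while (*pp != s.chunk)
        pp = &(*pp)->next;
    *pp = s.chunk->next;
    free(s.chunk);
    p->nr_chunks--;
    return true;
}
```

The point of the design is that direct-map damage stays confined to the chunks owned by the pool: a second huge page is touched only after the first 512 slots are in use, and a chunk becomes a collapse candidate as soon as its last slot is freed.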

On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote:
> But some CPUs don't like to have two TLB entries for the same memory with
> different sizes at the same time. See for instance AMD erratum 383.
>
> Getting it right would require making the range not present, flush TLB and
> only then install huge page. That's what we do for userspace.
>
> It will not fly for the direct mapping. There is no reasonable way to
> exclude other CPU from accessing the range while it's not present (call
> stop_machine()? :P). Moreover, the range may contain the code that doing
> the collapse or data required for it...
>
> BTW, looks like current __split_large_page() in pageattr.c is susceptible
> to the errata. Maybe we can get away with the easy way...

As you write above, there is just no way we can have a (temporary) hole in the direct map.

We are careful about that other errata, and make sure both translations are identical wrt everything else.

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> On 10/27/19 3:17 AM, Mike Rapoport wrote:
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
>
> This looks fun. It's certainly simple.
>
> But, the description is not really calling out the pros and cons very
> well. I'm also not sure that folks will use an interface like this that
> requires up-front, special code to do an allocation instead of something
> like madvise(). That's why protection keys ended up the way it did: if
> you do this as a mmap() replacement, you need to modify all *allocators*
> to be enabled for this. If you do it with mprotect()-style, you can
> apply it to existing allocations.

Actually, I started with mprotect() and then realized that mmap() would be simpler, so I switched over to mmap().

> Some other random thoughts:
>
> * The page flag is probably not a good idea. It would be probably
>   better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
>   into the slow path.

The page flag won't work on 32-bit, indeed. But do we really need such functionality on 32-bit?

> * This really stops being "normal" memory. You can't do futexes on it,
>   cant splice it. Probably need a more fleshed-out list of
>   incompatible features.

True, my bad. I should have mentioned more than THP/compaction/migration.

> * As Kirill noted, each 4k page ends up with a potential 1GB "blast
>   radius" of demoted pages in the direct map. Not cool. This is
>   probably a non-starter as it stands.
> * The global TLB flushes are going to eat you alive. They probably
>   border on a DoS on larger systems.

As I wrote in another email, we could use some kind of pooling to reduce the "blast radius", and that would reduce the number of TLB flushes as well. The size of MAP_EXCLUSIVE mappings obeys RLIMIT_MEMLOCK, and we can add a system-wide limit on the size of such allocations.

> * Do we really want this user interface to dictate the kernel
>   implementation? In other words, do we really want MAP_EXCLUSIVE,
>   or do we want MAP_SECRET? One tells the kernel what do *do*, the
>   other tells the kernel what the memory *IS*.

I hesitated for quite some time between EXCLUSIVE and SECRET. I settled on EXCLUSIVE because, in my view, it better describes the fact that the region is mapped only in its owner's address space. As such it can be used to store secrets, but it can be used for other purposes as well.

> * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
>   Persistent Memory, where the kernel direct map is a liability in some
>   way. We probably need some kind of overall, architected solution
>   rather than five or ten things all poking at the direct map.

Agree.

On Mon, Oct 28, 2019 at 11:08:08AM -0700, Matthew Wilcox wrote:
> On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> > Some other random thoughts:
> >
> > * The page flag is probably not a good idea. It would be probably
> >   better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
> >   into the slow path.
> > * This really stops being "normal" memory. You can't do futexes on it,
> >   cant splice it. Probably need a more fleshed-out list of
> >   incompatible features.
> > * As Kirill noted, each 4k page ends up with a potential 1GB "blast
> >   radius" of demoted pages in the direct map. Not cool. This is
> >   probably a non-starter as it stands.
> > * The global TLB flushes are going to eat you alive. They probably
> >   border on a DoS on larger systems.
> > * Do we really want this user interface to dictate the kernel
> >   implementation? In other words, do we really want MAP_EXCLUSIVE,
> >   or do we want MAP_SECRET? One tells the kernel what do *do*, the
> >   other tells the kernel what the memory *IS*.
> > * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
> >   Persistent Memory, where the kernel direct map is a liability in some
> >   way. We probably need some kind of overall, architected solution
> >   rather than five or ten things all poking at the direct map.
>
> Another random set of thoughts:
>
> - Should devices be permitted to DMA to/from MAP_SECRET pages?

I can't say I have a clear-cut yes or no here. One possible use case for such pages is to read secrets from storage directly into them. On the other hand, DMA to/from a device can be used to exploit those secrets...

> - How about GUP?

Do you mean GUP for "remote" memory? I'd say no.

> - Can I ptrace my way into another process's secret pages?

No.

> - What if I splice() the page into a pipe?

I think it should fail.
On Tue, 29 Oct 2019, Mike Rapoport wrote:

> I've talked with Thomas yesterday and he suggested something similar:
>
> When the MAP_EXCLUSIVE request comes for the first time, we allocate a huge
> page for it and then use this page as a pool of 4K pages for subsequent
> requests. Once this huge page is full we allocate a new one and append it
> to the pool. When all the 4K pages that comprise the huge page are freed
> the huge page is collapsed.

Or write a device driver that allows you to mmap a secure area and avoid
all core kernel modifications?

/dev/securemem or so?

It may exist already.
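The pooling scheme quoted above (one huge page carved into 4K pages, with the huge page collapsed again once every 4K page has been freed) can be sketched in user space as a bitmap allocator. This is an illustration only: `struct excl_pool` and the `excl_pool_*()` functions are hypothetical names, the 512-slot figure assumes 2 MiB huge pages with 4 KiB base pages, and no direct map manipulation is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SLOTS_PER_HUGE_PAGE 512	/* 2 MiB / 4 KiB, an x86-64 assumption */
#define BITS_PER_WORD       64
#define POOL_WORDS          (SLOTS_PER_HUGE_PAGE / BITS_PER_WORD)

/* Hypothetical bookkeeping for one huge page used as a pool of 4K pages. */
struct excl_pool {
	uint64_t used[POOL_WORDS];	/* bit set: slot is handed out */
	unsigned int nr_used;
};

void excl_pool_init(struct excl_pool *pool)
{
	memset(pool, 0, sizeof(*pool));
}

/* Returns a slot index in [0, 512), or -1 when the huge page is full. */
int excl_pool_alloc(struct excl_pool *pool)
{
	for (int w = 0; w < POOL_WORDS; w++) {
		if (pool->used[w] == UINT64_MAX)
			continue;
		for (int b = 0; b < BITS_PER_WORD; b++) {
			uint64_t mask = (uint64_t)1 << b;

			if (!(pool->used[w] & mask)) {
				pool->used[w] |= mask;
				pool->nr_used++;
				return w * BITS_PER_WORD + b;
			}
		}
	}
	return -1;	/* caller would allocate a new huge page here */
}

/*
 * Frees a slot. Returns true when the last slot was freed, i.e. when the
 * backing huge page could be collapsed and returned.
 */
bool excl_pool_free(struct excl_pool *pool, int slot)
{
	uint64_t mask;

	assert(slot >= 0 && slot < SLOTS_PER_HUGE_PAGE);
	mask = (uint64_t)1 << (slot % BITS_PER_WORD);
	assert(pool->used[slot / BITS_PER_WORD] & mask);

	pool->used[slot / BITS_PER_WORD] &= ~mask;
	pool->nr_used--;
	return pool->nr_used == 0;
}
```

The point of the design is that only the pool's own huge page is ever split in the direct map, so fragmentation stays confined to memory that is already dedicated to exclusive mappings.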
On Tue, Oct 29, 2019 at 09:56:02AM +0100, Peter Zijlstra wrote: > On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote: > > But some CPUs don't like to have two TLB entries for the same memory with > > different sizes at the same time. See for instance AMD erratum 383. > > > > Getting it right would require making the range not present, flush TLB and > > only then install huge page. That's what we do for userspace. > > > > It will not fly for the direct mapping. There is no reasonable way to > > exclude other CPU from accessing the range while it's not present (call > > stop_machine()? :P). Moreover, the range may contain the code that doing > > the collapse or data required for it... > > > > BTW, looks like current __split_large_page() in pageattr.c is susceptible > > to the errata. Maybe we can get away with the easy way... > > As you write above, there is just no way we can have a (temporary) hole > in the direct map. > > We are careful about that other errata, and make sure both translations > are identical wrt everything else. It's not clear if it is enough to avoid the issue. "under a highly specific and detailed set of conditions" is not very specific set of conditions :P
On 27.10.19 11:17, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> The mappings created with MAP_EXCLUSIVE are visible only in the context of
> the owning process and can be used by applications to store secret
> information that will not be visible not only to other processes but to the
> kernel as well.
>
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.
>

Just a thought, the kernel is still able to indirectly read the contents of
these pages by doing a kdump from kexec environment, right?

Also, I wonder what would happen if you map such pages via /dev/mem into
another user space application and e.g., use them along with kvm [1].

[1] https://lwn.net/Articles/778240/
On Tue, Oct 29, 2019 at 02:00:24PM +0300, Kirill A. Shutemov wrote: > On Tue, Oct 29, 2019 at 09:56:02AM +0100, Peter Zijlstra wrote: > > On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote: > > > But some CPUs don't like to have two TLB entries for the same memory with > > > different sizes at the same time. See for instance AMD erratum 383. > > > > > > Getting it right would require making the range not present, flush TLB and > > > only then install huge page. That's what we do for userspace. > > > > > > It will not fly for the direct mapping. There is no reasonable way to > > > exclude other CPU from accessing the range while it's not present (call > > > stop_machine()? :P). Moreover, the range may contain the code that doing > > > the collapse or data required for it... > > > > > > BTW, looks like current __split_large_page() in pageattr.c is susceptible > > > to the errata. Maybe we can get away with the easy way... > > > > As you write above, there is just no way we can have a (temporary) hole > > in the direct map. > > > > We are careful about that other errata, and make sure both translations > > are identical wrt everything else. > > It's not clear if it is enough to avoid the issue. "under a highly specific > and detailed set of conditions" is not very specific set of conditions :P Yeah, I know ... :/ Tom is there any chance you could shed a little more light on that errata?
On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote: > On Mon, Oct 28, 2019 at 07:59:25PM +0000, Edgecombe, Rick P wrote: > > On Mon, 2019-10-28 at 14:55 +0100, Peter Zijlstra wrote: > > > On Mon, Oct 28, 2019 at 04:16:23PM +0300, Kirill A. Shutemov wrote: > > > > > > > I think active use of this feature will lead to performance degradation > > > > of > > > > the system with time. > > > > > > > > Setting a single 4k page non-present in the direct mapping will require > > > > splitting 2M or 1G page we usually map direct mapping with. And it's one > > > > way road. We don't have any mechanism to map the memory with huge page > > > > again after the application has freed the page. > > > > > > Right, we recently had a 'bug' where ftrace triggered something like > > > this and facebook ran into it as a performance regression. So yes, this > > > is a real concern. > > > > Don't e/cBPF filters also break the direct map down to 4k pages when calling > > set_memory_ro() on the filter for 64 bit x86 and arm? > > > > I've been wondering if the page allocator should make some effort to find a > > broken down page for anything that can be known will have direct map > > permissions > > changed (or if it already groups them somehow). But also, why any potential > > slowdown of 4k pages on the direct map hasn't been noticed for apps that do > > a > > lot of insertions and removals of BPF filters, if this is indeed the case. > > That should be limited to the module range. Random data maps could > shatter the world. BPF has one vmalloc space allocation for the byte code and one for the module space allocation for the JIT. Both get RO also set on the direct map alias of the pages, and reset RW when freed. You mean shatter performance?
On Mon, Oct 28, 2019 at 11:43 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Mon, Oct 28, 2019 at 10:43:51PM -0700, Dan Williams wrote: > > On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > > > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote: > > > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote: > > > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote: > > > > > > From: Mike Rapoport <rppt@linux.ibm.com> > > > > > > > > > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of > > > > > > the owning process and can be used by applications to store secret > > > > > > information that will not be visible not only to other processes but to the > > > > > > kernel as well. > > > > > > > > > > > > The pages in these mappings are removed from the kernel direct map and > > > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped, > > > > > > the pages are mapped back into the direct map. > > > > > > > > > > I probably blind, but I don't see where you manipulate direct map... > > > > > > > > __get_user_pages() calls __set_page_user_exclusive() which in turn calls > > > > set_direct_map_invalid_noflush() that makes the page not present. > > > > > > Ah. okay. > > > > > > I think active use of this feature will lead to performance degradation of > > > the system with time. > > > > > > Setting a single 4k page non-present in the direct mapping will require > > > splitting 2M or 1G page we usually map direct mapping with. And it's one > > > way road. We don't have any mechanism to map the memory with huge page > > > again after the application has freed the page. > > > > > > It might be okay if all these pages cluster together, but I don't think we > > > have a way to achieve it easily. 
> > > > Still, it would be worth exploring what that would look like if not > > for MAP_EXCLUSIVE then set_mce_nospec() that wants to punch out poison > > pages from the direct map. In the case of pmem, where those pages are > > able to be repaired, it would be nice to also repair the mapping > > granularity of the direct map. > > The solution has to consist of two parts: finding a range to collapse and > actually collapsing the range into a huge page. > > Finding the collapsible range will likely require background scanning of > the direct mapping as we do for THP with khugepaged. It should not too > hard, but likely require long and tedious tuning to be effective, but not > too disturbing for the system. > > Alternatively, after any changes to the direct mapping, we can initiate > checking if the range is collapsible. Up to 1G around the changed 4k. > It might be more taxing than scanning if direct mapping changes often. > > Collapsing itself appears to be simple: re-check if the range is > collapsible under the lock, replace the page table with the huge page and > flush the TLB. > > But some CPUs don't like to have two TLB entries for the same memory with > different sizes at the same time. See for instance AMD erratum 383. That basic description would seem to defeat most (all?) interesting huge page use cases. For example dax makes no attempt to make sure aliased mappings of pmem are the same size between the direct map that the driver uses, and userspace dax mappings. So I assume there are more details than "all aliased mappings must be the same size". > Getting it right would require making the range not present, flush TLB and > only then install huge page. That's what we do for userspace. > > It will not fly for the direct mapping. There is no reasonable way to > exclude other CPU from accessing the range while it's not present (call > stop_machine()? :P). Moreover, the range may contain the code that doing > the collapse or data required for it... 
At least for pmem all the access points can be controlled. pmem is never used for kernel text at least in the dax mode where it is accessed via file-backed shared mappings, or the pmem driver. So when I say "direct-map repair" I mean the incidental direct-map that pmem uses since it maps pmem with arch_add_memory(), not the typical DRAM direct-map that may house kernel text. Poison consumed from the kernel DRAM direct-map is fatal, poison consumed from dax mappings and the pmem driver path is recoverable and repairable.
On 10/29/19 12:43 PM, Dan Williams wrote:
>> But some CPUs don't like to have two TLB entries for the same memory with
>> different sizes at the same time. See for instance AMD erratum 383.
> That basic description would seem to defeat most (all?) interesting
> huge page use cases. For example dax makes no attempt to make sure
> aliased mappings of pmem are the same size between the direct map that
> the driver uses, and userspace dax mappings. So I assume there are
> more details than "all aliased mappings must be the same size".

These are about when large and small TLB entries could be held in the
TLB at the same time for the same virtual address in the same process.
It doesn't matter that two *different* mappings are using different page
sizes.

Imagine you were *just* changing the page size. Without these errata,
you could just skip flushing the TLB. You might use the old hardware
page size for a while, but it will be functionally OK.

With these errata, we need to ensure in software that the old TLB
entries for the old page size are flushed before the new page size is
established.
On Tue, Oct 29, 2019 at 10:12:04AM +0000, Christopher Lameter wrote:
>
>
> On Tue, 29 Oct 2019, Mike Rapoport wrote:
>
> > I've talked with Thomas yesterday and he suggested something similar:
> >
> > When the MAP_EXCLUSIVE request comes for the first time, we allocate a huge
> > page for it and then use this page as a pool of 4K pages for subsequent
> > requests. Once this huge page is full we allocate a new one and append it
> > to the pool. When all the 4K pages that comprise the huge page are freed
> > the huge page is collapsed.
>
> Or write a device driver that allows you to mmap a secure area and avoid
> all core kernel modifications?
>
> /dev/securemem or so?

A device driver will need to remove the secure area from the direct map and
then we are back to square one.

> It may exist already.
>
On Tue, Oct 29, 2019 at 12:02:34PM +0100, David Hildenbrand wrote: > On 27.10.19 11:17, Mike Rapoport wrote: > >From: Mike Rapoport <rppt@linux.ibm.com> > > > >The mappings created with MAP_EXCLUSIVE are visible only in the context of > >the owning process and can be used by applications to store secret > >information that will not be visible not only to other processes but to the > >kernel as well. > > > >The pages in these mappings are removed from the kernel direct map and > >marked with PG_user_exclusive flag. When the exclusive area is unmapped, > >the pages are mapped back into the direct map. > > > > Just a thought, the kernel is still able to indirectly read the contents of > these pages by doing a kdump from kexec environment, right? Right. > Also, I wonder > what would happen if you map such pages via /dev/mem into another user space > application and e.g., use them along with kvm [1]. Do you mean that one application creates MAP_EXCLUSIVE and another applications accesses the same physical pages via /dev/mem? With /dev/mem all physical memory is visible... > [1] https://lwn.net/Articles/778240/ > > -- > > Thanks, > > David / dhildenb >
On 30.10.19 09:15, Mike Rapoport wrote: > On Tue, Oct 29, 2019 at 12:02:34PM +0100, David Hildenbrand wrote: >> On 27.10.19 11:17, Mike Rapoport wrote: >>> From: Mike Rapoport <rppt@linux.ibm.com> >>> >>> The mappings created with MAP_EXCLUSIVE are visible only in the context of >>> the owning process and can be used by applications to store secret >>> information that will not be visible not only to other processes but to the >>> kernel as well. >>> >>> The pages in these mappings are removed from the kernel direct map and >>> marked with PG_user_exclusive flag. When the exclusive area is unmapped, >>> the pages are mapped back into the direct map. >>> >> >> Just a thought, the kernel is still able to indirectly read the contents of >> these pages by doing a kdump from kexec environment, right? > > Right. > >> Also, I wonder >> what would happen if you map such pages via /dev/mem into another user space >> application and e.g., use them along with kvm [1]. > > Do you mean that one application creates MAP_EXCLUSIVE and another > applications accesses the same physical pages via /dev/mem? Exactly. > > With /dev/mem all physical memory is visible... Okay, so the statement "information that will not be visible not only to other processes but to the kernel as well" is not correct. There are easy ways to access that information if you really want to (might require root permissions, though).
On Tue, Oct 29, 2019 at 05:27:43PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote:
> > That should be limited to the module range. Random data maps could
> > shatter the world.
>
> BPF has one vmalloc space allocation for the byte code and one for the module
> space allocation for the JIT. Both get RO also set on the direct map alias of
> the pages, and reset RW when freed.

Argh, I didn't know they mapped the bytecode RO; why does it do that? It
can throw out the bytecode once it's JIT'ed.

> You mean shatter performance?

Shatter (all) large pages.
On Wed, 30 Oct 2019, Mike Rapoport wrote:

> > /dev/securemem or so?
>
> A device driver will need to remove the secure area from the direct map and
> then we back to square one.

We have avoided the need for modifications to kernel core code. And it's a
natural thing to treat this like special memory provided by a device driver.
On Wed, Oct 30, 2019 at 3:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Oct 29, 2019 at 05:27:43PM +0000, Edgecombe, Rick P wrote: > > On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote: > > > > That should be limited to the module range. Random data maps could > > > shatter the world. > > > > BPF has one vmalloc space allocation for the byte code and one for the module > > space allocation for the JIT. Both get RO also set on the direct map alias of > > the pages, and reset RW when freed. > > Argh, I didn't know they mapped the bytecode RO; why does it do that? It > can throw out the bytecode once it's JIT'ed. because of endless security "concerns" that some folks had. Like what if something can exploit another bug in the kernel and modify bytecode that was already verified then interpreter will execute that modified bytecode. Sort of similar reasoning why .text is read-only. I think it's not a realistic attack, but I didn't bother to argue back then. The mere presence of interpreter itself is a real security concern. People that care about speculation attacks should have CONFIG_BPF_JIT_ALWAYS_ON=y, so modifying bytecode via another exploit will be pointless. Getting rid of RO for bytecode will save a ton of memory too, since we won't need to allocate full page for each small programs.
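The memory argument at the end of this message (a full page per small program, because read-only protection works at page granularity) comes down to rounding arithmetic. A rough sketch, assuming 4 KiB pages and ignoring allocator metadata; the function names are hypothetical and the "packed" case is a simplification in which programs share pages back to back:

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u	/* assumed base page size */

/* Number of whole pages needed to hold 'bytes'. */
size_t pages_needed(size_t bytes)
{
	return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

/* Bytes lost to rounding when every program gets its own page(s). */
size_t wasted_page_per_prog(const size_t *sizes, size_t n)
{
	size_t wasted = 0;

	for (size_t i = 0; i < n; i++)
		wasted += pages_needed(sizes[i]) * PAGE_SIZE_BYTES - sizes[i];
	return wasted;
}

/* Bytes lost to rounding when all programs are packed together. */
size_t wasted_packed(const size_t *sizes, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += sizes[i];
	return pages_needed(total) * PAGE_SIZE_BYTES - total;
}
```

For three programs of 200, 300 and 5000 bytes, the page-per-program layout leaves 10884 bytes unused, while the packed layout leaves 2692.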
On Wed, 2019-10-30 at 11:04 +0100, Peter Zijlstra wrote:
> > You mean shatter performance?
>
> Shatter (all) large pages.

So it looks like this is already happening then to some degree. It's not just
BPF either, any module_alloc() user is going to do something similar with the
direct map alias of the page they got for the text.

So there must be at least some usages where breaking the direct map down, for
like a page to store a key or something, isn't totally horrible.
On 10/30/19 10:48 AM, Edgecombe, Rick P wrote:
> On Wed, 2019-10-30 at 11:04 +0100, Peter Zijlstra wrote:
>>> You mean shatter performance?
>>
>> Shatter (all) large pages.
>
> So it looks like this is already happening then to some degree. It's not just
> BPF either, any module_alloc() user is going to do something similar with the
> direct map alias of the page they got for the text.
>
> So there must be at least some usages where breaking the direct map down, for
> like a page to store a key or something, isn't totally horrible.

The systems that really need large pages are the large ones. They have
the same TLBs and data structures as really little systems, but orders
of magnitude more address space. Modules and BPF are (hopefully) a drop
in the bucket on small systems and they're really inconsequential on
really big systems.

Modules also require privilege.

Allowing random user apps to fracture the direct map for every page of
their memory or *lots* of pages of their memory is an entirely different
kind of problem from modules. It takes a "drop in the bucket"
fracturing and turns it into the common case.
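The argument about large systems can be made concrete with a little arithmetic on "TLB reach", the amount of address space a TLB covers without a miss. The sketch below is illustrative only: the helper names are hypothetical, and any entry count plugged in (such as 1536 below) is an assumed example, not a figure from this thread.

```c
#include <stdint.h>

#define SZ_4K 4096ULL
#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

/* Address space covered by 'entries' TLB entries of 'page_size' bytes. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
	return entries * page_size;
}

/* Number of 4 KiB entries that replace one large mapping once it is split. */
uint64_t split_entries(uint64_t large_page_size)
{
	return large_page_size / SZ_4K;
}
```

With an assumed 1536-entry TLB, 4 KiB pages give a reach of 6 MiB and 2 MiB pages give 3 GiB. Splitting one 2 MiB direct map entry yields 512 small entries, and splitting a 1 GiB entry all the way down yields 262144, which is the "blast radius" mentioned earlier in the thread.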
On 10/30/19 10:58 AM, Dave Hansen wrote:
> Modules also require privilege.
IMNHO, if BPF is fracturing large swaths of the direct map with no
privilege, it's only a matter of time until it starts to cause problems.
The fact that we do it today is only evidence that we have a ticking
time bomb, not that it's OK.
On Wed, Oct 30, 2019 at 08:35:09AM -0700, Alexei Starovoitov wrote: > On Wed, Oct 30, 2019 at 3:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Tue, Oct 29, 2019 at 05:27:43PM +0000, Edgecombe, Rick P wrote: > > > On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote: > > > > > > That should be limited to the module range. Random data maps could > > > > shatter the world. > > > > > > BPF has one vmalloc space allocation for the byte code and one for the module > > > space allocation for the JIT. Both get RO also set on the direct map alias of > > > the pages, and reset RW when freed. > > > > Argh, I didn't know they mapped the bytecode RO; why does it do that? It > > can throw out the bytecode once it's JIT'ed. > > because of endless security "concerns" that some folks had. > Like what if something can exploit another bug in the kernel > and modify bytecode that was already verified > then interpreter will execute that modified bytecode. But when it's JIT'ed the bytecode is no longer of relevance, right? So any scenario with a JIT on can then toss the bytecode and certainly doesn't need to map it RO. > Sort of similar reasoning why .text is read-only. > I think it's not a realistic attack, but I didn't bother to argue back then. > The mere presence of interpreter itself is a real security concern. > People that care about speculation attacks should > have CONFIG_BPF_JIT_ALWAYS_ON=y, This isn't about speculation attacks, it is about breaking buffer limits and being able to write to memory. And in that respect being able to change the current task state (write it's effective PID to 0) is much simpler than writing to text or bytecode, but if you cannot reach/find the task struct but can reach/find text.. > so modifying bytecode via another exploit will be pointless. > Getting rid of RO for bytecode will save a ton of memory too, > since we won't need to allocate full page for each small programs. 
So I'm thinking we can get rid of that for any scenario that has the JIT enabled -- not only JIT_ALWAYS_ON.
On Wed, Oct 30, 2019 at 11:39 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Oct 30, 2019 at 08:35:09AM -0700, Alexei Starovoitov wrote: > > On Wed, Oct 30, 2019 at 3:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Tue, Oct 29, 2019 at 05:27:43PM +0000, Edgecombe, Rick P wrote: > > > > On Mon, 2019-10-28 at 22:00 +0100, Peter Zijlstra wrote: > > > > > > > > That should be limited to the module range. Random data maps could > > > > > shatter the world. > > > > > > > > BPF has one vmalloc space allocation for the byte code and one for the module > > > > space allocation for the JIT. Both get RO also set on the direct map alias of > > > > the pages, and reset RW when freed. > > > > > > Argh, I didn't know they mapped the bytecode RO; why does it do that? It > > > can throw out the bytecode once it's JIT'ed. > > > > because of endless security "concerns" that some folks had. > > Like what if something can exploit another bug in the kernel > > and modify bytecode that was already verified > > then interpreter will execute that modified bytecode. > > But when it's JIT'ed the bytecode is no longer of relevance, right? So > any scenario with a JIT on can then toss the bytecode and certainly > doesn't need to map it RO. We keep so called "xlated" bytecode around for debugging. It's the one that is actually running. It was modified through several stages of the verifier before being runnable by interpreter. When folks debug stuff in production they want to see the whole thing. Both x86 asm and xlated bytecode. xlated bytecode also sanitized before it's returned back to user space. > > Sort of similar reasoning why .text is read-only. > > I think it's not a realistic attack, but I didn't bother to argue back then. > > The mere presence of interpreter itself is a real security concern. 
> > People that care about speculation attacks should > > have CONFIG_BPF_JIT_ALWAYS_ON=y, > > This isn't about speculation attacks, it is about breaking buffer limits > and being able to write to memory. And in that respect being able to > change the current task state (write it's effective PID to 0) is much > simpler than writing to text or bytecode, but if you cannot reach/find > the task struct but can reach/find text.. exactly. that's why RO bytecode was dubious to me from the beginning. For an attacker to write meaningful bytecode they need to know quite a few other kernel internal pointers. If an exploit can write into memory there are plenty of easier targets. > > so modifying bytecode via another exploit will be pointless. > > Getting rid of RO for bytecode will save a ton of memory too, > > since we won't need to allocate full page for each small programs. > > So I'm thinking we can get rid of that for any scenario that has the JIT > enabled -- not only JIT_ALWAYS_ON. Sounds good to me. Happy to do that. Will add it to our todo list.
On Wed, Oct 30, 2019 at 09:19:33AM +0100, David Hildenbrand wrote: > On 30.10.19 09:15, Mike Rapoport wrote: > >On Tue, Oct 29, 2019 at 12:02:34PM +0100, David Hildenbrand wrote: > >>On 27.10.19 11:17, Mike Rapoport wrote: > >>>From: Mike Rapoport <rppt@linux.ibm.com> > >>> > >>>The mappings created with MAP_EXCLUSIVE are visible only in the context of > >>>the owning process and can be used by applications to store secret > >>>information that will not be visible not only to other processes but to the > >>>kernel as well. > >>> > >>>The pages in these mappings are removed from the kernel direct map and > >>>marked with PG_user_exclusive flag. When the exclusive area is unmapped, > >>>the pages are mapped back into the direct map. > >>> > >> > >>Just a thought, the kernel is still able to indirectly read the contents of > >>these pages by doing a kdump from kexec environment, right? > > > >Right. > > > >>Also, I wonder > >>what would happen if you map such pages via /dev/mem into another user space > >>application and e.g., use them along with kvm [1]. > > > >Do you mean that one application creates MAP_EXCLUSIVE and another > >applications accesses the same physical pages via /dev/mem? > > Exactly. > > > > >With /dev/mem all physical memory is visible... > > Okay, so the statement "information that will not be visible not only to > other processes but to the kernel as well" is not correct. There are easy > ways to access that information if you really want to (might require root > permissions, though). Right, but /dev/mem is an easy way to extract any information in any environment if one has root permissions... > -- > > Thanks, > > David / dhildenb >
On Thu, Oct 31, 2019 at 12:17 PM Mike Rapoport <rppt@kernel.org> wrote: > > On Wed, Oct 30, 2019 at 09:19:33AM +0100, David Hildenbrand wrote: > > On 30.10.19 09:15, Mike Rapoport wrote: > > >On Tue, Oct 29, 2019 at 12:02:34PM +0100, David Hildenbrand wrote: > > >>On 27.10.19 11:17, Mike Rapoport wrote: > > >>>From: Mike Rapoport <rppt@linux.ibm.com> > > >>> > > >>>The mappings created with MAP_EXCLUSIVE are visible only in the context of > > >>>the owning process and can be used by applications to store secret > > >>>information that will not be visible not only to other processes but to the > > >>>kernel as well. > > >>> > > >>>The pages in these mappings are removed from the kernel direct map and > > >>>marked with PG_user_exclusive flag. When the exclusive area is unmapped, > > >>>the pages are mapped back into the direct map. > > >>> > > >> > > >>Just a thought, the kernel is still able to indirectly read the contents of > > >>these pages by doing a kdump from kexec environment, right? > > > > > >Right. > > > > > >>Also, I wonder > > >>what would happen if you map such pages via /dev/mem into another user space > > >>application and e.g., use them along with kvm [1]. > > > > > >Do you mean that one application creates MAP_EXCLUSIVE and another > > >applications accesses the same physical pages via /dev/mem? > > > > Exactly. > > > > > > > >With /dev/mem all physical memory is visible... > > > > Okay, so the statement "information that will not be visible not only to > > other processes but to the kernel as well" is not correct. There are easy > > ways to access that information if you really want to (might require root > > permissions, though). > > Right, but /dev/mem is an easy way to extract any information in any > environment if one has root permissions... > I don't understand this concern with /dev/mem. Just add these pages to the growing list of the things /dev/mem is not allowed to touch.
On 10/29/19 7:39 AM, Peter Zijlstra wrote: > On Tue, Oct 29, 2019 at 02:00:24PM +0300, Kirill A. Shutemov wrote: >> On Tue, Oct 29, 2019 at 09:56:02AM +0100, Peter Zijlstra wrote: >>> On Tue, Oct 29, 2019 at 09:43:18AM +0300, Kirill A. Shutemov wrote: >>>> But some CPUs don't like to have two TLB entries for the same memory with >>>> different sizes at the same time. See for instance AMD erratum 383. >>>> >>>> Getting it right would require making the range not present, flush TLB and >>>> only then install huge page. That's what we do for userspace. >>>> >>>> It will not fly for the direct mapping. There is no reasonable way to >>>> exclude other CPU from accessing the range while it's not present (call >>>> stop_machine()? :P). Moreover, the range may contain the code that doing >>>> the collapse or data required for it... >>>> >>>> BTW, looks like current __split_large_page() in pageattr.c is susceptible >>>> to the errata. Maybe we can get away with the easy way... >>> >>> As you write above, there is just no way we can have a (temporary) hole >>> in the direct map. >>> >>> We are careful about that other errata, and make sure both translations >>> are identical wrt everything else. >> >> It's not clear if it is enough to avoid the issue. "under a highly specific >> and detailed set of conditions" is not very specific set of conditions :P > > Yeah, I know ... :/ Tom is there any chance you could shed a little more > light on that errata? I talked with some of the hardware folks and if you maintain the same bits in the large and small pages (aside from the large page bit) until the flush, then the errata should not occur. The errata really applies to mappings that end up with different attribute bits being set. Even then, it doesn't fail every time. There are other conditions required to make it fail. Thanks, Tom >
On Fri, Nov 15, 2019 at 08:12:52AM -0600, Tom Lendacky wrote:

> I talked with some of the hardware folks and if you maintain the same bits
> in the large and small pages (aside from the large page bit) until the
> flush, then the errata should not occur.

Excellent! Thanks for digging that out Tom.
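The condition Tom describes (identical bits in the small and large entries, apart from the large page bit) combines naturally with the "re-check that the range is collapsible" step discussed earlier in the thread. The check below is a user-space sketch under simplifying assumptions: entries are raw 64-bit values with the physical address in bits 12..51, everything else is treated as attribute bits, and details such as the PAT bit moving position in large entries are ignored. `range_is_collapsible()` and the macros are hypothetical names, not kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_PMD  512			/* 4 KiB entries per 2 MiB range */
#define PTE_PRESENT   0x1ULL
#define PTE_ADDR_MASK 0x000ffffffffff000ULL	/* bits 12..51 (assumed layout) */
#define PTE_ATTR_MASK (~PTE_ADDR_MASK)
#define PMD_SIZE      (1ULL << 21)

/*
 * True when 512 small entries map one physically contiguous, 2 MiB aligned
 * range with identical attribute bits, so that a single large entry with
 * the same attributes could replace them.
 */
bool range_is_collapsible(const uint64_t ptes[PTES_PER_PMD])
{
	uint64_t base = ptes[0] & PTE_ADDR_MASK;
	uint64_t attrs = ptes[0] & PTE_ATTR_MASK;

	if (!(attrs & PTE_PRESENT))
		return false;
	if (base & (PMD_SIZE - 1))
		return false;

	for (size_t i = 1; i < PTES_PER_PMD; i++) {
		/* Any differing attribute, including "not present", blocks it. */
		if ((ptes[i] & PTE_ATTR_MASK) != attrs)
			return false;
		if ((ptes[i] & PTE_ADDR_MASK) != base + ((uint64_t)i << 12))
			return false;
	}
	return true;
}
```

A single page that is still marked not present, as a MAP_EXCLUSIVE page would be, makes the whole range fail the check, which is why the range can only be collapsed after the exclusive page has been mapped back.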
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9ceacd1..8f73a75 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/context_tracking.h> /* exception_enter(), ... */
 #include <linux/uaccess.h> /* faulthandler_disabled() */
 #include <linux/efi.h> /* efi_recover_from_page_fault()*/
+#include <linux/page_excl.h> /* page_is_user_exclusive() */
 #include <linux/mm_types.h>
 
 #include <asm/cpufeature.h> /* boot_cpu_has, ... */
@@ -1218,6 +1219,13 @@ static int fault_in_kernel_space(unsigned long address)
 	return address >= TASK_SIZE_MAX;
 }
 
+static bool fault_in_user_exclusive_page(unsigned long address)
+{
+	struct page *page = virt_to_page(address);
+
+	return page_is_user_exclusive(page);
+}
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space. Might get called for faults that originate from *code* that
@@ -1261,6 +1269,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;
 
+	/* FIXME: warn and handle gracefully */
+	if (unlikely(fault_in_user_exclusive_page(address))) {
+		pr_err("page fault in user exclusive page at %lx", address);
+		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
+	}
+
 	/* kprobes don't want to hook the spurious faults: */
 	if (kprobe_page_fault(regs, X86_TRAP_PF))
 		return;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9442631..99e14d1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -655,6 +655,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_X86_INTEL_MPX
 		[ilog2(VM_MPX)] = "mp",
 #endif
+		[ilog2(VM_EXCLUSIVE)] = "xl",
 		[ilog2(VM_LOCKED)] = "lo",
 		[ilog2(VM_IO)] = "io",
 		[ilog2(VM_SEQ_READ)] = "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc29227..9c43375 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+# define VM_EXCLUSIVE VM_HIGH_ARCH_5
+#else
+# define VM_EXCLUSIVE VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP VM_NONE
 #endif
@@ -2594,6 +2602,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON 0x8000 /* don't do file mappings */
 #define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
+#define FOLL_EXCLUSIVE 0x40000 /* mapping is exclusive to owning mm */
 
 /*
  * NOTE on FOLL_LONGTERM:
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb88..32d0aee 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -131,6 +131,9 @@ enum pageflags {
 	PG_young,
 	PG_idle,
 #endif
+#if defined(CONFIG_EXCLUSIVE_USER_PAGES)
+	PG_user_exclusive,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -431,6 +434,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+__PAGEFLAG(UserExclusive, user_exclusive, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_excl.h b/include/linux/page_excl.h
new file mode 100644
index 0000000..b7ea3ce
--- /dev/null
+++ b/include/linux/page_excl.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MM_PAGE_EXCLUSIVE_H
+#define _LINUX_MM_PAGE_EXCLUSIVE_H
+
+#include <linux/bitops.h>
+#include <linux/page-flags.h>
+#include <linux/set_memory.h>
+#include <asm/tlbflush.h>
+
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+
+static inline bool page_is_user_exclusive(struct page *page)
+{
+	return PageUserExclusive(page);
+}
+
+static inline void __set_page_user_exclusive(struct page *page)
+{
+	unsigned long addr = (unsigned long)page_address(page);
+
+	__SetPageUserExclusive(page);
+	set_direct_map_invalid_noflush(page);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+}
+
+static inline void __clear_page_user_exclusive(struct page *page)
+{
+	__ClearPageUserExclusive(page);
+	set_direct_map_default_noflush(page);
+}
+
+#else /* !CONFIG_EXCLUSIVE_USER_PAGES */
+
+static inline bool page_is_user_exclusive(struct page *page)
+{
+	return false;
+}
+
+static inline void __set_page_user_exclusive(struct page *page)
+{
+}
+
+static inline void __clear_page_user_exclusive(struct page *page)
+{
+}
+
+#endif /* CONFIG_EXCLUSIVE_USER_PAGES */
+
+#endif /* _LINUX_MM_PAGE_EXCLUSIVE_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d4..2d3c14a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_EXCLUSIVE_USER_PAGES
+#define IF_HAVE_PG_USER_EXCLUSIVE(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_USER_EXCLUSIVE(flag,string)
+#endif
+
 #define __def_pageflag_names \
 	{1UL << PG_locked, "locked" }, \
 	{1UL << PG_waiters, "waiters" }, \
@@ -105,7 +111,8 @@ IF_HAVE_PG_MLOCK(PG_mlocked, "mlocked" ) \
 IF_HAVE_PG_UNCACHED(PG_uncached, "uncached" ) \
 IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \
 IF_HAVE_PG_IDLE(PG_young, "young" ) \
-IF_HAVE_PG_IDLE(PG_idle, "idle" )
+IF_HAVE_PG_IDLE(PG_idle, "idle" ) \
+IF_HAVE_PG_USER_EXCLUSIVE(PG_user_exclusive, "user_exclusive" )
 
 #define show_page_flags(flags) \
 	(flags) ? __print_flags(flags, "|", \
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index c160a53..bf8f23e 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -27,6 +27,7 @@
 #define MAP_HUGETLB 0x040000 /* create a huge page mapping */
 #define MAP_SYNC 0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_EXCLUSIVE 0x200000 /* mapping is exclusive to the owning task; the pages in it are dropped from the direct map */
 
 #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/kernel/fork.c b/kernel/fork.c
index bcdf531..d63adec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -518,7 +518,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
 
-		if (mpnt->vm_flags & VM_DONTCOPY) {
+		if (mpnt->vm_flags & VM_DONTCOPY ||
+		    mpnt->vm_flags & VM_EXCLUSIVE) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
 		}
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a..9d60141 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,7 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config EXCLUSIVE_USER_PAGES
+	def_bool ARCH_USES_HIGH_VMA_FLAGS && ARCH_HAS_SET_DIRECT_MAP
+
 endmenu
diff --git a/mm/gup.c b/mm/gup.c
index 8f236a3..a98c0ca0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -17,6 +17,7 @@
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/page_excl.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>
@@ -868,6 +869,10 @@ static long __get_user_pages(struct
task_struct *tsk, struct mm_struct *mm, ret = PTR_ERR(page); goto out; } + + if (gup_flags & FOLL_EXCLUSIVE) + __set_page_user_exclusive(page); + if (pages) { pages[i] = page; flush_anon_page(vma, page, start); @@ -1216,6 +1221,9 @@ long populate_vma_page_range(struct vm_area_struct *vma, if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) gup_flags |= FOLL_FORCE; + if (vma->vm_flags & VM_EXCLUSIVE) + gup_flags |= FOLL_EXCLUSIVE; + /* * We made sure addr is within a VMA, so the following will * not result in a stack expansion that recurses back here. diff --git a/mm/memory.c b/mm/memory.c index b1ca51a..a4b4cff 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -71,6 +71,7 @@ #include <linux/dax.h> #include <linux/oom.h> #include <linux/numa.h> +#include <linux/page_excl.h> #include <asm/io.h> #include <asm/mmu_context.h> @@ -1062,6 +1063,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, page_remove_rmap(page, false); if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); + if (page_is_user_exclusive(page)) + __clear_page_user_exclusive(page); if (unlikely(__tlb_remove_page(tlb, page))) { force_flush = 1; addr += PAGE_SIZE; diff --git a/mm/mmap.c b/mm/mmap.c index a7d8c84..d8cc82d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1574,6 +1574,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; } + if (flags & MAP_EXCLUSIVE) + vm_flags |= VM_EXCLUSIVE; + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || @@ -1591,6 +1594,19 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, addr = untagged_addr(addr); + if (flags & MAP_EXCLUSIVE) { + /* + * MAP_EXCLUSIVE is only supported for private + * anonymous memory not backed by hugetlbfs + */ + if (!(flags & MAP_ANONYMOUS) || !(flags & MAP_PRIVATE) || + (flags & MAP_HUGETLB)) + return -EINVAL; + + /* and impies MAP_LOCKED and MAP_POPULATE */ + flags |= (MAP_LOCKED | MAP_POPULATE); 
+ } + if (!(flags & MAP_ANONYMOUS)) { audit_mmap_fd(fd, flags); file = fget(fd); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ecc3dba..2f1de9d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -68,6 +68,7 @@ #include <linux/lockdep.h> #include <linux/nmi.h> #include <linux/psi.h> +#include <linux/page_excl.h> #include <asm/sections.h> #include <asm/tlbflush.h> @@ -4779,6 +4780,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, page = NULL; } + /* FIXME: should not happen! */ + if (WARN_ON(page_is_user_exclusive(page))) + __clear_page_user_exclusive(page); + trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype); return page;
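
For readers who want to see how the proposed flag would look from user space, here is a minimal sketch, not part of the patch. It assumes the flag value 0x200000 from the mman-common.h hunk above, which mainline headers do not define. The helper names exclusive_flags_check(), secret_alloc() and secret_free() are hypothetical and exist only for this illustration. exclusive_flags_check() mirrors the check the patch adds to ksys_mmap_pgoff(): MAP_EXCLUSIVE is accepted only together with MAP_PRIVATE and MAP_ANONYMOUS, and never with MAP_HUGETLB.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: value proposed by this RFC; mainline uapi headers lack it. */
#ifndef MAP_EXCLUSIVE
#define MAP_EXCLUSIVE 0x200000
#endif

/*
 * Hypothetical helper: user-space mirror of the check the patch adds to
 * ksys_mmap_pgoff(). Returns 0 if the flags are acceptable, -EINVAL if
 * MAP_EXCLUSIVE is combined with anything but private anonymous memory.
 */
int exclusive_flags_check(int flags)
{
	if (!(flags & MAP_EXCLUSIVE))
		return 0;

	if (!(flags & MAP_ANONYMOUS) || !(flags & MAP_PRIVATE) ||
	    (flags & MAP_HUGETLB))
		return -EINVAL;

	return 0;
}

/*
 * Hypothetical helper: request len bytes of exclusive memory.
 * Returns NULL on failure. With the patch applied the kernel would add
 * MAP_LOCKED and MAP_POPULATE on its own, so they are not passed here.
 */
void *secret_alloc(size_t len)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXCLUSIVE;
	void *p;

	if (exclusive_flags_check(flags))
		return NULL;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: wipe the region, then unmap it. Per the patch
 * description, unmapping puts the pages back into the direct map, so
 * the secret is cleared first.
 */
int secret_free(void *p, size_t len)
{
	memset(p, 0, len);
	return munmap(p, len);
}
```

A caller would pair secret_alloc(4096) with secret_free(p, 4096). A successful mmap() alone does not prove that the pages left the direct map, because a kernel without this patch may accept the call without acting on the flag; on a patched kernel the smaps hunk above would report "xl" in the VmFlags line of the mapping.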