Message ID | 20241014215022.68530-2-jeffxu@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | seal system mappings | expand |
(I don't think this needs "RFC" any more) On Mon, Oct 14, 2024 at 09:50:20PM +0000, jeffxu@chromium.org wrote: > From: Jeff Xu <jeffxu@chromium.org> > > Seal vdso, vvar, sigpage, uprobes and vsyscall. > > Those mappings are readonly or executable only, sealing can protect > them from ever changing during the life time of the process. For > complete descriptions of memory sealing, please see mseal.rst [1]. > > System mappings such as vdso, vvar, and sigpage (for arm) are > generated by the kernel during program initialization. These mappings > are designated as non-writable, and sealing them will prevent them > from ever becoming writeable. > > Unlike the aforementioned mappings, the uprobe mapping is not > established during program startup. However, its lifetime is the same > as the process's lifetime [2], thus sealable. > > The vdso, vvar, sigpage, and uprobe mappings all invoke the > _install_special_mapping() function. As no other mappings utilize this > function, it is logical to incorporate sealing logic within > _install_special_mapping(). This approach avoids the necessity of > modifying code across various architecture-specific implementations. > > The vsyscall mapping, which has its own initialization function, is > sealed in the XONLY case, it seems to be the most common and secure > case of using vsyscall. > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may > alter the mapping of vdso, vvar, and sigpage during restore > operations. Consequently, this feature cannot be universally enabled > across all systems. To address this, a kernel configuration option has > been introduced to enable or disable this functionality. Note, uprobe > is always sealed and not controlled by this kernel configuration. > > [1] Documentation/userspace-api/mseal.rst > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ > > Signed-off-by: Jeff Xu <jeffxu@chromium.org> > --- > .../admin-guide/kernel-parameters.txt | 10 ++++ > arch/x86/entry/vsyscall/vsyscall_64.c | 9 +++- > fs/exec.c | 53 +++++++++++++++++++ > include/linux/fs.h | 1 + > kernel/events/uprobes.c | 2 +- > mm/mmap.c | 1 + > security/Kconfig | 26 +++++++++ > 7 files changed, 99 insertions(+), 3 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index e7bfe1bde49e..02e5eb23d76f 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1538,6 +1538,16 @@ > Permit 'security.evm' to be updated regardless of > current integrity status. > > + exec.seal_system_mappings = [KNL] > + Format: { never | always } > + Seal system mappings: vdso, vvar, sigpage, uprobes, > + vsyscall. > + This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS_* > + - 'never': never seal system mappings. > + - 'always': always seal system mappings. > + If not specified or invalid, default is the KCONFIG value. > + This option has no effect if CONFIG_64BIT=n > + Any reason for "always"/"never" instead of the more traditional y/n enabled/disabled, etc? Otherwise, this all makes sense to me.
On Wed, Oct 16, 2024 at 2:26 PM Kees Cook <kees@kernel.org> wrote: > > (I don't think this needs "RFC" any more) > > On Mon, Oct 14, 2024 at 09:50:20PM +0000, jeffxu@chromium.org wrote: > > From: Jeff Xu <jeffxu@chromium.org> > > > > Seal vdso, vvar, sigpage, uprobes and vsyscall. > > > > Those mappings are readonly or executable only, sealing can protect > > them from ever changing during the life time of the process. For > > complete descriptions of memory sealing, please see mseal.rst [1]. > > > > System mappings such as vdso, vvar, and sigpage (for arm) are > > generated by the kernel during program initialization. These mappings > > are designated as non-writable, and sealing them will prevent them > > from ever becoming writeable. > > > > Unlike the aforementioned mappings, the uprobe mapping is not > > established during program startup. However, its lifetime is the same > > as the process's lifetime [2], thus sealable. > > > > The vdso, vvar, sigpage, and uprobe mappings all invoke the > > _install_special_mapping() function. As no other mappings utilize this > > function, it is logical to incorporate sealing logic within > > _install_special_mapping(). This approach avoids the necessity of > > modifying code across various architecture-specific implementations. > > > > The vsyscall mapping, which has its own initialization function, is > > sealed in the XONLY case, it seems to be the most common and secure > > case of using vsyscall. > > > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may > > alter the mapping of vdso, vvar, and sigpage during restore > > operations. Consequently, this feature cannot be universally enabled > > across all systems. To address this, a kernel configuration option has > > been introduced to enable or disable this functionality. Note, uprobe > > is always sealed and not controlled by this kernel configuration. > > > > [1] Documentation/userspace-api/mseal.rst > > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ > > > > Signed-off-by: Jeff Xu <jeffxu@chromium.org> > > --- > > .../admin-guide/kernel-parameters.txt | 10 ++++ > > arch/x86/entry/vsyscall/vsyscall_64.c | 9 +++- > > fs/exec.c | 53 +++++++++++++++++++ > > include/linux/fs.h | 1 + > > kernel/events/uprobes.c | 2 +- > > mm/mmap.c | 1 + > > security/Kconfig | 26 +++++++++ > > 7 files changed, 99 insertions(+), 3 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > index e7bfe1bde49e..02e5eb23d76f 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -1538,6 +1538,16 @@ > > Permit 'security.evm' to be updated regardless of > > current integrity status. > > > > + exec.seal_system_mappings = [KNL] > > + Format: { never | always } > > + Seal system mappings: vdso, vvar, sigpage, uprobes, > > + vsyscall. > > + This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS_* > > + - 'never': never seal system mappings. > > + - 'always': always seal system mappings. > > + If not specified or invalid, default is the KCONFIG value. > > + This option has no effect if CONFIG_64BIT=n > > + > > Any reason for "always"/"never" instead of the more traditional y/n > enabled/disabled, etc? > Yes. I like to leave room for future extension, in case someone ever needs a prctl for pre-process opt-in, e.g. Format:{never|prctl|always} > Otherwise, this all makes sense to me. > > -- > Kees Cook
* jeffxu@chromium.org <jeffxu@chromium.org> [241014 17:50]: > From: Jeff Xu <jeffxu@chromium.org> > > Seal vdso, vvar, sigpage, uprobes and vsyscall. > > Those mappings are readonly or executable only, sealing can protect > them from ever changing during the life time of the process. For > complete descriptions of memory sealing, please see mseal.rst [1]. > > System mappings such as vdso, vvar, and sigpage (for arm) are > generated by the kernel during program initialization. These mappings > are designated as non-writable, and sealing them will prevent them > from ever becoming writeable. > > Unlike the aforementioned mappings, the uprobe mapping is not > established during program startup. However, its lifetime is the same > as the process's lifetime [2], thus sealable. > > The vdso, vvar, sigpage, and uprobe mappings all invoke the > _install_special_mapping() function. As no other mappings utilize this > function, it is logical to incorporate sealing logic within > _install_special_mapping(). This approach avoids the necessity of > modifying code across various architecture-specific implementations. > > The vsyscall mapping, which has its own initialization function, is > sealed in the XONLY case, it seems to be the most common and secure > case of using vsyscall. > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may > alter the mapping of vdso, vvar, and sigpage during restore > operations. Consequently, this feature cannot be universally enabled > across all systems. To address this, a kernel configuration option has > been introduced to enable or disable this functionality. Note, uprobe > is always sealed and not controlled by this kernel configuration. Considering that uprobes are always sealed regardless of boot or kernel config, the descriptions below all suffer from inaccurate descriptions. It is also very easy to overlook that you are changing the default vm flags of uprobe, especially considering the text implies that they are not altered if you select "never". > > [1] Documentation/userspace-api/mseal.rst > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ > > Signed-off-by: Jeff Xu <jeffxu@chromium.org> > --- > .../admin-guide/kernel-parameters.txt | 10 ++++ > arch/x86/entry/vsyscall/vsyscall_64.c | 9 +++- > fs/exec.c | 53 +++++++++++++++++++ > include/linux/fs.h | 1 + > kernel/events/uprobes.c | 2 +- > mm/mmap.c | 1 + > security/Kconfig | 26 +++++++++ > 7 files changed, 99 insertions(+), 3 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index e7bfe1bde49e..02e5eb23d76f 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1538,6 +1538,16 @@ > Permit 'security.evm' to be updated regardless of > current integrity status. > > + exec.seal_system_mappings = [KNL] > + Format: { never | always } > + Seal system mappings: vdso, vvar, sigpage, uprobes, > + vsyscall. > + This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS_* > + - 'never': never seal system mappings. Not true, uprobes are sealed when 'never' is selected. > + - 'always': always seal system mappings. > + If not specified or invalid, default is the KCONFIG value. > + This option has no effect if CONFIG_64BIT=n > + > early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier > stages so cover more early boot allocations. > Please note that as side effect some optimizations > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c > index 2fb7d53cf333..20a3000550d2 100644 > --- a/arch/x86/entry/vsyscall/vsyscall_64.c > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c > @@ -32,6 +32,7 @@ > #include <linux/mm_types.h> > #include <linux/syscalls.h> > #include <linux/ratelimit.h> > +#include <linux/fs.h> > > #include <asm/vsyscall.h> > #include <asm/unistd.h> > @@ -366,8 +367,12 @@ void __init map_vsyscall(void) > set_vsyscall_pgtable_user_bits(swapper_pg_dir); > } > > - if (vsyscall_mode == XONLY) > - vm_flags_init(&gate_vma, VM_EXEC); > + if (vsyscall_mode == XONLY) { > + unsigned long vm_flags = VM_EXEC; > + > + update_seal_exec_system_mappings(&vm_flags); > + vm_flags_init(&gate_vma, vm_flags); > + } > > BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != > (unsigned long)VSYSCALL_ADDR); > diff --git a/fs/exec.c b/fs/exec.c > index 77364806b48d..5030879cda47 100644 > --- a/fs/exec.c > +++ b/fs/exec.c Does it make sense for this to live in exec? Couldn't you put it in the mm/mseal.c file? It's vma flags for mappings and you've put it in fs/exec? > @@ -68,6 +68,7 @@ > #include <linux/user_events.h> > #include <linux/rseq.h> > #include <linux/ksm.h> > +#include <linux/fs_parser.h> > > #include <linux/uaccess.h> > #include <asm/mmu_context.h> > @@ -2159,3 +2160,55 @@ fs_initcall(init_fs_exec_sysctls); > #ifdef CONFIG_EXEC_KUNIT_TEST > #include "tests/exec_kunit.c" > #endif > + > +#ifdef CONFIG_64BIT > +/* > + * Kernel cmdline overwrite for CONFIG_SEAL_SYSTEM_MAPPINGS_X > + */ > +enum seal_system_mappings_type { > + SEAL_SYSTEM_MAPPINGS_NEVER, > + SEAL_SYSTEM_MAPPINGS_ALWAYS > +}; > + > +static enum seal_system_mappings_type seal_system_mappings __ro_after_init = > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS_ALWAYS) ? SEAL_SYSTEM_MAPPINGS_ALWAYS : > + SEAL_SYSTEM_MAPPINGS_NEVER; > + > +static const struct constant_table value_table_sys_mapping[] __initconst = { > + { "never", SEAL_SYSTEM_MAPPINGS_NEVER}, > + { "always", SEAL_SYSTEM_MAPPINGS_ALWAYS}, > + { } > +}; > + > +static int __init early_seal_system_mappings_override(char *buf) > +{ > + if (!buf) > + return -EINVAL; > + > + seal_system_mappings = lookup_constant(value_table_sys_mapping, > + buf, seal_system_mappings); > + > + return 0; > +} > + > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override); > + > +static bool seal_system_mappings_enabled(void) > +{ > + if (seal_system_mappings == SEAL_SYSTEM_MAPPINGS_ALWAYS) > + return true; > + > + return false; > +} This function seems unnecessary, it is called from another 3-4 line function only. > + > +void update_seal_exec_system_mappings(unsigned long *vm_flags) > +{ > + if (!(*vm_flags & VM_SEALED) && seal_system_mappings_enabled()) Why !(*vm_flags & VM_SEALED) here? > + *vm_flags |= VM_SEALED; > + > +} Instead of passing a pointer around and checking enabled, why don't you have a function that just returns the VM_SEALED or 0 and just or it into the flags? This seems very heavy for what it does, why did you do it this way? The name is also very long and a bit odd, it could be used for other reasons, but you have _system_mappings on the end, and you use seal but it's mseal (or vm_seal)? Would mseal_flag() work? > +#else > +void update_seal_exec_system_mappings(unsigned long *vm_flags) > +{ > +} > +#endif /* CONFIG_64BIT */ > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 42444ec95c9b..6e44aca4b24b 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h Again, I don't understand why fs.h is the place for mseal definitions? > @@ -3079,6 +3079,7 @@ ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos); > extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *); > extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *); > extern struct file * open_exec(const char *); > +extern void update_seal_exec_system_mappings(unsigned long *vm_flags); We are dropping extern where possible now. > > /* fs/dcache.c -- generic fs support functions */ > extern bool is_subdir(struct dentry *, struct dentry *); > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index c47a0bf25e58..e9876fae8887 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -1506,7 +1506,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) > } > > vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, > - VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, > + VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED, > &xol_mapping); Changing all uprobes seems like something that should probably be mentioned more than just the note at the end of the change log, even if you think it won't have any impact. The note is even hidden at the end of a paragraph. I would go as far as splitting this patch out as its own so that the subject line specifies that all uprobes will be VM_SEALED now. Maybe it's fine but maybe it isn't and you've buried it so that it will be missed by virtually everyone. > if (IS_ERR(vma)) { > ret = PTR_ERR(vma); > diff --git a/mm/mmap.c b/mm/mmap.c > index 57fd5ab2abe7..d4717e34a60d 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2133,6 +2133,7 @@ struct vm_area_struct *_install_special_mapping( > unsigned long addr, unsigned long len, > unsigned long vm_flags, const struct vm_special_mapping *spec) > { > + update_seal_exec_system_mappings(&vm_flags); > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec, > &special_mapping_vmops); If you were to return a flag, you could change the vm_flags argument to vm_flags | mseal_flag() > } > diff --git a/security/Kconfig b/security/Kconfig > index 28e685f53bd1..4ec8045339c3 100644 > --- a/security/Kconfig > +++ b/security/Kconfig > @@ -51,6 +51,32 @@ config PROC_MEM_NO_FORCE > > endchoice > > +choice > + prompt "Seal system mappings" > + default SEAL_SYSTEM_MAPPINGS_NEVER > + help > + Seal system mappings such as vdso, vvar, sigpage, uprobes and > + vsyscall. > + Note: kernel command line exec.seal_system_mappings overwrites this. Not true, uprobes are always sealed. > + > +config SEAL_SYSTEM_MAPPINGS_NEVER > + bool "Traditional behavior - not sealed" Not true, uprobes are sealed. > + help > + Do not seal system mappings. > + This is default. > + > +config SEAL_SYSTEM_MAPPINGS_ALWAYS > + bool "Always seal system mappings" > + depends on 64BIT > + depends on !CHECKPOINT_RESTORE > + help > + Seal system mappings such as vdso, vvar, sigpage, uprobes and > + vsyscall. > + Note: CHECKPOINT_RESTORE might relocate vdso mapping during restore, > + and remap will fail if the mapping is sealed, therefore > + !CHECKPOINT_RESTORE is added as dependency. > +endchoice > + > config SECURITY > bool "Enable different security models" > depends on SYSFS > -- > 2.47.0.rc1.288.g06298d1525-goog >
On Wed, Oct 16, 2024 at 6:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > + exec.seal_system_mappings = [KNL] > > + Format: { never | always } > > + Seal system mappings: vdso, vvar, sigpage, uprobes, > > + vsyscall. > > + This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS_* > > + - 'never': never seal system mappings. > > Not true, uprobes are sealed when 'never' is selected. > Thanks. I forgot to uprobes from the description in Kconfig and kernel-parameters.txt, will update. > > + - 'always': always seal system mappings. > > + If not specified or invalid, default is the KCONFIG value. > > + This option has no effect if CONFIG_64BIT=n > > + > > early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier > > stages so cover more early boot allocations. > > Please note that as side effect some optimizations > > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c > > index 2fb7d53cf333..20a3000550d2 100644 > > --- a/arch/x86/entry/vsyscall/vsyscall_64.c > > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c > > @@ -32,6 +32,7 @@ > > #include <linux/mm_types.h> > > #include <linux/syscalls.h> > > #include <linux/ratelimit.h> > > +#include <linux/fs.h> > > > > #include <asm/vsyscall.h> > > #include <asm/unistd.h> > > @@ -366,8 +367,12 @@ void __init map_vsyscall(void) > > set_vsyscall_pgtable_user_bits(swapper_pg_dir); > > } > > > > - if (vsyscall_mode == XONLY) > > - vm_flags_init(&gate_vma, VM_EXEC); > > + if (vsyscall_mode == XONLY) { > > + unsigned long vm_flags = VM_EXEC; > > + > > + update_seal_exec_system_mappings(&vm_flags); > > + vm_flags_init(&gate_vma, vm_flags); > > + } > > > > BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != > > (unsigned long)VSYSCALL_ADDR); > > diff --git a/fs/exec.c b/fs/exec.c > > index 77364806b48d..5030879cda47 100644 > > --- a/fs/exec.c > > +++ b/fs/exec.c > > Does it make sense for this to live in exec? Couldn't you put it in the > mm/mseal.c file? It's vma flags for mappings and you've put it in > fs/exec? > If you are referring to utilities related to kernel cmdline, they should be in this file. > > @@ -68,6 +68,7 @@ > > #include <linux/user_events.h> > > #include <linux/rseq.h> > > #include <linux/ksm.h> > > +#include <linux/fs_parser.h> > > > > #include <linux/uaccess.h> > > #include <asm/mmu_context.h> > > @@ -2159,3 +2160,55 @@ fs_initcall(init_fs_exec_sysctls); > > #ifdef CONFIG_EXEC_KUNIT_TEST > > #include "tests/exec_kunit.c" > > #endif > > + > > +#ifdef CONFIG_64BIT > > +/* > > + * Kernel cmdline overwrite for CONFIG_SEAL_SYSTEM_MAPPINGS_X > > + */ > > +enum seal_system_mappings_type { > > + SEAL_SYSTEM_MAPPINGS_NEVER, > > + SEAL_SYSTEM_MAPPINGS_ALWAYS > > +}; > > + > > +static enum seal_system_mappings_type seal_system_mappings __ro_after_init = > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS_ALWAYS) ? SEAL_SYSTEM_MAPPINGS_ALWAYS : > > + SEAL_SYSTEM_MAPPINGS_NEVER; > > + > > +static const struct constant_table value_table_sys_mapping[] __initconst = { > > + { "never", SEAL_SYSTEM_MAPPINGS_NEVER}, > > + { "always", SEAL_SYSTEM_MAPPINGS_ALWAYS}, > > + { } > > +}; > > + > > +static int __init early_seal_system_mappings_override(char *buf) > > +{ > > + if (!buf) > > + return -EINVAL; > > + > > + seal_system_mappings = lookup_constant(value_table_sys_mapping, > > + buf, seal_system_mappings); > > + > > + return 0; > > +} > > + > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override); > > + > > +static bool seal_system_mappings_enabled(void) > > +{ > > + if (seal_system_mappings == SEAL_SYSTEM_MAPPINGS_ALWAYS) > > + return true; > > + > > + return false; > > +} > > This function seems unnecessary, it is called from another 3-4 line > function only. > It is more readable this way. > > + > > +void update_seal_exec_system_mappings(unsigned long *vm_flags) > > +{ > > + if (!(*vm_flags & VM_SEALED) && seal_system_mappings_enabled()) > > Why !(*vm_flags & VM_SEALED) here? > If vm_flags is already sealed, then there is no need to check seal_system_mappings_enabled. > > + *vm_flags |= VM_SEALED; > > + > > +} > > Instead of passing a pointer around and checking enabled, why don't you > have a function that just returns the VM_SEALED or 0 and just or it into > the flags? This seems very heavy for what it does, why did you do it > this way? > Why is that heavy ? passing a pointer for updating variables is natural. > The name is also very long and a bit odd, it could be used for other > reasons, but you have _system_mappings on the end, and you use seal but > it's mseal (or vm_seal)? Would mseal_flag() work? > It could be longer :-) it means update_sealing_flag_for_executable_system_mappings. mseal_flag is too short and not descriptive. > > +#else > > +void update_seal_exec_system_mappings(unsigned long *vm_flags) > > +{ > > +} > > +#endif /* CONFIG_64BIT */ > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 42444ec95c9b..6e44aca4b24b 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > Again, I don't understand why fs.h is the place for mseal definitions? > include/linux/fs.h contains other exec related function signatures. So it is better to keep them at the same header. > > @@ -3079,6 +3079,7 @@ ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos); > > extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *); > > extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *); > > extern struct file * open_exec(const char *); > > +extern void update_seal_exec_system_mappings(unsigned long *vm_flags); > > We are dropping extern where possible now. > extern can be dropped, it appears not causing link error. > > > > /* fs/dcache.c -- generic fs support functions */ > > extern bool is_subdir(struct dentry *, struct dentry *); > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index c47a0bf25e58..e9876fae8887 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -1506,7 +1506,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) > > } > > > > vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, > > - VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, > > + VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED, > > &xol_mapping); > > Changing all uprobes seems like something that should probably be > mentioned more than just the note at the end of the change log, even if > you think it won't have any impact. The note is even hidden at the end > of a paragraph. > > I would go as far as splitting this patch out as its own so that the > subject line specifies that all uprobes will be VM_SEALED now. > > Maybe it's fine but maybe it isn't and you've buried it so that it will > be missed by virtually everyone. > I will add "It is sealed from creation." in the above "uprobe" section. > > > if (IS_ERR(vma)) { > > ret = PTR_ERR(vma); > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 57fd5ab2abe7..d4717e34a60d 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -2133,6 +2133,7 @@ struct vm_area_struct *_install_special_mapping( > > unsigned long addr, unsigned long len, > > unsigned long vm_flags, const struct vm_special_mapping *spec) > > { > > + update_seal_exec_system_mappings(&vm_flags); > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec, > > &special_mapping_vmops); > > If you were to return a flag, you could change the vm_flags argument to > vm_flags | mseal_flag() > passing pointer seems to be the most efficient way. Thanks -Jeff
On Wed, Oct 16, 2024 at 3:06 PM Jeff Xu <jeffxu@chromium.org> wrote: > > On Wed, Oct 16, 2024 at 2:26 PM Kees Cook <kees@kernel.org> wrote: > > > > (I don't think this needs "RFC" any more) > > > > On Mon, Oct 14, 2024 at 09:50:20PM +0000, jeffxu@chromium.org wrote: > > > From: Jeff Xu <jeffxu@chromium.org> > > > > > > Seal vdso, vvar, sigpage, uprobes and vsyscall. > > > > > > Those mappings are readonly or executable only, sealing can protect > > > them from ever changing during the life time of the process. For > > > complete descriptions of memory sealing, please see mseal.rst [1]. > > > > > > System mappings such as vdso, vvar, and sigpage (for arm) are > > > generated by the kernel during program initialization. These mappings > > > are designated as non-writable, and sealing them will prevent them > > > from ever becoming writeable. > > > > > > Unlike the aforementioned mappings, the uprobe mapping is not > > > established during program startup. However, its lifetime is the same > > > as the process's lifetime [2], thus sealable. > > > > > > The vdso, vvar, sigpage, and uprobe mappings all invoke the > > > _install_special_mapping() function. As no other mappings utilize this > > > function, it is logical to incorporate sealing logic within > > > _install_special_mapping(). This approach avoids the necessity of > > > modifying code across various architecture-specific implementations. > > > > > > The vsyscall mapping, which has its own initialization function, is > > > sealed in the XONLY case, it seems to be the most common and secure > > > case of using vsyscall. > > > > > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may > > > alter the mapping of vdso, vvar, and sigpage during restore > > > operations. Consequently, this feature cannot be universally enabled > > > across all systems. To address this, a kernel configuration option has > > > been introduced to enable or disable this functionality. Note, uprobe > > > is always sealed and not controlled by this kernel configuration. > > > > > > [1] Documentation/userspace-api/mseal.rst > > > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ > > > > > > Signed-off-by: Jeff Xu <jeffxu@chromium.org> > > > --- > > > .../admin-guide/kernel-parameters.txt | 10 ++++ > > > arch/x86/entry/vsyscall/vsyscall_64.c | 9 +++- > > > fs/exec.c | 53 +++++++++++++++++++ > > > include/linux/fs.h | 1 + > > > kernel/events/uprobes.c | 2 +- > > > mm/mmap.c | 1 + > > > security/Kconfig | 26 +++++++++ > > > 7 files changed, 99 insertions(+), 3 deletions(-) > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > index e7bfe1bde49e..02e5eb23d76f 100644 > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > @@ -1538,6 +1538,16 @@ > > > Permit 'security.evm' to be updated regardless of > > > current integrity status. > > > > > > + exec.seal_system_mappings = [KNL] > > > + Format: { never | always } > > > + Seal system mappings: vdso, vvar, sigpage, uprobes, > > > + vsyscall. > > > + This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS_* > > > + - 'never': never seal system mappings. > > > + - 'always': always seal system mappings. > > > + If not specified or invalid, default is the KCONFIG value. > > > + This option has no effect if CONFIG_64BIT=n > > > + > > > > Any reason for "always"/"never" instead of the more traditional y/n > > enabled/disabled, etc? > > > Yes. I like to leave room for future extension, in case someone ever > needs a prctl for pre-process opt-in, e.g. > Format:{never|prctl|always} > I copied the pattern from: proc_mem.force_override= [KNL] Format: {always | ptrace | never} > > Otherwise, this all makes sense to me. > > > > -- > > Kees Cook
On 10/16, Jeff Xu wrote: > > On Wed, Oct 16, 2024 at 6:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > + exec.seal_system_mappings = [KNL] > > > + Format: { never | always } > > > + Seal system mappings: vdso, vvar, sigpage, uprobes, > > > + vsyscall. > > > + This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS_* > > > + - 'never': never seal system mappings. > > > > Not true, uprobes are sealed when 'never' is selected. > > > Thanks. I forgot to uprobes from the description in Kconfig and > kernel-parameters.txt, will update. Jeff, I am sorry for confusion. No need to make uprobes "special" and complicate the logic/documentation. I just meant that, unlike vdso, it is always safe/good to mseal the "[uprobes]" vma, regardless of config/boot options. Please do what you think is right, I am fine either way. Oleg.
> On Wed, Oct 16, 2024 at 6:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: ... > > > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c > > > index 2fb7d53cf333..20a3000550d2 100644 > > > --- a/arch/x86/entry/vsyscall/vsyscall_64.c > > > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c > > > @@ -32,6 +32,7 @@ > > > #include <linux/mm_types.h> > > > #include <linux/syscalls.h> > > > #include <linux/ratelimit.h> > > > +#include <linux/fs.h> > > > > > > #include <asm/vsyscall.h> > > > #include <asm/unistd.h> > > > @@ -366,8 +367,12 @@ void __init map_vsyscall(void) > > > set_vsyscall_pgtable_user_bits(swapper_pg_dir); > > > } > > > > > > - if (vsyscall_mode == XONLY) > > > - vm_flags_init(&gate_vma, VM_EXEC); > > > + if (vsyscall_mode == XONLY) { > > > + unsigned long vm_flags = VM_EXEC; > > > + > > > + update_seal_exec_system_mappings(&vm_flags); > > > + vm_flags_init(&gate_vma, vm_flags); > > > + } > > > > > > BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != > > > (unsigned long)VSYSCALL_ADDR); > > > diff --git a/fs/exec.c b/fs/exec.c > > > index 77364806b48d..5030879cda47 100644 > > > --- a/fs/exec.c > > > +++ b/fs/exec.c > > > > Does it make sense for this to live in exec? Couldn't you put it in the > > mm/mseal.c file? It's vma flags for mappings and you've put it in > > fs/exec? > > > If you are referring to utilities related to kernel cmdline, they > should be in this file. You created a wrapper for the command line, but then included the user in this file as well. hugetlbfs reads the command line as well, in cmdline_parse_hugetlb_cma. That parser lives with the rest of the hugetlb code in hugetlb.c I think this has to do with your view as this is an exec thing, where I think it's an mm thing. My arguments are that you are directly adding flags to vmas and it's dealing with mseal which has memory in the name with the file living in the mm/ directory. If I wanted to know what's using mseal, I'd start there and totally miss what you are adding here. Besides applying a vma flag to exec mappings, why do you feel like it belongs in fs/ ? > > > > @@ -68,6 +68,7 @@ > > > #include <linux/user_events.h> > > > #include <linux/rseq.h> > > > #include <linux/ksm.h> > > > +#include <linux/fs_parser.h> > > > > > > #include <linux/uaccess.h> > > > #include <asm/mmu_context.h> > > > @@ -2159,3 +2160,55 @@ fs_initcall(init_fs_exec_sysctls); > > > #ifdef CONFIG_EXEC_KUNIT_TEST > > > #include "tests/exec_kunit.c" > > > #endif > > > + > > > +#ifdef CONFIG_64BIT > > > +/* > > > + * Kernel cmdline overwrite for CONFIG_SEAL_SYSTEM_MAPPINGS_X > > > + */ > > > +enum seal_system_mappings_type { > > > + SEAL_SYSTEM_MAPPINGS_NEVER, > > > + SEAL_SYSTEM_MAPPINGS_ALWAYS > > > +}; > > > + > > > +static enum seal_system_mappings_type seal_system_mappings __ro_after_init = > > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS_ALWAYS) ? SEAL_SYSTEM_MAPPINGS_ALWAYS : > > > + SEAL_SYSTEM_MAPPINGS_NEVER; > > > + > > > +static const struct constant_table value_table_sys_mapping[] __initconst = { > > > + { "never", SEAL_SYSTEM_MAPPINGS_NEVER}, > > > + { "always", SEAL_SYSTEM_MAPPINGS_ALWAYS}, > > > + { } > > > +}; > > > + > > > +static int __init early_seal_system_mappings_override(char *buf) > > > +{ > > > + if (!buf) > > > + return -EINVAL; > > > + > > > + seal_system_mappings = lookup_constant(value_table_sys_mapping, > > > + buf, seal_system_mappings); > > > + > > > + return 0; > > > +} > > > + > > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override); > > > + > > > +static bool seal_system_mappings_enabled(void) > > > +{ > > > + if (seal_system_mappings == SEAL_SYSTEM_MAPPINGS_ALWAYS) > > > + return true; > > > + > > > + return false; > > > +} > > > > This function seems unnecessary, it is called from another 3-4 line > > function only. > > > It is more readable this way. It really isn't - you don't have to cram everything into one if statement. > > > > + > > > +void update_seal_exec_system_mappings(unsigned long *vm_flags) > > > +{ > > > + if (!(*vm_flags & VM_SEALED) && seal_system_mappings_enabled()) > > > > Why !(*vm_flags & VM_SEALED) here? > > > If vm_flags is already sealed, then there is no need to check > seal_system_mappings_enabled. > > > > + *vm_flags |= VM_SEALED; > > > + > > > +} > > > > Instead of passing a pointer around and checking enabled, why don't you > > have a function that just returns the VM_SEALED or 0 and just or it into > > the flags? This seems very heavy for what it does, why did you do it > > this way? > > > Why is that heavy ? passing a pointer for updating variables is natural. What you have here will pass a pointer to a function, dereference the pointer and potentially call another function to check seal system mappings is enabled to dereference a pointer to or a bit. I would hope the compiler improves this, but that depends on the compiler. What you *could* do is call a function to check the seal system mappings is enabled and return the bit to set. The caller would just need to or the bit on value without dereferencing the pointer. We may be able to even inline the function you call. So I am suggesting removing two dereferences and a function call (maybe two?) while making the code more compact and readable. This is what I mean by heavy. > > > The name is also very long and a bit odd, it could be used for other > > reasons, but you have _system_mappings on the end, and you use seal but > > it's mseal (or vm_seal)? Would mseal_flag() work? > > > It could be longer :-) > it means update_sealing_flag_for_executable_system_mappings. > mseal_flag is too short and not descriptive. mseal_exec_flags() ? > > > > +#else > > > +void update_seal_exec_system_mappings(unsigned long *vm_flags) > > > +{ > > > +} > > > +#endif /* CONFIG_64BIT */ > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > index 42444ec95c9b..6e44aca4b24b 100644 > > > --- a/include/linux/fs.h > > > +++ b/include/linux/fs.h > > > > Again, I don't understand why fs.h is the place for mseal definitions? > > > include/linux/fs.h contains other exec related function signatures. So > it is better to keep them at the same header. I don't see this as an exec related function, it's changing vma flags. Besides this new function, the fs/exec.c deals with the stack flags. Well, this header has VM_MAY{READ/WRITE/EXEC} for nommu - but I don't think we can really count those here and they are used to translate things anyways. Recently, we have been trying to modularize the vma code for better testing. Spreading the code across many files will make that testing harder to do in isolation. > > > > @@ -3079,6 +3079,7 @@ ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos); > > > extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *); > > > extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *); > > > extern struct file * open_exec(const char *); > > > +extern void update_seal_exec_system_mappings(unsigned long *vm_flags); > > > > We are dropping extern where possible now. > > > extern can be dropped, it appears not causing link error. > > > > > > /* fs/dcache.c -- generic fs support functions */ > > > extern bool is_subdir(struct dentry *, struct dentry *); > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > index c47a0bf25e58..e9876fae8887 100644 > > > --- a/kernel/events/uprobes.c > > > +++ b/kernel/events/uprobes.c > > > @@ -1506,7 +1506,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) > > > } > > > > > > vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, > > > - VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, > > > + VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED, > > > &xol_mapping); > > > > Changing all uprobes seems like something that should probably be > > mentioned more than just the note at the end of the change log, even if > > you think it won't have any impact. The note is even hidden at the end > > of a paragraph. > > > > I would go as far as splitting this patch out as its own so that the > > subject line specifies that all uprobes will be VM_SEALED now. > > > > Maybe it's fine but maybe it isn't and you've buried it so that it will > > be missed by virtually everyone. > > > I will add "It is sealed from creation." in the above "uprobe" section. That in inadequate for what you are changing. Doing it this way risks breaking things because no one will notice. > > > > > > if (IS_ERR(vma)) { > > > ret = PTR_ERR(vma); > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > index 57fd5ab2abe7..d4717e34a60d 100644 > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -2133,6 +2133,7 @@ struct vm_area_struct *_install_special_mapping( > > > unsigned long addr, unsigned long len, > > > unsigned long vm_flags, const struct vm_special_mapping *spec) > > > { > > > + update_seal_exec_system_mappings(&vm_flags); > > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec, > > > &special_mapping_vmops); > > > > If you were to return a flag, you could change the vm_flags argument to > > vm_flags | mseal_flag() > > > passing pointer seems to be the most efficient way. I disagree. Here is the godbolt.org output for gcc x86-64 14.2 of your code (with some added #defines to make it compile) seal_system_mappings: .long 1 seal_system_mappings_enabled: push rbp mov rbp, rsp mov eax, DWORD PTR seal_system_mappings[rip] cmp eax, 1 jne .L2 mov eax, 1 jmp .L3 .L2: mov eax, 0 .L3: pop rbp ret update_seal_exec_system_mappings: push rbp mov rbp, rsp sub rsp, 8 mov QWORD PTR [rbp-8], rdi mov rax, QWORD PTR [rbp-8] mov rax, QWORD PTR [rax] and eax, 2 test rax, rax jne .L6 call seal_system_mappings_enabled test al, al je .L6 mov rax, QWORD PTR [rbp-8] mov rax, QWORD PTR [rax] or rax, 2 mov rdx, rax mov rax, QWORD PTR [rbp-8] mov QWORD PTR [rax], rdx .L6: nop leave ret main: push rbp mov rbp, rsp sub rsp, 16 mov QWORD PTR [rbp-8], 0 lea rax, [rbp-8] mov rdi, rax call update_seal_exec_system_mappings mov rax, QWORD PTR [rbp-8] leave ret ----- 48 lines ----- Here is what I am suggesting to do with replacing the passing of a pointer with a concise "vm_flags | mseal_exec_flags()" (with the same added #defines to make it compile) seal_system_mappings: .long 1 mseal_exec_flags: push rbp mov rbp, rsp mov eax, DWORD PTR seal_system_mappings[rip] cmp eax, 1 jne .L2 mov eax, 2 jmp .L3 .L2: mov eax, 0 .L3: pop rbp ret main: push rbp mov rbp, rsp sub rsp, 16 mov QWORD PTR [rbp-8], 0 call mseal_exec_flags mov edx, eax mov rax, QWORD PTR [rbp-8] or eax, edx leave ret ----- 26 lines ----- So as you can see, there are less instructions in my version; there are 47.92% less lines of assembly. Your vsyscall version is even worse. It is very tiresome to do code inspections and be told flat-out "no" repetitively. Liam
On Thu, Oct 17, 2024 at 1:38 AM Oleg Nesterov <oleg@redhat.com> wrote: > > On 10/16, Jeff Xu wrote: > > > > On Wed, Oct 16, 2024 at 6:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > > + exec.seal_system_mappings = [KNL] > > > > + Format: { never | always } > > > > + Seal system mappings: vdso, vvar, sigpage, uprobes, > > > > + vsyscall. > > > > + This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS_* > > > > + - 'never': never seal system mappings. > > > > > > Not true, uprobes are sealed when 'never' is selected. > > > > > Thanks. I forgot to uprobes from the description in Kconfig and > > kernel-parameters.txt, will update. > > Jeff, I am sorry for confusion. > > No need to make uprobes "special" and complicate the logic/documentation. > > I just meant that, unlike vdso, it is always safe/good to mseal the "[uprobes]" > vma, regardless of config/boot options. > > Please do what you think is right, I am fine either way. > OK, in that case, V1 is a better approach, using the same config to control all system mappings. I will send V1 to revert that. Thanks for clarifying. -Jeff > Oleg. >
Hi Liam On Thu, Oct 17, 2024 at 9:01 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > Does it make sense for this to live in exec? Couldn't you put it in the > > > mm/mseal.c file? It's vma flags for mappings and you've put it in > > > fs/exec? > > > > > If you are referring to utilities related to kernel cmdline, they > > should be in this file. > > You created a wrapper for the command line, but then included the user > in this file as well. > > hugetlbfs reads the command line as well, in cmdline_parse_hugetlb_cma. > That parser lives with the rest of the hugetlb code in hugetlb.c > > I think this has to do with your view as this is an exec thing, where I > think it's an mm thing. My arguments are that you are directly adding > flags to vmas and it's dealing with mseal which has memory in the name > with the file living in the mm/ directory. If I wanted to know what's > using mseal, I'd start there and totally miss what you are adding here. > > Besides applying a vma flag to exec mappings, why do you feel like it > belongs in fs/ ? > The vdso/vvar/stack/heap alike are type of mappings belonging to processes, and are created during execve() syscall which is in fs/exec.c. mm/mseal.c provides core memory sealing functionality and exec.c uses it. IMO, it is better to keep the provider (mm/mseal.c) and consumer (executable) separate. To make modulization better, I can do below adjustment: if (seal_system_mapping_enabled()) <-- implemented by fs/exec.c add_vm_sealed() <- keep in include/linux/mm.h However, if you have a strong opinion on this, I could move the parsing logic to mm/mseal. > > > > +void update_seal_exec_system_mappings(unsigned long * > > > The name is also very long and a bit odd, it could be used for other > > > reasons, but you have _system_mappings on the end, and you use seal but > > > it's mseal (or vm_seal)? Would mseal_flag() work? > > > > > It could be longer :-) > > it means update_sealing_flag_for_executable_system_mappings. > > mseal_flag is too short and not descriptive. > > mseal_exec_flags() ? > It needs to be more descriptive because there are also stacks and heaps to be sealed. I suggest to use below name to make it shorter: if (seal_system_mapping_enabled()) add_vm_sealed() > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > index 57fd5ab2abe7..d4717e34a60d 100644 > > > > --- a/mm/mmap.c > > > > +++ b/mm/mmap.c > > > > @@ -2133,6 +2133,7 @@ struct vm_area_struct *_install_special_mapping( > > > > unsigned long addr, unsigned long len, > > > > unsigned long vm_flags, const struct vm_special_mapping *spec) > > > > { > > > > + update_seal_exec_system_mappings(&vm_flags); > > > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec, > > > > &special_mapping_vmops); > > > > > > If you were to return a flag, you could change the vm_flags argument to > > > vm_flags | mseal_flag() > > > > > passing pointer seems to be the most efficient way. > > I disagree. Here is the godbolt.org output for gcc x86-64 14.2 of your > code (with some added #defines to make it compile) > > seal_system_mappings: > .long 1 > seal_system_mappings_enabled: > push rbp > mov rbp, rsp > mov eax, DWORD PTR seal_system_mappings[rip] > cmp eax, 1 > jne .L2 > mov eax, 1 > jmp .L3 > .L2: > mov eax, 0 > .L3: > pop rbp > ret > update_seal_exec_system_mappings: > push rbp > mov rbp, rsp > sub rsp, 8 > mov QWORD PTR [rbp-8], rdi > mov rax, QWORD PTR [rbp-8] > mov rax, QWORD PTR [rax] > and eax, 2 > test rax, rax > jne .L6 > call seal_system_mappings_enabled > test al, al > je .L6 > mov rax, QWORD PTR [rbp-8] > mov rax, QWORD PTR [rax] > or rax, 2 > mov rdx, rax > mov rax, QWORD PTR [rbp-8] > mov QWORD PTR [rax], rdx > .L6: > nop > leave > ret > main: > push rbp > mov rbp, rsp > sub rsp, 16 > mov QWORD PTR [rbp-8], 0 > lea rax, [rbp-8] > mov rdi, rax > call update_seal_exec_system_mappings > mov rax, QWORD PTR [rbp-8] > leave > ret > > ----- 48 lines ----- > Here is what I am suggesting to do with replacing the passing of a > pointer with a concise "vm_flags | mseal_exec_flags()" (with the same > added #defines to make it compile) > > seal_system_mappings: > .long 1 > mseal_exec_flags: > push rbp > mov rbp, rsp > mov eax, DWORD PTR seal_system_mappings[rip] > cmp eax, 1 > jne .L2 > mov eax, 2 > jmp .L3 > .L2: > mov eax, 0 > .L3: > pop rbp > ret > main: > push rbp > mov rbp, rsp > sub rsp, 16 > mov QWORD PTR [rbp-8], 0 > call mseal_exec_flags > mov edx, eax > mov rax, QWORD PTR [rbp-8] > or eax, edx > leave > ret > > ----- 26 lines ----- > > So as you can see, there are less instructions in my version; there are > 47.92% less lines of assembly. > vm_flags already run out of space in 32 bit, sooner or later we will need to change that to *** a struct ***, passing address will be becoming necessary with struct. Since this is not a performance sensitive code path, 3 or 4 times during execve(), I think it would be good to start here. -Jeff
* Jeff Xu <jeffxu@chromium.org> [241111 14:10]: > Hi Liam > > On Thu, Oct 17, 2024 at 9:01 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > Does it make sense for this to live in exec? Couldn't you put it in the > > > > mm/mseal.c file? It's vma flags for mappings and you've put it in > > > > fs/exec? > > > > > > > If you are referring to utilities related to kernel cmdline, they > > > should be in this file. > > > > You created a wrapper for the command line, but then included the user > > in this file as well. > > > > hugetlbfs reads the command line as well, in cmdline_parse_hugetlb_cma. > > That parser lives with the rest of the hugetlb code in hugetlb.c > > > > I think this has to do with your view as this is an exec thing, where I > > think it's an mm thing. My arguments are that you are directly adding > > flags to vmas and it's dealing with mseal which has memory in the name > > with the file living in the mm/ directory. If I wanted to know what's > > using mseal, I'd start there and totally miss what you are adding here. > > > > Besides applying a vma flag to exec mappings, why do you feel like it > > belongs in fs/ ? > > > The vdso/vvar/stack/heap alike are type of mappings belonging to > processes, and are created during execve() syscall which is in > fs/exec.c. > > mm/mseal.c provides core memory sealing functionality and exec.c uses > it. IMO, it is better to keep the provider (mm/mseal.c) and consumer > (executable) separate. IMO, mseal should not be sitting in its own file since it's just a vma flag. The entire feature is to do with vma and vma flag checking. It should at least be globbed together in one file, as much as possible. The way this and the help text is written means there is only the VM_SEALED flag that links this to the actual feature name. And with all the abstractions, that flag is 2-3 functions deep across 2-3 files. In fact, there isn't a dependency for enabling mseal in the kconfig? Isn't that needed? Please fix this. > > To make modulization better, I can do below adjustment: > if (seal_system_mapping_enabled()) <-- implemented by fs/exec.c > add_vm_sealed() <- keep in include/linux/mm.h > > However, if you have a strong opinion on this, I could move the > parsing logic to mm/mseal. Yes, please move the parsing logic together with the other mseal code. > > > > > > +void update_seal_exec_system_mappings(unsigned long * > > > > The name is also very long and a bit odd, it could be used for other > > > > reasons, but you have _system_mappings on the end, and you use seal but > > > > it's mseal (or vm_seal)? Would mseal_flag() work? > > > > > > > It could be longer :-) > > > it means update_sealing_flag_for_executable_system_mappings. > > > mseal_flag is too short and not descriptive. > > > > mseal_exec_flags() ? > > > It needs to be more descriptive because there are also stacks and > heaps to be sealed. Stopping vma modifications to vmas that exist specifically to be dynamic does not sound wise. > I suggest to use below name to make it shorter: > > if (seal_system_mapping_enabled()) > add_vm_sealed() I am not sure what one you are renaming or what these functions would do. I think you need to look at your overall design and try to fit it into what exists instead of putting a call in a wrapper function (_install_special_mapping) to alter an argument. > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > > index 57fd5ab2abe7..d4717e34a60d 100644 > > > > > --- a/mm/mmap.c > > > > > +++ b/mm/mmap.c > > > > > @@ -2133,6 +2133,7 @@ struct vm_area_struct *_install_special_mapping( > > > > > unsigned long addr, unsigned long len, > > > > > unsigned long vm_flags, const struct vm_special_mapping *spec) > > > > > { > > > > > + update_seal_exec_system_mappings(&vm_flags); > > > > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec, > > > > > &special_mapping_vmops); > > > > > > > > If you were to return a flag, you could change the vm_flags argument to > > > > vm_flags | mseal_flag() > > > > > > > passing pointer seems to be the most efficient way. > > > > I disagree. Here is the godbolt.org output for gcc x86-64 14.2 of your > > code (with some added #defines to make it compile) > > > > seal_system_mappings: > > .long 1 > > seal_system_mappings_enabled: > > push rbp > > mov rbp, rsp > > mov eax, DWORD PTR seal_system_mappings[rip] > > cmp eax, 1 > > jne .L2 > > mov eax, 1 > > jmp .L3 > > .L2: > > mov eax, 0 > > .L3: > > pop rbp > > ret > > update_seal_exec_system_mappings: > > push rbp > > mov rbp, rsp > > sub rsp, 8 > > mov QWORD PTR [rbp-8], rdi > > mov rax, QWORD PTR [rbp-8] > > mov rax, QWORD PTR [rax] > > and eax, 2 > > test rax, rax > > jne .L6 > > call seal_system_mappings_enabled > > test al, al > > je .L6 > > mov rax, QWORD PTR [rbp-8] > > mov rax, QWORD PTR [rax] > > or rax, 2 > > mov rdx, rax > > mov rax, QWORD PTR [rbp-8] > > mov QWORD PTR [rax], rdx > > .L6: > > nop > > leave > > ret > > main: > > push rbp > > mov rbp, rsp > > sub rsp, 16 > > mov QWORD PTR [rbp-8], 0 > > lea rax, [rbp-8] > > mov rdi, rax > > call update_seal_exec_system_mappings > > mov rax, QWORD PTR [rbp-8] > > leave > > ret > > > > ----- 48 lines ----- > > Here is what I am suggesting to do with replacing the passing of a > > pointer with a concise "vm_flags | mseal_exec_flags()" (with the same > > added #defines to make it compile) > > > > seal_system_mappings: > > .long 1 > > mseal_exec_flags: > > push rbp > > mov rbp, rsp > > mov eax, DWORD PTR seal_system_mappings[rip] > > cmp eax, 1 > > jne .L2 > > mov eax, 2 > > jmp .L3 > > .L2: > > mov eax, 0 > > .L3: > > pop rbp > > ret > > main: > > push rbp > > mov rbp, rsp > > sub rsp, 16 > > mov QWORD PTR [rbp-8], 0 > > call mseal_exec_flags > > mov edx, eax > > mov rax, QWORD PTR [rbp-8] > > or eax, edx > > leave > > ret > > > > ----- 26 lines ----- > > > > So as you can see, there are less instructions in my version; there are > > 47.92% less lines of assembly. > > > vm_flags already run out of space in 32 bit, sooner or later we will > need to change that to *** a struct ***, passing address will be > becoming necessary with struct. Since this is not a performance > sensitive code path, 3 or 4 times during execve(), I think it would be > good to start here. No. I have pointed out why this is not 'the most efficient way' and you are now switching your argument to 'this will be needed in the future'. Please fix this. Liam
On Mon, Nov 11, 2024 at 2:35 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > To make modulization better, I can do below adjustment: > > if (seal_system_mapping_enabled()) <-- implemented by fs/exec.c > > add_vm_sealed() <- keep in include/linux/mm.h > > > > However, if you have a strong opinion on this, I could move the > > parsing logic to mm/mseal. > > Yes, please move the parsing logic together with the other mseal code. > Sure > > I am not sure what one you are renaming or what these functions would > do. I think you need to look at your overall design and try to fit it > into what exists instead of putting a call in a wrapper function > (_install_special_mapping) to alter an argument. > Please see V3. > > > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > > > index 57fd5ab2abe7..d4717e34a60d 100644 > > > > > > --- a/mm/mmap.c > > > > > > +++ b/mm/mmap.c > > > > > > @@ -2133,6 +2133,7 @@ struct vm_area_struct *_install_special_mapping( > > > > > > unsigned long addr, unsigned long len, > > > > > > unsigned long vm_flags, const struct vm_special_mapping *spec) > > > > > > { > > > > > > + update_seal_exec_system_mappings(&vm_flags); > > > > > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec, > > > > > > &special_mapping_vmops); > > > > > > > > > > If you were to return a flag, you could change the vm_flags argument to > > > > > vm_flags | mseal_flag() > > > > > > > > > passing pointer seems to be the most efficient way. > > > > > > I disagree. Here is the godbolt.org output for gcc x86-64 14.2 of your > > > code (with some added #defines to make it compile) > > > > > > seal_system_mappings: > > > .long 1 > > > seal_system_mappings_enabled: > > > push rbp > > > mov rbp, rsp > > > mov eax, DWORD PTR seal_system_mappings[rip] > > > cmp eax, 1 > > > jne .L2 > > > mov eax, 1 > > > jmp .L3 > > > .L2: > > > mov eax, 0 > > > .L3: > > > pop rbp > > > ret > > > update_seal_exec_system_mappings: > > > push rbp > > > mov rbp, rsp > > > sub rsp, 8 > > > mov QWORD PTR [rbp-8], rdi > > > mov rax, QWORD PTR [rbp-8] > > > mov rax, QWORD PTR [rax] > > > and eax, 2 > > > test rax, rax > > > jne .L6 > > > call seal_system_mappings_enabled > > > test al, al > > > je .L6 > > > mov rax, QWORD PTR [rbp-8] > > > mov rax, QWORD PTR [rax] > > > or rax, 2 > > > mov rdx, rax > > > mov rax, QWORD PTR [rbp-8] > > > mov QWORD PTR [rax], rdx > > > .L6: > > > nop > > > leave > > > ret > > > main: > > > push rbp > > > mov rbp, rsp > > > sub rsp, 16 > > > mov QWORD PTR [rbp-8], 0 > > > lea rax, [rbp-8] > > > mov rdi, rax > > > call update_seal_exec_system_mappings > > > mov rax, QWORD PTR [rbp-8] > > > leave > > > ret > > > > > > ----- 48 lines ----- > > > Here is what I am suggesting to do with replacing the passing of a > > > pointer with a concise "vm_flags | mseal_exec_flags()" (with the same > > > added #defines to make it compile) > > > > > > seal_system_mappings: > > > .long 1 > > > mseal_exec_flags: > > > push rbp > > > mov rbp, rsp > > > mov eax, DWORD PTR seal_system_mappings[rip] > > > cmp eax, 1 > > > jne .L2 > > > mov eax, 2 > > > jmp .L3 > > > .L2: > > > mov eax, 0 > > > .L3: > > > pop rbp > > > ret > > > main: > > > push rbp > > > mov rbp, rsp > > > sub rsp, 16 > > > mov QWORD PTR [rbp-8], 0 > > > call mseal_exec_flags > > > mov edx, eax > > > mov rax, QWORD PTR [rbp-8] > > > or eax, edx > > > leave > > > ret > > > > > > ----- 26 lines ----- > > > > > > So as you can see, there are less instructions in my version; there are > > > 47.92% less lines of assembly. > > > > > vm_flags already run out of space in 32 bit, sooner or later we will > > need to change that to *** a struct ***, passing address will be > > becoming necessary with struct. Since this is not a performance > > sensitive code path, 3 or 4 times during execve(), I think it would be > > good to start here. > > No. > > I have pointed out why this is not 'the most efficient way' and you are > now switching your argument to 'this will be needed in the future'. > > Please fix this. > Sure, updated in V3
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e7bfe1bde49e..02e5eb23d76f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1538,6 +1538,16 @@ Permit 'security.evm' to be updated regardless of current integrity status. + exec.seal_system_mappings = [KNL] + Format: { never | always } + Seal system mappings: vdso, vvar, sigpage, uprobes, + vsyscall. + This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS_* + - 'never': never seal system mappings. + - 'always': always seal system mappings. + If not specified or invalid, default is the KCONFIG value. + This option has no effect if CONFIG_64BIT=n + early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier stages so cover more early boot allocations. Please note that as side effect some optimizations diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c index 2fb7d53cf333..20a3000550d2 100644 --- a/arch/x86/entry/vsyscall/vsyscall_64.c +++ b/arch/x86/entry/vsyscall/vsyscall_64.c @@ -32,6 +32,7 @@ #include <linux/mm_types.h> #include <linux/syscalls.h> #include <linux/ratelimit.h> +#include <linux/fs.h> #include <asm/vsyscall.h> #include <asm/unistd.h> @@ -366,8 +367,12 @@ void __init map_vsyscall(void) set_vsyscall_pgtable_user_bits(swapper_pg_dir); } - if (vsyscall_mode == XONLY) - vm_flags_init(&gate_vma, VM_EXEC); + if (vsyscall_mode == XONLY) { + unsigned long vm_flags = VM_EXEC; + + update_seal_exec_system_mappings(&vm_flags); + vm_flags_init(&gate_vma, vm_flags); + } BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != (unsigned long)VSYSCALL_ADDR); diff --git a/fs/exec.c b/fs/exec.c index 77364806b48d..5030879cda47 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -68,6 +68,7 @@ #include <linux/user_events.h> #include <linux/rseq.h> #include <linux/ksm.h> +#include <linux/fs_parser.h> #include <linux/uaccess.h> #include <asm/mmu_context.h> @@ -2159,3 +2160,55 @@ fs_initcall(init_fs_exec_sysctls); #ifdef CONFIG_EXEC_KUNIT_TEST #include "tests/exec_kunit.c" #endif + +#ifdef CONFIG_64BIT +/* + * Kernel cmdline overwrite for CONFIG_SEAL_SYSTEM_MAPPINGS_X + */ +enum seal_system_mappings_type { + SEAL_SYSTEM_MAPPINGS_NEVER, + SEAL_SYSTEM_MAPPINGS_ALWAYS +}; + +static enum seal_system_mappings_type seal_system_mappings __ro_after_init = + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS_ALWAYS) ? SEAL_SYSTEM_MAPPINGS_ALWAYS : + SEAL_SYSTEM_MAPPINGS_NEVER; + +static const struct constant_table value_table_sys_mapping[] __initconst = { + { "never", SEAL_SYSTEM_MAPPINGS_NEVER}, + { "always", SEAL_SYSTEM_MAPPINGS_ALWAYS}, + { } +}; + +static int __init early_seal_system_mappings_override(char *buf) +{ + if (!buf) + return -EINVAL; + + seal_system_mappings = lookup_constant(value_table_sys_mapping, + buf, seal_system_mappings); + + return 0; +} + +early_param("exec.seal_system_mappings", early_seal_system_mappings_override); + +static bool seal_system_mappings_enabled(void) +{ + if (seal_system_mappings == SEAL_SYSTEM_MAPPINGS_ALWAYS) + return true; + + return false; +} + +void update_seal_exec_system_mappings(unsigned long *vm_flags) +{ + if (!(*vm_flags & VM_SEALED) && seal_system_mappings_enabled()) + *vm_flags |= VM_SEALED; + +} +#else +void update_seal_exec_system_mappings(unsigned long *vm_flags) +{ +} +#endif /* CONFIG_64BIT */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 42444ec95c9b..6e44aca4b24b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3079,6 +3079,7 @@ ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos); extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *); extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *); extern struct file * open_exec(const char *); +extern void update_seal_exec_system_mappings(unsigned long *vm_flags); /* fs/dcache.c -- generic fs support functions */ extern bool is_subdir(struct dentry *, struct dentry *); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index c47a0bf25e58..e9876fae8887 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1506,7 +1506,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) } vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, - VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, + VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|VM_SEALED, &xol_mapping); if (IS_ERR(vma)) { ret = PTR_ERR(vma); diff --git a/mm/mmap.c b/mm/mmap.c index 57fd5ab2abe7..d4717e34a60d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2133,6 +2133,7 @@ struct vm_area_struct *_install_special_mapping( unsigned long addr, unsigned long len, unsigned long vm_flags, const struct vm_special_mapping *spec) { + update_seal_exec_system_mappings(&vm_flags); return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec, &special_mapping_vmops); } diff --git a/security/Kconfig b/security/Kconfig index 28e685f53bd1..4ec8045339c3 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -51,6 +51,32 @@ config PROC_MEM_NO_FORCE endchoice +choice + prompt "Seal system mappings" + default SEAL_SYSTEM_MAPPINGS_NEVER + help + Seal system mappings such as vdso, vvar, sigpage, uprobes and + vsyscall. + Note: kernel command line exec.seal_system_mappings overwrites this. + +config SEAL_SYSTEM_MAPPINGS_NEVER + bool "Traditional behavior - not sealed" + help + Do not seal system mappings. + This is default. + +config SEAL_SYSTEM_MAPPINGS_ALWAYS + bool "Always seal system mappings" + depends on 64BIT + depends on !CHECKPOINT_RESTORE + help + Seal system mappings such as vdso, vvar, sigpage, uprobes and + vsyscall. + Note: CHECKPOINT_RESTORE might relocate vdso mapping during restore, + and remap will fail if the mapping is sealed, therefore + !CHECKPOINT_RESTORE is added as dependency. +endchoice + config SECURITY bool "Enable different security models" depends on SYSFS