diff mbox series

[v3,1/1] exec: seal system mappings

Message ID 20241113191602.3541870-2-jeffxu@google.com (mailing list archive)
State New
Headers show
Series seal system mappings | expand

Commit Message

Jeff Xu Nov. 13, 2024, 7:16 p.m. UTC
From: Jeff Xu <jeffxu@chromium.org>

Seal vdso, vvar, sigpage, uprobes and vsyscall.

Those mappings are readonly or executable only, sealing can protect
them from ever changing or unmapped during the life time of the process.
For complete descriptions of memory sealing, please see mseal.rst [1].

System mappings such as vdso, vvar, and sigpage (for arm) are
generated by the kernel during program initialization, and are
sealed after creation.

Unlike the aforementioned mappings, the uprobe mapping is not
established during program startup. However, its lifetime is the same
as the process's lifetime [1]. It is sealed from creation.

The vdso, vvar, sigpage, and uprobe mappings all invoke the
_install_special_mapping() function. As no other mappings utilize this
function, it is logical to incorporate sealing logic within
_install_special_mapping(). This approach avoids the necessity of
modifying code across various architecture-specific implementations.

The vsyscall mapping, which has its own initialization function, is
sealed in the XONLY case, it seems to be the most common and secure
case of using vsyscall.

It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the mapping of vdso, vvar, and sigpage during restore
operations. Consequently, this feature cannot be universally enabled
across all systems. To address this, a kernel configuration option has
been introduced to enable or disable this functionality.

[1] Documentation/userspace-api/mseal.rst
[2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 .../admin-guide/kernel-parameters.txt         | 10 +++++
 arch/x86/entry/vsyscall/vsyscall_64.c         |  9 ++++-
 include/linux/mm.h                            | 12 ++++++
 mm/mmap.c                                     | 10 +++++
 mm/mseal.c                                    | 39 +++++++++++++++++++
 security/Kconfig                              | 11 ++++++
 6 files changed, 89 insertions(+), 2 deletions(-)

Comments

Lorenzo Stoakes Nov. 13, 2024, 8:47 p.m. UTC | #1
I'd prefer not to move forward with this until we have confirmation that
adequate testing has been performed, given how invasive this change is,
even if behind a flag (unless we explicitly mention it is untested in the
Kconfig).

We are touching arch-specific stuff with VDSO, VVAR, etc. so we need to be
cautious when we're in effect hooking an arch-specific function in mm.

Other than that, the actual patch isn't too crazy overall.

I think a sensible approach might be to only enable on known-good arches.

On Wed, Nov 13, 2024 at 07:16:02PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> Seal vdso, vvar, sigpage, uprobes and vsyscall.
>
> Those mappings are readonly or executable only, sealing can protect
> them from ever changing or unmapped during the life time of the process.
> For complete descriptions of memory sealing, please see mseal.rst [1].
>
> System mappings such as vdso, vvar, and sigpage (for arm) are
> generated by the kernel during program initialization, and are
> sealed after creation.
>
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [1]. It is sealed from creation.
>
> The vdso, vvar, sigpage, and uprobe mappings all invoke the
> _install_special_mapping() function. As no other mappings utilize this
> function, it is logical to incorporate sealing logic within
> _install_special_mapping(). This approach avoids the necessity of
> modifying code across various architecture-specific implementations.
>
> The vsyscall mapping, which has its own initialization function, is
> sealed in the XONLY case, it seems to be the most common and secure
> case of using vsyscall.
>
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the mapping of vdso, vvar, and sigpage during restore
> operations. Consequently, this feature cannot be universally enabled
> across all systems. To address this, a kernel configuration option has
> been introduced to enable or disable this functionality.
>
> [1] Documentation/userspace-api/mseal.rst

It'd be nice to explicitly refer to this in the docs, it's not quite urgent
though would be nice to be part of this series.

> [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  .../admin-guide/kernel-parameters.txt         | 10 +++++
>  arch/x86/entry/vsyscall/vsyscall_64.c         |  9 ++++-
>  include/linux/mm.h                            | 12 ++++++
>  mm/mmap.c                                     | 10 +++++
>  mm/mseal.c                                    | 39 +++++++++++++++++++
>  security/Kconfig                              | 11 ++++++
>  6 files changed, 89 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index e7bfe1bde49e..469a65b3cf50 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1538,6 +1538,16 @@
>  			Permit 'security.evm' to be updated regardless of
>  			current integrity status.
>
> +	exec.seal_system_mappings = [KNL]
> +			Format: { no | yes }
> +			Seal system mappings: vdso, vvar, sigpage, vsyscall,
> +			uprobe.
> +			This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS
> +			- 'no':  do not seal system mappings.
> +			- 'yes': seal system mappings.
> +			If not specified or invalid, default is the KCONFIG value.
> +			This option has no effect if CONFIG_64BIT=n

Or if CONFIG_CHECKPOINT_RESTORE is not set. Please update to reference this
also.

> +
>  	early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
>  			stages so cover more early boot allocations.
>  			Please note that as side effect some optimizations
> diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> index 2fb7d53cf333..185553376f39 100644
> --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> @@ -366,8 +366,13 @@ void __init map_vsyscall(void)
>  		set_vsyscall_pgtable_user_bits(swapper_pg_dir);
>  	}
>
> -	if (vsyscall_mode == XONLY)
> -		vm_flags_init(&gate_vma, VM_EXEC);
> +	if (vsyscall_mode == XONLY) {
> +		unsigned long vm_flags = VM_EXEC;
> +
> +		vm_flags |= seal_system_mappings();
> +
> +		vm_flags_init(&gate_vma, vm_flags);

Nit: remove weird whitespace above. Also might be worth adding a comment as
to what we're doing here similar to the one in _install_special_mapping().

> +	}
>
>  	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
>  		     (unsigned long)VSYSCALL_ADDR);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index df0a5eac66b7..f787d6c85cbb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
>  int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
>  int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
>
> +#ifdef CONFIG_64BIT
> +/*
> + * return VM_SEALED if seal system mapping is enabled.
> + */
> +unsigned long seal_system_mappings(void);
> +#else
> +static inline unsigned long seal_system_mappings(void)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 57fd5ab2abe7..bc694c555805 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
>  	unsigned long addr, unsigned long len,
>  	unsigned long vm_flags, const struct vm_special_mapping *spec)
>  {
> +	/*
> +	 * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> +	 * invoke the _install_special_mapping function can be sealed.
> +	 * Therefore, it is logical to call the seal_system_mappings_enabled()
> +	 * function here. In the future, if this is not the case, i.e. if certain
> +	 * mappings cannot be sealed, then it would be necessary to move this
> +	 * check to the calling function.
> +	 */

Nice comment!

> +	vm_flags |= seal_system_mappings();
> +
>  	return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
>  					&special_mapping_vmops);
>  }
> diff --git a/mm/mseal.c b/mm/mseal.c
> index ece977bd21e1..0a9d1e9faa28 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -7,6 +7,7 @@
>   *  Author: Jeff Xu <jeffxu@chromium.org>
>   */
>
> +#include <linux/fs_parser.h>
>  #include <linux/mempolicy.h>
>  #include <linux/mman.h>
>  #include <linux/mm.h>
> @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
>  {
>  	return do_mseal(start, len, flags);
>  }
> +
> +/*
> + * Kernel cmdline overwrite for CONFIG_SEAL_SYSTEM_MAPPINGS
> + */
> +enum seal_system_mappings_type {
> +	SEAL_SYSTEM_MAPPINGS_DISABLED,
> +	SEAL_SYSTEM_MAPPINGS_ENABLED
> +};
> +
> +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> +	IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> +	SEAL_SYSTEM_MAPPINGS_DISABLED;
> +
> +static const struct constant_table value_table_sys_mapping[] __initconst = {
> +	{ "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> +	{ "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> +	{ }
> +};
> +
> +static int __init early_seal_system_mappings_override(char *buf)
> +{
> +	if (!buf)
> +		return -EINVAL;
> +
> +	seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> +			buf, seal_system_mappings_v);
> +	return 0;
> +}
> +
> +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> +
> +unsigned long seal_system_mappings(void)
> +{
> +	if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> +		return VM_SEALED;
> +
> +	return 0;
> +}
> diff --git a/security/Kconfig b/security/Kconfig
> index 28e685f53bd1..63b87a218943 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -51,6 +51,17 @@ config PROC_MEM_NO_FORCE
>
>  endchoice
>
> +config SEAL_SYSTEM_MAPPINGS
> +	bool "seal system mappings"
> +	default n
> +	depends on 64BIT
> +	depends on !CHECKPOINT_RESTORE

Would prefer to depend on actually tested architectures only.

> +	help
> +	  Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> +	  Note: CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> +	  and remap will fail if the mapping is sealed, therefore
> +	  !CHECKPOINT_RESTORE is added as dependency.
> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> --
> 2.47.0.277.g8800431eea-goog
>
Liam R. Howlett Nov. 14, 2024, 12:54 a.m. UTC | #2
* jeffxu@chromium.org <jeffxu@chromium.org> [241113 14:16]:
> From: Jeff Xu <jeffxu@chromium.org>

Thanks, this looks much better.  Some minor things below.

> 
> Seal vdso, vvar, sigpage, uprobes and vsyscall.
> 
> Those mappings are readonly or executable only, sealing can protect
> them from ever changing or unmapped during the life time of the process.
> For complete descriptions of memory sealing, please see mseal.rst [1].
> 
> System mappings such as vdso, vvar, and sigpage (for arm) are
> generated by the kernel during program initialization, and are
> sealed after creation.
> 
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [1]. It is sealed from creation.

Why are you referencing mseal.rst for the uprobe mapping lifetime?  I
can't find anything in there about uprobe.

> 
> The vdso, vvar, sigpage, and uprobe mappings all invoke the
> _install_special_mapping() function. As no other mappings utilize this
> function, it is logical to incorporate sealing logic within
> _install_special_mapping(). This approach avoids the necessity of
> modifying code across various architecture-specific implementations.
> 
> The vsyscall mapping, which has its own initialization function, is
> sealed in the XONLY case, it seems to be the most common and secure
> case of using vsyscall.
> 
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the mapping of vdso, vvar, and sigpage during restore
> operations. Consequently, this feature cannot be universally enabled
> across all systems. To address this, a kernel configuration option has
> been introduced to enable or disable this functionality.

It also can't be used on 32 bit systems, as per your kernel-parameters
changes (and mseal specification).  This is missing from the changelog.

> 
> [1] Documentation/userspace-api/mseal.rst
> [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  .../admin-guide/kernel-parameters.txt         | 10 +++++
>  arch/x86/entry/vsyscall/vsyscall_64.c         |  9 ++++-
>  include/linux/mm.h                            | 12 ++++++
>  mm/mmap.c                                     | 10 +++++
>  mm/mseal.c                                    | 39 +++++++++++++++++++
>  security/Kconfig                              | 11 ++++++
>  6 files changed, 89 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index e7bfe1bde49e..469a65b3cf50 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1538,6 +1538,16 @@
>  			Permit 'security.evm' to be updated regardless of
>  			current integrity status.
>  
> +	exec.seal_system_mappings = [KNL]
> +			Format: { no | yes }
> +			Seal system mappings: vdso, vvar, sigpage, vsyscall,
> +			uprobe.
> +			This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS
                             ^^^^^^^^^ overrides ?
> +			- 'no':  do not seal system mappings.
> +			- 'yes': seal system mappings.
> +			If not specified or invalid, default is the KCONFIG value.
> +			This option has no effect if CONFIG_64BIT=n
> +
>  	early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
>  			stages so cover more early boot allocations.
>  			Please note that as side effect some optimizations
> diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> index 2fb7d53cf333..185553376f39 100644
> --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> @@ -366,8 +366,13 @@ void __init map_vsyscall(void)
>  		set_vsyscall_pgtable_user_bits(swapper_pg_dir);
>  	}
>  
> -	if (vsyscall_mode == XONLY)
> -		vm_flags_init(&gate_vma, VM_EXEC);
> +	if (vsyscall_mode == XONLY) {
> +		unsigned long vm_flags = VM_EXEC;
> +
> +		vm_flags |= seal_system_mappings();
> +

nit, extra line here.

> +		vm_flags_init(&gate_vma, vm_flags);
> +	}
>  
>  	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
>  		     (unsigned long)VSYSCALL_ADDR);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index df0a5eac66b7..f787d6c85cbb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
>  int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
>  int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
>  
> +#ifdef CONFIG_64BIT
> +/*
> + * return VM_SEALED if seal system mapping is enabled.
> + */
> +unsigned long seal_system_mappings(void);
> +#else
> +static inline unsigned long seal_system_mappings(void)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 57fd5ab2abe7..bc694c555805 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
>  	unsigned long addr, unsigned long len,
>  	unsigned long vm_flags, const struct vm_special_mapping *spec)
>  {
> +	/*
> +	 * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> +	 * invoke the _install_special_mapping function can be sealed.
> +	 * Therefore, it is logical to call the seal_system_mappings_enabled()
> +	 * function here. In the future, if this is not the case, i.e. if certain
> +	 * mappings cannot be sealed, then it would be necessary to move this
> +	 * check to the calling function.
> +	 */
> +	vm_flags |= seal_system_mappings();
> +

Thank you for making this change into an or operation with a single
function.  It is much easier to figure out what is going on.

But.. this will add the VM_SEALED flag on any 64bit architecture that
enables the SEAL_SYSTEM_MAPPINGS config.  That will happen by bots with
random config builds.  I don't know if they have test cases that
specifically unmap the vmas you are sealing (ppc64 probably tries to
unmap the vdso).

I do know that I've had syzbot bugs that unmap _all_ vmas.  I'm guessing
you will get bot notification on these failures for any 64bit
architecture.  You may want to look into it to avoid such fuzzing
failures, but we still need this to be tested somehow.

>  	return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
>  					&special_mapping_vmops);
>  }
> diff --git a/mm/mseal.c b/mm/mseal.c
> index ece977bd21e1..0a9d1e9faa28 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c

Thank you for moving this to mseal.c

> @@ -7,6 +7,7 @@
>   *  Author: Jeff Xu <jeffxu@chromium.org>
>   */
>  
> +#include <linux/fs_parser.h>
>  #include <linux/mempolicy.h>
>  #include <linux/mman.h>
>  #include <linux/mm.h>
> @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
>  {
>  	return do_mseal(start, len, flags);
>  }
> +
> +/*
> + * Kernel cmdline overwrite for CONFIG_SEAL_SYSTEM_MAPPINGS

overwrite or override?  I think the difference is that overwrite implies
permanence where override doesn't.  I'm fine with either, it just reads
a bit odd to me.

> + */
> +enum seal_system_mappings_type {
> +	SEAL_SYSTEM_MAPPINGS_DISABLED,
> +	SEAL_SYSTEM_MAPPINGS_ENABLED
> +};
> +
> +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> +	IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> +	SEAL_SYSTEM_MAPPINGS_DISABLED;
> +
> +static const struct constant_table value_table_sys_mapping[] __initconst = {
> +	{ "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> +	{ "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> +	{ }
> +};
> +
> +static int __init early_seal_system_mappings_override(char *buf)
> +{
> +	if (!buf)
> +		return -EINVAL;
> +
> +	seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> +			buf, seal_system_mappings_v);
> +	return 0;
> +}
> +
> +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> +
> +unsigned long seal_system_mappings(void)
> +{
> +	if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> +		return VM_SEALED;
> +
> +	return 0;
> +}
> diff --git a/security/Kconfig b/security/Kconfig
> index 28e685f53bd1..63b87a218943 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -51,6 +51,17 @@ config PROC_MEM_NO_FORCE
>  
>  endchoice
>  
> +config SEAL_SYSTEM_MAPPINGS
> +	bool "seal system mappings"
> +	default n
> +	depends on 64BIT
> +	depends on !CHECKPOINT_RESTORE
> +	help
> +	  Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> +	  Note: CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> +	  and remap will fail if the mapping is sealed, therefore
> +	  !CHECKPOINT_RESTORE is added as dependency.

You could also add a portion here about your override functionality on
command line. "this can be disabled or enabled by..."

I really think having something that can be found by searching for mseal
is really desirable here.  That is, make menuconfig, press / for search,
type mseal -> find this feature.  If this was MSEAL_SYSTEM_MAPPINGS,
searching for seal or mseal would work and would serve to link the
config option to the mseal document.

Right now there is no information in the help that will allow it to be
found by 'mseal'.  There is also nothing in the documentation that
states this exists, which you should probably update with this feature?

Thanks,
Liam
Jeff Xu Nov. 19, 2024, 8:57 p.m. UTC | #3
Hi Liam,

On Wed, Nov 13, 2024 at 4:54 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Unlike the aforementioned mappings, the uprobe mapping is not
> > established during program startup. However, its lifetime is the same
> > as the process's lifetime [1]. It is sealed from creation.
>
> Why are you referencing mseal.rst for the uprobe mapping lifetime?  I
> can't find anything in there about uprobe.
>
This should be [2], thanks for checking.

>
> It also can't be used on 32 bit systems, as per your kernel-parameters
> changes (and mseal specification).  This is missing from the changelog.
>
sure, I will add that to the commit msg.

> > +     exec.seal_system_mappings = [KNL]
> > +                     Format: { no | yes }
> > +                     Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > +                     uprobe.
> > +                     This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS
>                              ^^^^^^^^^ overrides ?
sure.


> > -     if (vsyscall_mode == XONLY)
> > -             vm_flags_init(&gate_vma, VM_EXEC);
> > +     if (vsyscall_mode == XONLY) {
> > +             unsigned long vm_flags = VM_EXEC;
> > +
> > +             vm_flags |= seal_system_mappings();
> > +
>
> nit, extra line here.
>
removed.

> But.. this will add the VM_SEALED flag on any 64bit architecture that
> enables the SEAL_SYSTEM_MAPPINGS config.  That will happen by bots with
> random config builds.  I don't know if they have test cases that
> specifically unmap the vmas you are sealing (ppc64 probably tries to
> unmap the vdso).
>
> I do know that I've had syzbot bugs that unmap _all_ vmas.  I'm guessing
> you will get bot notification on these failures for any 64bit
> architecture.  You may want to look into it to avoid such fuzzing
> failures, but we still need this to be tested somehow.
>test_mremap_vdso.c
I found one selftest that could fail:
tools/testing/selftests/x86/test_mremap_vdso.c

I could add tools/testing/selftests/x86/config and add
CONFIG_SYSTEM_MAPPINGS=n there.
as instructed in selftest documentation [1]

[1] https://docs.kernel.org/dev-tools/kselftest.html#contributing-new-tests-details

>
> overwrite or override?  I think the difference is that overwrite implies
> permanence where override doesn't.  I'm fine with either, it just reads
> a bit odd to me.
>
sure, changed to override

> >
> > +config SEAL_SYSTEM_MAPPINGS
> > +     bool "seal system mappings"
> > +     default n
> > +     depends on 64BIT
> > +     depends on !CHECKPOINT_RESTORE
> > +     help
> > +       Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> > +       Note: CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> > +       and remap will fail if the mapping is sealed, therefore
> > +       !CHECKPOINT_RESTORE is added as dependency.
>
> You could also add a portion here about your override functionality on
> command line. "this can be disabled or enabled by..."
>
sure.

> I really think having something that can be found by searching for mseal
> is really desirable here.  That is, make menuconfig, press / for search,
> type mseal -> find this feature.  If this was MSEAL_SYSTEM_MAPPINGS,
> searching for seal or mseal would work and would serve to link the
> config option to the mseal document.
>
using "seal" would work here. I will add a note here to mseal.rst for reference.

> Right now there is no information in the help that will allow it to be
> found by 'mseal'.  There is also nothing in the documentation that
> states this exists, which you should probably update with this feature?
>
I will update mseal.rst to include this feature.

Thanks for reviewing.
-Jeff
Jeff Xu Nov. 19, 2024, 11:49 p.m. UTC | #4
Hi Lorenzo

On Wed, Nov 13, 2024 at 12:47 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> I'd prefer not to move forward with this until we have confirmation that
> adequate testing has been performed, given how invasive this change is,
> even if behind a flag (unless we explicitly mention it is untested in the
> Kconfig).
>
> We are touching arch-specific stuff with VDSO, VVAR, etc. so we need to be
> cautious when we're in effect hooking an arch-specific function in mm.
>
> Other than that, the actual patch isn't too crazy overall.
>
> I think a sensible approach might be to only enable on known-good arches.
>
I responded to this in the other email where you raised the same point.

> On Wed, Nov 13, 2024 at 07:16:02PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> >
> > Those mappings are readonly or executable only, sealing can protect
> > them from ever changing or unmapped during the life time of the process.
> > For complete descriptions of memory sealing, please see mseal.rst [1].
> >
> > System mappings such as vdso, vvar, and sigpage (for arm) are
> > generated by the kernel during program initialization, and are
> > sealed after creation.
> >
> > Unlike the aforementioned mappings, the uprobe mapping is not
> > established during program startup. However, its lifetime is the same
> > as the process's lifetime [1]. It is sealed from creation.
> >
> > The vdso, vvar, sigpage, and uprobe mappings all invoke the
> > _install_special_mapping() function. As no other mappings utilize this
> > function, it is logical to incorporate sealing logic within
> > _install_special_mapping(). This approach avoids the necessity of
> > modifying code across various architecture-specific implementations.
> >
> > The vsyscall mapping, which has its own initialization function, is
> > sealed in the XONLY case, it seems to be the most common and secure
> > case of using vsyscall.
> >
> > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> > alter the mapping of vdso, vvar, and sigpage during restore
> > operations. Consequently, this feature cannot be universally enabled
> > across all systems. To address this, a kernel configuration option has
> > been introduced to enable or disable this functionality.
> >
> > [1] Documentation/userspace-api/mseal.rst
>
> It'd be nice to explicitly refer to this in the docs, it's not quite urgent
> though would be nice to be part of this series.
>
will update mseal.rst next version.

> > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  .../admin-guide/kernel-parameters.txt         | 10 +++++
> >  arch/x86/entry/vsyscall/vsyscall_64.c         |  9 ++++-
> >  include/linux/mm.h                            | 12 ++++++
> >  mm/mmap.c                                     | 10 +++++
> >  mm/mseal.c                                    | 39 +++++++++++++++++++
> >  security/Kconfig                              | 11 ++++++
> >  6 files changed, 89 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index e7bfe1bde49e..469a65b3cf50 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1538,6 +1538,16 @@
> >                       Permit 'security.evm' to be updated regardless of
> >                       current integrity status.
> >
> > +     exec.seal_system_mappings = [KNL]
> > +                     Format: { no | yes }
> > +                     Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > +                     uprobe.
> > +                     This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS
> > +                     - 'no':  do not seal system mappings.
> > +                     - 'yes': seal system mappings.
> > +                     If not specified or invalid, default is the KCONFIG value.
> > +                     This option has no effect if CONFIG_64BIT=n
>
> Or if CONFIG_CHECKPOINT_RESTORE is not set. Please update to reference this
> also.
>
I will update this part. Liam has a similar comment.

> > +
> >       early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
> >                       stages so cover more early boot allocations.
> >                       Please note that as side effect some optimizations
> > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> > index 2fb7d53cf333..185553376f39 100644
> > --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> > @@ -366,8 +366,13 @@ void __init map_vsyscall(void)
> >               set_vsyscall_pgtable_user_bits(swapper_pg_dir);
> >       }
> >
> > -     if (vsyscall_mode == XONLY)
> > -             vm_flags_init(&gate_vma, VM_EXEC);
> > +     if (vsyscall_mode == XONLY) {
> > +             unsigned long vm_flags = VM_EXEC;
> > +
> > +             vm_flags |= seal_system_mappings();
> > +
> > +             vm_flags_init(&gate_vma, vm_flags);
>
> Nit: remove weird whitespace above. Also might be worth adding a comment as
> to what we're doing here similar to the one in _install_special_mapping().
>
Done.

> > +     }
> >
> >       BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
> >                    (unsigned long)VSYSCALL_ADDR);
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index df0a5eac66b7..f787d6c85cbb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> >  int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> >  int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> >
> > +#ifdef CONFIG_64BIT
> > +/*
> > + * return VM_SEALED if seal system mapping is enabled.
> > + */
> > +unsigned long seal_system_mappings(void);
> > +#else
> > +static inline unsigned long seal_system_mappings(void)
> > +{
> > +     return 0;
> > +}
> > +#endif
> > +
> >  #endif /* _LINUX_MM_H */
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 57fd5ab2abe7..bc694c555805 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
> >       unsigned long addr, unsigned long len,
> >       unsigned long vm_flags, const struct vm_special_mapping *spec)
> >  {
> > +     /*
> > +      * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> > +      * invoke the _install_special_mapping function can be sealed.
> > +      * Therefore, it is logical to call the seal_system_mappings_enabled()
> > +      * function here. In the future, if this is not the case, i.e. if certain
> > +      * mappings cannot be sealed, then it would be necessary to move this
> > +      * check to the calling function.
> > +      */
>
> Nice comment!
>
> > +     vm_flags |= seal_system_mappings();
> > +
> >       return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
> >                                       &special_mapping_vmops);
> >  }
> > diff --git a/mm/mseal.c b/mm/mseal.c
> > index ece977bd21e1..0a9d1e9faa28 100644
> > --- a/mm/mseal.c
> > +++ b/mm/mseal.c
> > @@ -7,6 +7,7 @@
> >   *  Author: Jeff Xu <jeffxu@chromium.org>
> >   */
> >
> > +#include <linux/fs_parser.h>
> >  #include <linux/mempolicy.h>
> >  #include <linux/mman.h>
> >  #include <linux/mm.h>
> > @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
> >  {
> >       return do_mseal(start, len, flags);
> >  }
> > +
> > +/*
> > + * Kernel cmdline overwrite for CONFIG_SEAL_SYSTEM_MAPPINGS
> > + */
> > +enum seal_system_mappings_type {
> > +     SEAL_SYSTEM_MAPPINGS_DISABLED,
> > +     SEAL_SYSTEM_MAPPINGS_ENABLED
> > +};
> > +
> > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > +     IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > +     SEAL_SYSTEM_MAPPINGS_DISABLED;
> > +
> > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > +     { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > +     { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > +     { }
> > +};
> > +
> > +static int __init early_seal_system_mappings_override(char *buf)
> > +{
> > +     if (!buf)
> > +             return -EINVAL;
> > +
> > +     seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > +                     buf, seal_system_mappings_v);
> > +     return 0;
> > +}
> > +
> > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> > +
> > +unsigned long seal_system_mappings(void)
> > +{
> > +     if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> > +             return VM_SEALED;
> > +
> > +     return 0;
> > +}
> > diff --git a/security/Kconfig b/security/Kconfig
> > index 28e685f53bd1..63b87a218943 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -51,6 +51,17 @@ config PROC_MEM_NO_FORCE
> >
> >  endchoice
> >
> > +config SEAL_SYSTEM_MAPPINGS
> > +     bool "seal system mappings"
> > +     default n
> > +     depends on 64BIT
> > +     depends on !CHECKPOINT_RESTORE
>
> Would prefer to depend on actually tested architectures only.
>
I responded in the other email where you raised the same point.

Thanks for reviewing

-Jeff

> > +     help
> > +       Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> > +       Note: CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> > +       and remap will fail if the mapping is sealed, therefore
> > +       !CHECKPOINT_RESTORE is added as dependency.
> > +
> >  config SECURITY
> >       bool "Enable different security models"
> >       depends on SYSFS
> > --
> > 2.47.0.277.g8800431eea-goog
> >
diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e7bfe1bde49e..469a65b3cf50 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1538,6 +1538,16 @@ 
 			Permit 'security.evm' to be updated regardless of
 			current integrity status.
 
+	exec.seal_system_mappings = [KNL]
+			Format: { no | yes }
+			Seal system mappings: vdso, vvar, sigpage, vsyscall,
+			uprobe.
+			This overwrites KCONFIG CONFIG_SEAL_SYSTEM_MAPPINGS
+			- 'no':  do not seal system mappings.
+			- 'yes': seal system mappings.
+			If not specified or invalid, default is the KCONFIG value.
+			This option has no effect if CONFIG_64BIT=n
+
 	early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
 			stages so cover more early boot allocations.
 			Please note that as side effect some optimizations
diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index 2fb7d53cf333..185553376f39 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -366,8 +366,13 @@  void __init map_vsyscall(void)
 		set_vsyscall_pgtable_user_bits(swapper_pg_dir);
 	}
 
-	if (vsyscall_mode == XONLY)
-		vm_flags_init(&gate_vma, VM_EXEC);
+	if (vsyscall_mode == XONLY) {
+		unsigned long vm_flags = VM_EXEC;
+
+		vm_flags |= seal_system_mappings();
+
+		vm_flags_init(&gate_vma, vm_flags);
+	}
 
 	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
 		     (unsigned long)VSYSCALL_ADDR);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index df0a5eac66b7..f787d6c85cbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4238,4 +4238,16 @@  int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
 int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
 int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
 
+#ifdef CONFIG_64BIT
+/*
+ * return VM_SEALED if seal system mapping is enabled.
+ */
+unsigned long seal_system_mappings(void);
+#else
+static inline unsigned long seal_system_mappings(void)
+{
+	return 0;
+}
+#endif
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index 57fd5ab2abe7..bc694c555805 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2133,6 +2133,16 @@  struct vm_area_struct *_install_special_mapping(
 	unsigned long addr, unsigned long len,
 	unsigned long vm_flags, const struct vm_special_mapping *spec)
 {
+	/*
+	 * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
+	 * invoke the _install_special_mapping function can be sealed.
+	 * Therefore, it is logical to call the seal_system_mappings_enabled()
+	 * function here. In the future, if this is not the case, i.e. if certain
+	 * mappings cannot be sealed, then it would be necessary to move this
+	 * check to the calling function.
+	 */
+	vm_flags |= seal_system_mappings();
+
 	return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
 					&special_mapping_vmops);
 }
diff --git a/mm/mseal.c b/mm/mseal.c
index ece977bd21e1..0a9d1e9faa28 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -7,6 +7,7 @@ 
  *  Author: Jeff Xu <jeffxu@chromium.org>
  */
 
+#include <linux/fs_parser.h>
 #include <linux/mempolicy.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
@@ -266,3 +267,41 @@  SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
 {
 	return do_mseal(start, len, flags);
 }
+
+/*
+ * Kernel cmdline overwrite for CONFIG_SEAL_SYSTEM_MAPPINGS
+ */
+enum seal_system_mappings_type {
+	SEAL_SYSTEM_MAPPINGS_DISABLED,
+	SEAL_SYSTEM_MAPPINGS_ENABLED
+};
+
+static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
+	IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
+	SEAL_SYSTEM_MAPPINGS_DISABLED;
+
+static const struct constant_table value_table_sys_mapping[] __initconst = {
+	{ "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
+	{ "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
+	{ }
+};
+
+static int __init early_seal_system_mappings_override(char *buf)
+{
+	if (!buf)
+		return -EINVAL;
+
+	seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
+			buf, seal_system_mappings_v);
+	return 0;
+}
+
+early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
+
+unsigned long seal_system_mappings(void)
+{
+	if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
+		return VM_SEALED;
+
+	return 0;
+}
diff --git a/security/Kconfig b/security/Kconfig
index 28e685f53bd1..63b87a218943 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -51,6 +51,17 @@  config PROC_MEM_NO_FORCE
 
 endchoice
 
+config SEAL_SYSTEM_MAPPINGS
+	bool "seal system mappings"
+	default n
+	depends on 64BIT
+	depends on !CHECKPOINT_RESTORE
+	help
+	  Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
+	  Note: CHECKPOINT_RESTORE might relocate vdso mapping during restore,
+	  and remap will fail if the mapping is sealed, therefore
+	  !CHECKPOINT_RESTORE is added as dependency.
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS