Message ID | 20240222131230.635-5-petrtesarik@huaweicloud.com (mailing list archive)
---|---
State | RFC
Series | PoC: convert AppArmor parser to SandBox Mode
On 2/22/24 05:12, Petr Tesarik wrote:
>  static const struct sbm_fixup fixups[] =
>  {
> +	/* kmalloc() and friends */
> +	{ kmalloc_trace, proxy_alloc3 },
> +	{ __kmalloc, proxy_alloc1 },
> +	{ __kmalloc_node, proxy_alloc1 },
> +	{ __kmalloc_node_track_caller, proxy_alloc1 },
> +	{ kmalloc_large, proxy_alloc1 },
> +	{ kmalloc_large_node, proxy_alloc1 },
> +	{ krealloc, proxy_alloc2 },
> +	{ kfree, proxy_free },
> +
> +	/* vmalloc() and friends */
> +	{ vmalloc, proxy_alloc1 },
> +	{ __vmalloc, proxy_alloc1 },
> +	{ __vmalloc_node, proxy_alloc1 },
> +	{ vzalloc, proxy_alloc1 },
> +	{ vfree, proxy_free },
> +
> 	{ }
>  };

Petr, thanks for sending this. This _is_ a pretty concise example of what it means to convert kernel code to run in your sandbox mode. But, from me, it's still "no thanks".

Establishing and maintaining this proxy list will be painful. Folks will change the code to call something new and break this *constantly*.

That goes for infrastructure like the allocators and for individual sandbox instances like apparmor.

It's also telling that sandboxing a bit of apparmor took four fixups. That tells me we're probably still only looking at the tip of the iceberg if we were to convert a bunch more sites.

That's on top of everything I was concerned about before.
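The fixups[] array under discussion is a sentinel-terminated allow-list that maps call targets to proxy handlers. A minimal user-space sketch of the lookup pattern (the types and names here are simplified stand-ins, not the kernel's `struct sbm_fixup` API):

```c
#include <stddef.h>

/* Simplified stand-ins: the real table pairs kernel function addresses
 * with sbm_proxy_call_fn handlers. */
typedef long (*proxy_fn)(long arg);
typedef int (*target_fn)(void);

struct fixup {
	target_fn func;  /* function the sandboxed code tries to call */
	proxy_fn proxy;  /* handler that runs on its behalf */
};

static long proxy_alloc(long size) { return size; } /* placeholder */
static long proxy_free(long p) { (void)p; return 0; } /* placeholder */

static int target_a(void) { return 1; }
static int target_b(void) { return 2; }
static int target_c(void) { return 3; } /* intentionally not listed */

/* Sentinel-terminated table, mirroring the fixups[] array above. */
static const struct fixup fixups[] = {
	{ target_a, proxy_alloc },
	{ target_b, proxy_free },
	{ NULL, NULL } /* the { } terminator */
};

/* Linear lookup: returns the proxy for a call target, or NULL when the
 * callee is not on the allow-list (the call is then rejected). */
static proxy_fn find_fixup(target_fn func)
{
	const struct fixup *f;

	for (f = fixups; f->func; f++)
		if (f->func == func)
			return f->proxy;
	return NULL;
}
```

This is exactly why the list is fragile: any newly introduced callee, such as target_c here, misses the table until someone adds an entry.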
On Thu, 22 Feb 2024 07:51:00 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> On 2/22/24 05:12, Petr Tesarik wrote:
> >  static const struct sbm_fixup fixups[] =
> >  {
> > +	/* kmalloc() and friends */
> > +	{ kmalloc_trace, proxy_alloc3 },
> > +	{ __kmalloc, proxy_alloc1 },
> > +	{ __kmalloc_node, proxy_alloc1 },
> > +	{ __kmalloc_node_track_caller, proxy_alloc1 },
> > +	{ kmalloc_large, proxy_alloc1 },
> > +	{ kmalloc_large_node, proxy_alloc1 },
> > +	{ krealloc, proxy_alloc2 },
> > +	{ kfree, proxy_free },
> > +
> > +	/* vmalloc() and friends */
> > +	{ vmalloc, proxy_alloc1 },
> > +	{ __vmalloc, proxy_alloc1 },
> > +	{ __vmalloc_node, proxy_alloc1 },
> > +	{ vzalloc, proxy_alloc1 },
> > +	{ vfree, proxy_free },
> > +
> > 	{ }
> >  };
>
> Petr, thanks for sending this. This _is_ a pretty concise example of
> what it means to convert kernel code to run in your sandbox mode. But,
> from me, it's still "no thanks".
>
> Establishing and maintaining this proxy list will be painful. Folks
> will change the code to call something new and break this *constantly*.
>
> That goes for infrastructure like the allocators and for individual
> sandbox instances like apparmor.

Understood. OTOH the proxy list is here for the PoC so I could send something that builds and runs without making it an overly big patch series. As explained in patch 5/5, the goal is not to make a global list. Instead, each instance should define what it needs and thereby define its specific policy for interfacing with the rest of the kernel.

To give an example, these AppArmor fixups would be added only to the sandbox which runs aa_unpack(), but not to the one which runs unpack_to_rootfs(), which is another PoC I did (but it required porting more patches).

If more fixups are needed after you change your code, you know you've just added a new dependency. It's then up to you to decide if it was intentional.

> It's also telling that sandboxing a bit of apparmor took four fixups.
> That tells me we're probably still only looking at the tip of the iceberg
> if we were to convert a bunch more sites.

Yes, it is the cost paid for getting code and data flows under control. In your opinion, this kind of memory safety is not worth the effort of explicitly defining the interface between a sandboxed component and the rest of the kernel, because it increases maintenance costs. Correct?

> That's on top of everything I was concerned about before.

Good, I think I can understand the new concern, but regarding everything you were concerned about before, this part is still not quite clear to me. I'll try to summarize the points:

* Running code in ring-0 is inherently safer than running code in ring-3.

  Since what I'm trying to do is protect kernel data structures from memory safety bugs in another part of the kernel, it roughly translates to: "Kernel data structures are better protected from rogue kernel modules than from userspace applications." This cannot possibly be what you are trying to say.

* SMAP, SMEP and/or LASS can somehow protect one part of the kernel from memory safety bugs in another part of the kernel.

  I somehow can't see how that is the case. I have always thought that:

  * SMEP prevents the kernel from executing code in user pages.
  * SMAP prevents the kernel from reading from or writing to user pages.
  * LASS does pretty much the same job as SMEP+SMAP, but instead of using page table protection bits, it uses the highest bit of the virtual address, because that's much faster.

* Hardware designers are adding (other) hardware security defenses to ring-0 that are not applied to ring-3.

  Could you give an example of these other security defenses, please?

* Ring-3 is more exposed to attacks.

  This statement sounds a bit too vague on its own. What attack vectors are we talking about?
The primary attack vector that SBM is trying to address is the exploitation of kernel code vulnerabilities triggered by data from sources outside the kernel (boot loader, userspace, etc.).

H. Peter Anvin added a few other points:

* SBM has all the downsides of a microkernel without the upsides.

  I can only guess what the downsides and upsides would be... One notorious downside is performance. Agreed, there is some overhead. I'm not promoting SBM for time-critical operations. But compared to user-mode helpers (which were suggested as an alternative for one of the proposed scenarios), the overhead of SBM is at least an order of magnitude less.

  IPC and the need to define how servers interact with each other is another downside I can think of. Yes, there is a bit of it in SBM, as you have correctly noted above.

* SBM introduces architectural changes that are most definitely *very* harmful both to maintainers and users.

  It is very difficult to learn something from this statement. Could you give some examples of how SBM harms either group, please?

* SBM feels like paravirtualization all over again.

  All right, hpa, you've had lots of pain with paravirtualization. I feel with you; I've had my share of it too. Can you imagine how much trouble I could have spared myself in the libkdumpfile project if I didn't have to deal with the difference between "physical addresses" and "machine addresses"? However, this is hardly a relevant point. The Linux kernel community is respected for making decisions based on facts, not feelings.

* SBM exposes kernel memory to user space.

  This is a misunderstanding. Sandbox mode does not share anything at all with user mode. It does share some CPU state with kernel mode, but not with user mode. If "user space" was intended to mean "ring-3", then it doesn't explain why that is a really bad idea.

* SBM is not needed, because there is already eBPF.

  Well, yes, but I believe they work on a different level.
  For example, eBPF needs a verifier to ensure memory safety. If you run eBPF code itself in a sandbox instead, that verifier is not needed, because memory safety is enforced by CPU hardware.

When hpa says that SandBox Mode is "an enormous step in the wrong direction", I want to understand why this direction is wrong, so I can take a step in the right direction next time. So far there has been only one objective concern: the need to track code (and data) dependencies explicitly. AFAICS this is an inherent drawback of any kind of program decomposition. Is decomposition considered harmful?

Petr T
On 2/22/24 09:57, Petr Tesařík wrote:
> * Hardware designers are adding (other) hardware security defenses to
>   ring-0 that are not applied to ring-3.
>
>   Could you give an example of these other security defenses, please?

Here's one example:

https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/data-dependent-prefetcher.html

"DDP is neither trained by nor triggered by supervisor-mode accesses."

But seriously, this is going to be my last message on this topic. I appreciate your enthusiasm, but I don't see any viable way forward for this approach.
diff --git a/arch/x86/kernel/sbm/core.c b/arch/x86/kernel/sbm/core.c
index c8ac7ecb08cc..3cf3842292b9 100644
--- a/arch/x86/kernel/sbm/core.c
+++ b/arch/x86/kernel/sbm/core.c
@@ -20,6 +20,12 @@
 #include <linux/sbm.h>
 #include <linux/sched/task_stack.h>
 
+/*
+ * FIXME: Remove these includes when there is proper API for defining
+ * which functions can be called from sandbox mode.
+ */
+#include <linux/vmalloc.h>
+
 #define GFP_SBM_PGTABLE	(GFP_KERNEL | __GFP_ZERO)
 #define PGD_ORDER	get_order(sizeof(pgd_t) * PTRS_PER_PGD)
 
@@ -52,8 +58,83 @@ struct sbm_fixup {
 	sbm_proxy_call_fn proxy;
 };
 
+static int map_range(struct x86_sbm_state *state, unsigned long start,
+		     unsigned long end, pgprot_t prot);
+
+/* Map the newly allocated dynamic memory region. */
+static unsigned long post_alloc(struct x86_sbm_state *state,
+				unsigned long objp, size_t size)
+{
+	int err;
+
+	if (!objp)
+		return objp;
+
+	err = map_range(state, objp, objp + size, PAGE_SHARED);
+	if (err) {
+		kfree((void*)objp);
+		return 0UL;
+	}
+	return objp;
+}
+
+/* Allocation proxy handler if size is the 1st parameter. */
+static unsigned long proxy_alloc1(struct x86_sbm_state *state,
+				  unsigned long func, struct pt_regs *regs)
+{
+	unsigned long objp;
+
+	objp = x86_sbm_proxy_call(state, func, regs);
+	return post_alloc(state, objp, regs->di);
+}
+
+/* Allocation proxy handler if size is the 2nd parameter. */
+static unsigned long proxy_alloc2(struct x86_sbm_state *state,
+				  unsigned long func, struct pt_regs *regs)
+{
+	unsigned long objp;
+
+	objp = x86_sbm_proxy_call(state, func, regs);
+	return post_alloc(state, objp, regs->si);
+}
+
+/* Allocation proxy handler if size is the 3rd parameter. */
+static unsigned long proxy_alloc3(struct x86_sbm_state *state,
+				  unsigned long func, struct pt_regs *regs)
+{
+	unsigned long objp;
+
+	objp = x86_sbm_proxy_call(state, func, regs);
+	return post_alloc(state, objp, regs->dx);
+}
+
+/* Proxy handler to free previously allocated memory. */
+static unsigned long proxy_free(struct x86_sbm_state *state,
+				unsigned long func, struct pt_regs *regs)
+{
+	/* TODO: unmap allocated addresses from sandbox! */
+	return x86_sbm_proxy_call(state, func, regs);
+}
+
 static const struct sbm_fixup fixups[] =
 {
+	/* kmalloc() and friends */
+	{ kmalloc_trace, proxy_alloc3 },
+	{ __kmalloc, proxy_alloc1 },
+	{ __kmalloc_node, proxy_alloc1 },
+	{ __kmalloc_node_track_caller, proxy_alloc1 },
+	{ kmalloc_large, proxy_alloc1 },
+	{ kmalloc_large_node, proxy_alloc1 },
+	{ krealloc, proxy_alloc2 },
+	{ kfree, proxy_free },
+
+	/* vmalloc() and friends */
+	{ vmalloc, proxy_alloc1 },
+	{ __vmalloc, proxy_alloc1 },
+	{ __vmalloc_node, proxy_alloc1 },
+	{ vzalloc, proxy_alloc1 },
+	{ vfree, proxy_free },
+
 	{ }
 };
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 238293b1dbe1..2b72118d9bfa 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -28,6 +28,7 @@
 #include <asm/page.h>
 #include <linux/memcontrol.h>
 #include <linux/stackdepot.h>
+#include <linux/sbm.h>
 
 #include "internal.h"
 #include "slab.h"
@@ -1208,7 +1209,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
  *
  * Return: pointer to the allocated memory or %NULL in case of error
  */
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
+void * __nosbm krealloc(const void *p, size_t new_size, gfp_t flags)
 {
 	void *ret;
 
diff --git a/mm/slub.c b/mm/slub.c
index 2ef88bbf56a3..5f2290fe4df0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -42,6 +42,7 @@
 #include <kunit/test.h>
 #include <kunit/test-bug.h>
 #include <linux/sort.h>
+#include <linux/sbm.h>
 
 #include <linux/debugfs.h>
 #include <trace/events/kmem.h>
@@ -3913,7 +3914,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
  * directly to the page allocator. We use __GFP_COMP, because we will need to
  * know the allocation order to free the pages properly in kfree.
  */
-static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
+static void * __nosbm __kmalloc_large_node(size_t size, gfp_t flags, int node)
 {
 	struct folio *folio;
 	void *ptr = NULL;
@@ -3938,7 +3939,7 @@ static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
 	return ptr;
 }
 
-void *kmalloc_large(size_t size, gfp_t flags)
+void * __nosbm kmalloc_large(size_t size, gfp_t flags)
 {
 	void *ret = __kmalloc_large_node(size, flags, NUMA_NO_NODE);
 
@@ -3983,26 +3984,26 @@ void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
 	return ret;
 }
 
-void *__kmalloc_node(size_t size, gfp_t flags, int node)
+void * __nosbm __kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	return __do_kmalloc_node(size, flags, node, _RET_IP_);
 }
 EXPORT_SYMBOL(__kmalloc_node);
 
-void *__kmalloc(size_t size, gfp_t flags)
+void * __nosbm __kmalloc(size_t size, gfp_t flags)
 {
 	return __do_kmalloc_node(size, flags, NUMA_NO_NODE, _RET_IP_);
 }
 EXPORT_SYMBOL(__kmalloc);
 
-void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
-				  int node, unsigned long caller)
+void * __nosbm __kmalloc_node_track_caller(size_t size, gfp_t flags,
+					   int node, unsigned long caller)
 {
 	return __do_kmalloc_node(size, flags, node, caller);
 }
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 
-void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
+void * __nosbm kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
 {
 	void *ret = slab_alloc_node(s, NULL, gfpflags,
 				    NUMA_NO_NODE, _RET_IP_, size);
@@ -4386,7 +4387,7 @@ static void free_large_kmalloc(struct folio *folio, void *object)
  *
  * If @object is NULL, no operation is performed.
  */
-void kfree(const void *object)
+void __nosbm kfree(const void *object)
 {
 	struct folio *folio;
 	struct slab *slab;
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..d7a5b715ac03 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -40,6 +40,7 @@
 #include <linux/pgtable.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/mm.h>
+#include <linux/sbm.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
@@ -2804,7 +2805,7 @@ void vfree_atomic(const void *addr)
 * if we have CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, but making the calling
 * conventions for vfree() arch-dependent would be a really bad idea).
 */
-void vfree(const void *addr)
+void __nosbm vfree(const void *addr)
 {
 	struct vm_struct *vm;
 	int i;
@@ -3379,7 +3380,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 *
 * Return: pointer to the allocated memory or %NULL on error
 */
-void *__vmalloc_node(unsigned long size, unsigned long align,
+void * __nosbm __vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, int node, const void *caller)
 {
 	return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
@@ -3394,7 +3395,7 @@ void *__vmalloc_node(unsigned long size, unsigned long align,
 EXPORT_SYMBOL_GPL(__vmalloc_node);
 #endif
 
-void *__vmalloc(unsigned long size, gfp_t gfp_mask)
+void * __nosbm __vmalloc(unsigned long size, gfp_t gfp_mask)
 {
 	return __vmalloc_node(size, 1, gfp_mask, NUMA_NO_NODE,
 				__builtin_return_address(0));
@@ -3413,7 +3414,7 @@ EXPORT_SYMBOL(__vmalloc);
 *
 * Return: pointer to the allocated memory or %NULL on error
 */
-void *vmalloc(unsigned long size)
+void * __nosbm vmalloc(unsigned long size)
 {
 	return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
 			__builtin_return_address(0));
@@ -3453,7 +3454,7 @@ EXPORT_SYMBOL_GPL(vmalloc_huge);
 *
 * Return: pointer to the allocated memory or %NULL on error
 */
-void *vzalloc(unsigned long size)
+void * __nosbm vzalloc(unsigned long size)
 {
 	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE,
 			__builtin_return_address(0));
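All the allocation proxies in the patch funnel into post_alloc(): call the real allocator, then map the new object's range into the sandbox, and free the object if mapping fails. The pattern can be sketched in plain user-space C (the sandbox struct and this map_range() are simplified stand-ins for the kernel's page-table machinery, not its API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the sandbox mapping step (map_range() in the patch).
 * Here it just records the range; the real code installs page-table
 * entries with PAGE_SHARED protection and can fail. */
struct sandbox {
	unsigned long start[16], end[16];
	int nranges;
	int fail_mapping; /* test knob: force the error path */
};

static int map_range(struct sandbox *sb, unsigned long start,
		     unsigned long end)
{
	if (sb->fail_mapping || sb->nranges == 16)
		return -1;
	sb->start[sb->nranges] = start;
	sb->end[sb->nranges] = end;
	sb->nranges++;
	return 0;
}

/* Mirror of the post_alloc() pattern: on mapping failure the fresh
 * object is freed and NULL is returned, so the sandboxed caller never
 * sees memory it could not actually access. */
static void *sandbox_alloc(struct sandbox *sb, size_t size)
{
	void *objp = malloc(size);

	if (!objp)
		return NULL;
	if (map_range(sb, (unsigned long)objp, (unsigned long)objp + size)) {
		free(objp);
		return NULL;
	}
	return objp;
}
```

Note that the corresponding proxy_free() in the patch still carries a TODO to unmap the range, which this sketch likewise omits.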