Message ID | 20230526051529.3387103-2-song@kernel.org (mailing list archive)
---|---
State | Changes Requested
Delegated to | BPF
Series | Type aware module allocator
Hi--

On 5/25/23 22:15, Song Liu wrote:
> +/**
> + * struct vmalloc_params - Parameters to call __vmalloc_node_range()
> + * @start: Address space range start
> + * @end: Address space range end
> + * @gfp_mask: The gfp_t mask used for this range
> + * @pgprot: The page protection for this range
> + * @vm_flags The vm_flag used for this range
> + */
> +struct vmalloc_params {
> +	unsigned long start;
> +	unsigned long end;
> +	gfp_t gfp_mask;
> +	pgprot_t pgprot;
> +	unsigned long vm_flags;
> +};

@vm_flags is missing the ':' that kernel-doc expects; it should be "@vm_flags:".
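For reference, the kernel-doc-clean version of that comment would read as below (the only changes are the added colon and s/vm_flag/vm_flags/ in the description):

/**
 * struct vmalloc_params - Parameters to call __vmalloc_node_range()
 * @start: Address space range start
 * @end: Address space range end
 * @gfp_mask: The gfp_t mask used for this range
 * @pgprot: The page protection for this range
 * @vm_flags: The vm_flags used for this range
 */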
On Thu, May 25, 2023 at 10:15:27PM -0700, Song Liu wrote: > Introduce memory type aware module_alloc_type, which provides a unified > allocator for all different archs. This work was discussed in [1]. > > Each arch can configure the allocator to do the following: > > 1. Specify module_vaddr and module_end > 2. Random module start address for KASLR > 3. kasan_alloc_module_shadow() > 4. kasan_reset_tag() > 5. Preferred and secondary module address ranges > > enum mod_alloc_params_flags are used to control the behavior of > module_alloc_type. Specifically: MOD_ALLOC_FALLBACK let module_alloc_type > fallback to existing module_alloc. MOD_ALLOC_SET_MEMORY let > module_alloc_type to protect the memory before returning to the user. > > module_allocator_init() call is added to start_kernel() to initialize > module_alloc_type. > ... > +/** > + * struct vmalloc_params - Parameters to call __vmalloc_node_range() > + * @start: Address space range start > + * @end: Address space range end > + * @gfp_mask: The gfp_t mask used for this range > + * @pgprot: The page protection for this range > + * @vm_flags The vm_flag used for this range > + */ > +struct vmalloc_params { > + unsigned long start; > + unsigned long end; > + gfp_t gfp_mask; > + pgprot_t pgprot; > + unsigned long vm_flags; > +}; > + > +/** > + * struct mod_alloc_params - Parameters for module allocation type > + * @flags: Properties in mod_alloc_params_flags > + * @granularity: The allocation granularity (PAGE/PMD) in bytes > + * @alignment: The allocation alignment requirement > + * @vmp: Parameters used to call vmalloc > + * @fill: Function to fill allocated space. If NULL, use memcpy() > + * @invalidate: Function to invalidate memory space. > + * > + * If @granularity > @alignment the allocation can reuse free space in > + * previously allocated pages. If they are the same, then fresh pages > + * have to be allocated. > + */ > +struct mod_alloc_params { > + unsigned int flags; > + unsigned int granularity; > + unsigned int alignment; > + struct vmalloc_params vmp[MOD_MAX_ADDR_SPACES]; > + void * (*fill)(void *dst, const void *src, size_t len); > + void * (*invalidate)(void *ptr, size_t len); > +}; > + > +struct mod_type_allocator { > + struct mod_alloc_params params; > +}; > + > +struct mod_allocators { > + struct mod_type_allocator *types[MOD_MEM_NUM_TYPES]; > +}; > + > +void *module_alloc_type(size_t size, enum mod_mem_type type); > +void module_memfree_type(void *ptr, enum mod_mem_type type); > +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type); > +void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type); > +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type); > +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type); > +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type); > +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type); > +void module_alloc_type_init(struct mod_allocators *allocators); This is a pretty big and complicated interface, and I haven't been able to find any reasoning or justification for why it's needed. Why is this going in module.h? Don't do that, this is supposed to be a general purpose allocator for executable memory so start a new file. What is vmalloc_params doing here? Why is it needed? Can we just get rid of it? 
module_memory_protect() and module_memory_unprotect() looks like a completely broken interface for supporting sub-page allocations - different threads with different allocations on the same page will race. The text_poke() abstraction works; the exposed interface should look like that. Please kill MOD_ALLOC_SET_MEMORY, and _only_ return memory that is read-only. You should be able to kill mod_alloc_params entirely. Our other memory allocators don't expose kasan - why does this one? Again, that looks wrong. I would like to see something much simpler (that doesn't encode awkward assumptions from module and bpf!): please look at the code I sketched out below and tell me why this interface won't work - or if it can, go with _that_. commit 80ceffd6b1108afc7593bdf060954c73eaf0ffc4 Author: Kent Overstreet <kent.overstreet@linux.dev> Date: Wed May 17 01:22:06 2023 -0400 mm: jit/text allocator This provides a new, very simple slab allocator for jit/text, i.e. bpf, ftrace trampolines, or bcachefs unpack functions. With this API we can avoid ever mapping pages both writeable and executable (not implemented in this patch: need to tweak module_alloc()), and it also supports sub-page sized allocations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h new file mode 100644 index 0000000000..f1549d60e8 --- /dev/null +++ b/include/linux/jitalloc.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_JITALLOC_H +#define _LINUX_JITALLOC_H + +void jit_update(void *buf, void *new_buf, size_t len); +void jit_free(void *buf); +void *jit_alloc(void *buf, size_t len); + +#endif /* _LINUX_JITALLOC_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 4751031f3f..ff26a4f0c9 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS This option has a per-memcg and per-node memory overhead. 
# } +config JITALLOC + bool + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index c03e1e5859..25e82db9e8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o +obj-$(CONFIG_JITALLOC) += jitalloc.o diff --git a/mm/jitalloc.c b/mm/jitalloc.c new file mode 100644 index 0000000000..7c4d621802 --- /dev/null +++ b/mm/jitalloc.c @@ -0,0 +1,187 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/gfp.h> +#include <linux/highmem.h> +#include <linux/jitalloc.h> +#include <linux/mm.h> +#include <linux/moduleloader.h> +#include <linux/mutex.h> +#include <linux/set_memory.h> +#include <linux/vmalloc.h> + +#include <asm/text-patching.h> + +static DEFINE_MUTEX(jit_alloc_lock); + +struct jit_cache { + unsigned obj_size_bits; + unsigned objs_per_slab; + struct list_head partial; +}; + +#define JITALLOC_MIN_SIZE 16 +#define NR_JIT_CACHES ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE) + +static struct jit_cache jit_caches[NR_JIT_CACHES]; + +struct jit_slab { + unsigned long __page_flags; + + struct jit_cache *cache; + void *executably_mapped;; + unsigned long *objs_allocated; /* bitmap of free objects */ + struct list_head list; +}; + +#define folio_jit_slab(folio) (_Generic((folio), \ + const struct folio *: (const struct jit_slab *)(folio), \ + struct folio *: (struct jit_slab *)(folio))) + +#define jit_slab_folio(s) (_Generic((s), \ + const struct jit_slab *: (const struct folio *)s, \ + struct jit_slab *: (struct folio *)s)) + +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache) +{ + void *executably_mapped = module_alloc(PAGE_SIZE); + struct page *page; + struct folio *folio; + struct jit_slab *slab; + unsigned long *objs_allocated; + + if (!executably_mapped) + return NULL; + + objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL); + if (!objs_allocated ) { + vfree(executably_mapped); + return NULL; + } + + set_vm_flush_reset_perms(executably_mapped); + set_memory_rox((unsigned long) executably_mapped, 1); + + page = vmalloc_to_page(executably_mapped); + folio = page_folio(page); + + __folio_set_slab(folio); + slab = folio_jit_slab(folio); + slab->cache = cache; + slab->executably_mapped = executably_mapped; + slab->objs_allocated = objs_allocated; + INIT_LIST_HEAD(&slab->list); + + return slab; +} + +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache) +{ + struct jit_slab *s = + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?: + jit_slab_alloc(cache); + unsigned obj_idx, nr_allocated; + + if (!s) + return NULL; + + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab); + + BUG_ON(obj_idx >= cache->objs_per_slab); + __set_bit(obj_idx, s->objs_allocated); + + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); + + if (nr_allocated == s->cache->objs_per_slab) { + list_del_init(&s->list); + } else if (nr_allocated == 1) { + list_del(&s->list); + list_add(&s->list, &s->cache->partial); + } + + return s->executably_mapped + (obj_idx << cache->obj_size_bits); +} + +void jit_update(void *buf, void *new_buf, size_t len) +{ + text_poke_copy(buf, new_buf, len); +} +EXPORT_SYMBOL_GPL(jit_update); + +void jit_free(void *buf) +{ + struct page *page; + struct folio *folio; + struct jit_slab *s; + unsigned obj_idx, nr_allocated; + size_t offset; + + if (!buf) + 
return; + + page = vmalloc_to_page(buf); + folio = page_folio(page); + offset = offset_in_folio(folio, buf); + + if (!folio_test_slab(folio)) { + vfree(buf); + return; + } + + s = folio_jit_slab(folio); + + mutex_lock(&jit_alloc_lock); + obj_idx = offset >> s->cache->obj_size_bits; + + __clear_bit(obj_idx, s->objs_allocated); + + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); + + if (nr_allocated == 0) { + list_del(&s->list); + kfree(s->objs_allocated); + folio_put(folio); + } else if (nr_allocated + 1 == s->cache->objs_per_slab) { + list_del(&s->list); + list_add(&s->list, &s->cache->partial); + } + + mutex_unlock(&jit_alloc_lock); +} +EXPORT_SYMBOL_GPL(jit_free); + +void *jit_alloc(void *buf, size_t len) +{ + unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16); + void *p; + + if (jit_cache_idx < NR_JIT_CACHES) { + mutex_lock(&jit_alloc_lock); + p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]); + mutex_unlock(&jit_alloc_lock); + } else { + p = module_alloc(len); + if (p) { + set_vm_flush_reset_perms(p); + set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE)); + } + } + + if (p && buf) + jit_update(p, buf, len); + + return p; +} +EXPORT_SYMBOL_GPL(jit_alloc); + +static int __init jit_alloc_init(void) +{ + for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) { + jit_caches[i].obj_size_bits = ilog2(JITALLOC_MIN_SIZE) + i; + jit_caches[i].objs_per_slab = PAGE_SIZE >> jit_caches[i].obj_size_bits; + + INIT_LIST_HEAD(&jit_caches[i].partial); + } + + return 0; +} +core_initcall(jit_alloc_init);
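To make the size of the proposed surface concrete, a hypothetical caller - say, a JIT publishing a compiled image - would use the three functions as in this sketch (publish_image() is a made-up name for illustration):

#include <linux/jitalloc.h>

/*
 * Compile into a scratch buffer, then publish it: jit_alloc() copies
 * @scratch into an allocation that is only ever mapped read-only +
 * executable; all later writes go through jit_update(), never through
 * a writable mapping of the executable address.
 */
static void *publish_image(void *scratch, size_t len)
{
	void *image = jit_alloc(scratch, len);

	if (!image)
		return NULL;

	/* Example fixup: patch 4 bytes at offset 16. */
	/* jit_update(image + 16, &new_val, 4); */

	return image;
}

/* Teardown is symmetric: jit_free(image); */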
On Fri, May 26, 2023 at 3:30 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 25, 2023 at 10:15:27PM -0700, Song Liu wrote: > > Introduce memory type aware module_alloc_type, which provides a unified > > allocator for all different archs. This work was discussed in [1]. > > > > Each arch can configure the allocator to do the following: > > > > 1. Specify module_vaddr and module_end > > 2. Random module start address for KASLR > > 3. kasan_alloc_module_shadow() > > 4. kasan_reset_tag() > > 5. Preferred and secondary module address ranges > > > > enum mod_alloc_params_flags are used to control the behavior of > > module_alloc_type. Specifically: MOD_ALLOC_FALLBACK let module_alloc_type > > fallback to existing module_alloc. MOD_ALLOC_SET_MEMORY let > > module_alloc_type to protect the memory before returning to the user. > > > > module_allocator_init() call is added to start_kernel() to initialize > > module_alloc_type. > > > > ... > > > +/** > > + * struct vmalloc_params - Parameters to call __vmalloc_node_range() > > + * @start: Address space range start > > + * @end: Address space range end > > + * @gfp_mask: The gfp_t mask used for this range > > + * @pgprot: The page protection for this range > > + * @vm_flags The vm_flag used for this range > > + */ > > +struct vmalloc_params { > > + unsigned long start; > > + unsigned long end; > > + gfp_t gfp_mask; > > + pgprot_t pgprot; > > + unsigned long vm_flags; > > +}; > > + > > +/** > > + * struct mod_alloc_params - Parameters for module allocation type > > + * @flags: Properties in mod_alloc_params_flags > > + * @granularity: The allocation granularity (PAGE/PMD) in bytes > > + * @alignment: The allocation alignment requirement > > + * @vmp: Parameters used to call vmalloc > > + * @fill: Function to fill allocated space. If NULL, use memcpy() > > + * @invalidate: Function to invalidate memory space. > > + * > > + * If @granularity > @alignment the allocation can reuse free space in > > + * previously allocated pages. If they are the same, then fresh pages > > + * have to be allocated. > > + */ > > +struct mod_alloc_params { > > + unsigned int flags; > > + unsigned int granularity; > > + unsigned int alignment; > > + struct vmalloc_params vmp[MOD_MAX_ADDR_SPACES]; > > + void * (*fill)(void *dst, const void *src, size_t len); > > + void * (*invalidate)(void *ptr, size_t len); > > +}; > > + > > +struct mod_type_allocator { > > + struct mod_alloc_params params; > > +}; > > + > > +struct mod_allocators { > > + struct mod_type_allocator *types[MOD_MEM_NUM_TYPES]; > > +}; > > + > > +void *module_alloc_type(size_t size, enum mod_mem_type type); > > +void module_memfree_type(void *ptr, enum mod_mem_type type); > > +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type); > > +void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type); > > +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type); > > +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type); > > +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type); > > +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type); > > +void module_alloc_type_init(struct mod_allocators *allocators); > > This is a pretty big and complicated interface, and I haven't been able > to find any reasoning or justification for why it's needed. > > Why is this going in module.h? 
Don't do that, this is supposed to be a > general purpose allocator for executable memory so start a new file. The goal of this work is to build a memory allocator for modules, text, rw data, and ro data. So it is not the same as execmem_alloc or jitalloc. > > What is vmalloc_params doing here? Why is it needed? Can we just get rid > of it? We need it to have an interface for all archs. They are all using different variations of vmalloc for module_alloc. > > module_memory_protect() and module_memory_unprotect() looks like a > completely broken interface for supporting sub-page allocations - > different threads with different allocations on the same page will race. These two APIs allow the core code work with all archs. They won't break sub-page allocations. (not all archs will have sub-page allocations) OTOH, the "force" version of the two APIs should be removed later. In this set, we only need them for arch_prepare_bpf_trampoline(). I plan to remove this limitation for x86 soon. > > The text_poke() abstraction works; the exposed interface should look > like that. > > Please kill MOD_ALLOC_SET_MEMORY, and _only_ return memory that is > read-only. You should be able to kill mod_alloc_params entirely. > > Our other memory allocators don't expose kasan - why does this one? > Again, that looks wrong. > > I would like to see something much simpler (that doesn't encode awkward > assumptions from module and bpf!): please look at the code I sketched > out below and tell me why this interface won't work - or if it can, go > with _that_. I think we need to align the goal here. PS: we did align the goal about 6 months ago when I proposed the execmem_alloc() set. My current goal is to build an allocator for all the use cases of modules, text, data, rodata, etc. Then the same allocator can be used by other dynamic kernel text: bpf, ftrace, kprobe, bcachefs, etc. OTOH, jit_alloc (as-is) solves a subset of the problem (only the text). We can probably use it (with some updates) instead of the vmap_area based allocator. But I guess we need to align the direction first. Thanks, Song > > commit 80ceffd6b1108afc7593bdf060954c73eaf0ffc4 > Author: Kent Overstreet <kent.overstreet@linux.dev> > Date: Wed May 17 01:22:06 2023 -0400 > > mm: jit/text allocator > > This provides a new, very simple slab allocator for jit/text, i.e. bpf, > ftrace trampolines, or bcachefs unpack functions. > > With this API we can avoid ever mapping pages both writeable and > executable (not implemented in this patch: need to tweak > module_alloc()), and it also supports sub-page sized allocations. > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > > diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h > new file mode 100644 > index 0000000000..f1549d60e8 > --- /dev/null > +++ b/include/linux/jitalloc.h > @@ -0,0 +1,9 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _LINUX_JITALLOC_H > +#define _LINUX_JITALLOC_H > + > +void jit_update(void *buf, void *new_buf, size_t len); > +void jit_free(void *buf); > +void *jit_alloc(void *buf, size_t len); > + > +#endif /* _LINUX_JITALLOC_H */ > diff --git a/mm/Kconfig b/mm/Kconfig > index 4751031f3f..ff26a4f0c9 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS > This option has a per-memcg and per-node memory overhead. 
> # } > > +config JITALLOC > + bool > + > source "mm/damon/Kconfig" > > endmenu > diff --git a/mm/Makefile b/mm/Makefile > index c03e1e5859..25e82db9e8 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o > obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o > obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o > obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o > +obj-$(CONFIG_JITALLOC) += jitalloc.o > diff --git a/mm/jitalloc.c b/mm/jitalloc.c > new file mode 100644 > index 0000000000..7c4d621802 > --- /dev/null > +++ b/mm/jitalloc.c > @@ -0,0 +1,187 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +#include <linux/gfp.h> > +#include <linux/highmem.h> > +#include <linux/jitalloc.h> > +#include <linux/mm.h> > +#include <linux/moduleloader.h> > +#include <linux/mutex.h> > +#include <linux/set_memory.h> > +#include <linux/vmalloc.h> > + > +#include <asm/text-patching.h> > + > +static DEFINE_MUTEX(jit_alloc_lock); > + > +struct jit_cache { > + unsigned obj_size_bits; > + unsigned objs_per_slab; > + struct list_head partial; > +}; > + > +#define JITALLOC_MIN_SIZE 16 > +#define NR_JIT_CACHES ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE) > + > +static struct jit_cache jit_caches[NR_JIT_CACHES]; > + > +struct jit_slab { > + unsigned long __page_flags; > + > + struct jit_cache *cache; > + void *executably_mapped;; > + unsigned long *objs_allocated; /* bitmap of free objects */ > + struct list_head list; > +}; > + > +#define folio_jit_slab(folio) (_Generic((folio), \ > + const struct folio *: (const struct jit_slab *)(folio), \ > + struct folio *: (struct jit_slab *)(folio))) > + > +#define jit_slab_folio(s) (_Generic((s), \ > + const struct jit_slab *: (const struct folio *)s, \ > + struct jit_slab *: (struct folio *)s)) > + > +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache) > +{ > + void *executably_mapped = module_alloc(PAGE_SIZE); > + struct page *page; > + struct folio *folio; > + struct jit_slab *slab; > + unsigned long *objs_allocated; > + > + if (!executably_mapped) > + return NULL; > + > + objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL); > + if (!objs_allocated ) { > + vfree(executably_mapped); > + return NULL; > + } > + > + set_vm_flush_reset_perms(executably_mapped); > + set_memory_rox((unsigned long) executably_mapped, 1); > + > + page = vmalloc_to_page(executably_mapped); > + folio = page_folio(page); > + > + __folio_set_slab(folio); > + slab = folio_jit_slab(folio); > + slab->cache = cache; > + slab->executably_mapped = executably_mapped; > + slab->objs_allocated = objs_allocated; > + INIT_LIST_HEAD(&slab->list); > + > + return slab; > +} > + > +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache) > +{ > + struct jit_slab *s = > + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?: > + jit_slab_alloc(cache); > + unsigned obj_idx, nr_allocated; > + > + if (!s) > + return NULL; > + > + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab); > + > + BUG_ON(obj_idx >= cache->objs_per_slab); > + __set_bit(obj_idx, s->objs_allocated); > + > + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); > + > + if (nr_allocated == s->cache->objs_per_slab) { > + list_del_init(&s->list); > + } else if (nr_allocated == 1) { > + list_del(&s->list); > + list_add(&s->list, &s->cache->partial); > + } > + > + return s->executably_mapped + (obj_idx << cache->obj_size_bits); > +} > + > +void jit_update(void *buf, void *new_buf, 
size_t len) > +{ > + text_poke_copy(buf, new_buf, len); > +} > +EXPORT_SYMBOL_GPL(jit_update); > + > +void jit_free(void *buf) > +{ > + struct page *page; > + struct folio *folio; > + struct jit_slab *s; > + unsigned obj_idx, nr_allocated; > + size_t offset; > + > + if (!buf) > + return; > + > + page = vmalloc_to_page(buf); > + folio = page_folio(page); > + offset = offset_in_folio(folio, buf); > + > + if (!folio_test_slab(folio)) { > + vfree(buf); > + return; > + } > + > + s = folio_jit_slab(folio); > + > + mutex_lock(&jit_alloc_lock); > + obj_idx = offset >> s->cache->obj_size_bits; > + > + __clear_bit(obj_idx, s->objs_allocated); > + > + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); > + > + if (nr_allocated == 0) { > + list_del(&s->list); > + kfree(s->objs_allocated); > + folio_put(folio); > + } else if (nr_allocated + 1 == s->cache->objs_per_slab) { > + list_del(&s->list); > + list_add(&s->list, &s->cache->partial); > + } > + > + mutex_unlock(&jit_alloc_lock); > +} > +EXPORT_SYMBOL_GPL(jit_free); > + > +void *jit_alloc(void *buf, size_t len) > +{ > + unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16); > + void *p; > + > + if (jit_cache_idx < NR_JIT_CACHES) { > + mutex_lock(&jit_alloc_lock); > + p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]); > + mutex_unlock(&jit_alloc_lock); > + } else { > + p = module_alloc(len); > + if (p) { > + set_vm_flush_reset_perms(p); > + set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE)); > + } > + } > + > + if (p && buf) > + jit_update(p, buf, len); > + > + return p; > +} > +EXPORT_SYMBOL_GPL(jit_alloc); > + > +static int __init jit_alloc_init(void) > +{ > + for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) { > + jit_caches[i].obj_size_bits = ilog2(JITALLOC_MIN_SIZE) + i; > + jit_caches[i].objs_per_slab = PAGE_SIZE >> jit_caches[i].obj_size_bits; > + > + INIT_LIST_HEAD(&jit_caches[i].partial); > + } > + > + return 0; > +} > +core_initcall(jit_alloc_init);
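Song's point about arch variation can be made concrete: each arch's module_alloc() today is a differently-parameterized __vmalloc_node_range() call, which is exactly what vmalloc_params captures. The following is a rough sketch, not taken from the posted series (the arch-side patch 3/3 is not shown in this thread), of how x86 might fill it in. MODULES_VADDR, MODULES_END, MODULE_ALIGN, the VM_ flags and get_module_load_offset() (x86's KASLR helper) come from current x86 module code; the rest is guesswork:

/* Hypothetical x86 override of the __weak module_alloc_type_init() */
static struct mod_type_allocator x86_mod_allocator;

void __init module_alloc_type_init(struct mod_allocators *allocators)
{
	struct mod_alloc_params *p = &x86_mod_allocator.params;
	struct vmalloc_params *vmp = &p->vmp[0];

	p->flags = MOD_ALLOC_SET_MEMORY;	/* return memory already protected */
	p->granularity = PAGE_SIZE;
	p->alignment = MODULE_ALIGN;

	/* What x86's module_alloc() passes to __vmalloc_node_range() today: */
	vmp->start = MODULES_VADDR + get_module_load_offset();
	vmp->end = MODULES_END;
	vmp->gfp_mask = GFP_KERNEL;
	vmp->pgprot = PAGE_KERNEL;
	vmp->vm_flags = VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK;

	for_each_mod_mem_type(type)
		allocators->types[type] = &x86_mod_allocator;
}

A second vmp[] entry (MOD_MAX_ADDR_SPACES is 2) is how, for example, arm64's fallback from its preferred 128M module region to the wider range would be expressed.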
On Fri, May 26, 2023 at 04:09:01PM -0700, Song Liu wrote: > On Fri, May 26, 2023 at 3:30 PM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Thu, May 25, 2023 at 10:15:27PM -0700, Song Liu wrote: > > > Introduce memory type aware module_alloc_type, which provides a unified > > > allocator for all different archs. This work was discussed in [1]. > > > > > > Each arch can configure the allocator to do the following: > > > > > > 1. Specify module_vaddr and module_end > > > 2. Random module start address for KASLR > > > 3. kasan_alloc_module_shadow() > > > 4. kasan_reset_tag() > > > 5. Preferred and secondary module address ranges > > > > > > enum mod_alloc_params_flags are used to control the behavior of > > > module_alloc_type. Specifically: MOD_ALLOC_FALLBACK let module_alloc_type > > > fallback to existing module_alloc. MOD_ALLOC_SET_MEMORY let > > > module_alloc_type to protect the memory before returning to the user. > > > > > > module_allocator_init() call is added to start_kernel() to initialize > > > module_alloc_type. > > > > > > > ... > > > > > +/** > > > + * struct vmalloc_params - Parameters to call __vmalloc_node_range() > > > + * @start: Address space range start > > > + * @end: Address space range end > > > + * @gfp_mask: The gfp_t mask used for this range > > > + * @pgprot: The page protection for this range > > > + * @vm_flags The vm_flag used for this range > > > + */ > > > +struct vmalloc_params { > > > + unsigned long start; > > > + unsigned long end; > > > + gfp_t gfp_mask; > > > + pgprot_t pgprot; > > > + unsigned long vm_flags; > > > +}; > > > + > > > +/** > > > + * struct mod_alloc_params - Parameters for module allocation type > > > + * @flags: Properties in mod_alloc_params_flags > > > + * @granularity: The allocation granularity (PAGE/PMD) in bytes > > > + * @alignment: The allocation alignment requirement > > > + * @vmp: Parameters used to call vmalloc > > > + * @fill: Function to fill allocated space. If NULL, use memcpy() > > > + * @invalidate: Function to invalidate memory space. > > > + * > > > + * If @granularity > @alignment the allocation can reuse free space in > > > + * previously allocated pages. If they are the same, then fresh pages > > > + * have to be allocated. 
> > > + */ > > > +struct mod_alloc_params { > > > + unsigned int flags; > > > + unsigned int granularity; > > > + unsigned int alignment; > > > + struct vmalloc_params vmp[MOD_MAX_ADDR_SPACES]; > > > + void * (*fill)(void *dst, const void *src, size_t len); > > > + void * (*invalidate)(void *ptr, size_t len); > > > +}; > > > + > > > +struct mod_type_allocator { > > > + struct mod_alloc_params params; > > > +}; > > > + > > > +struct mod_allocators { > > > + struct mod_type_allocator *types[MOD_MEM_NUM_TYPES]; > > > +}; > > > + > > > +void *module_alloc_type(size_t size, enum mod_mem_type type); > > > +void module_memfree_type(void *ptr, enum mod_mem_type type); > > > +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type); > > > +void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type); > > > +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type); > > > +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type); > > > +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type); > > > +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type); > > > +void module_alloc_type_init(struct mod_allocators *allocators); > > > > This is a pretty big and complicated interface, and I haven't been able > > to find any reasoning or justification for why it's needed. > > > > Why is this going in module.h? Don't do that, this is supposed to be a > > general purpose allocator for executable memory so start a new file. > > The goal of this work is to build a memory allocator for modules, text, > rw data, and ro data. So it is not the same as execmem_alloc or jitalloc. > > > > > What is vmalloc_params doing here? Why is it needed? Can we just get rid > > of it? > > We need it to have an interface for all archs. They are all using different > variations of vmalloc for module_alloc. But it should be an internal implementation detail, I don't think we want the external interface tied to vmalloc - > These two APIs allow the core code work with all archs. They won't break > sub-page allocations. (not all archs will have sub-page allocations) So yes, text_poke() doesn't work on all architectures, and we need a fallback. But if the fallback needs to go the unprotect/protect route, I think we need to put that in the helper, and not expose it. Part of what people are wanting is to limit or eliminate pages that are RWX, so we definitely shouldn't be creating new interfaces to flipping page permissions: that should be pushed down to as low a level as possible. E.g., with my jitalloc_update(), a more complete version would look like void jitalloc_update(void *dst, void *src, size_t len) { if (text_poke_available) { text_poke(...); } else { unprotect(); memcpy(); protect(); } } See? We provide a simpler, safer interface, and this also lets us handle multiple threads racing to update/flip permissions on the same page in a single place (e.g. with sticking a lock somewhere in the jitalloc structures). Or ideally, we'd have a kmap_local() variant that actually creates a new mapping on all architectures, and we'd just use that to do the update without any arch-dependent code - if we get that, having this helper means we'll only have to change a single place later. > > OTOH, the "force" version of the two APIs should be removed later. In this > set, we only need them for arch_prepare_bpf_trampoline(). I plan to remove > this limitation for x86 soon. 
> > > > > The text_poke() abstraction works; the exposed interface should look > > like that. > > > > Please kill MOD_ALLOC_SET_MEMORY, and _only_ return memory that is > > read-only. You should be able to kill mod_alloc_params entirely. > > > > Our other memory allocators don't expose kasan - why does this one? > > Again, that looks wrong. > > > > I would like to see something much simpler (that doesn't encode awkward > > assumptions from module and bpf!): please look at the code I sketched > > out below and tell me why this interface won't work - or if it can, go > > with _that_. > > I think we need to align the goal here. PS: we did align the goal about > 6 months ago when I proposed the execmem_alloc() set. Yeah, I know what it feels like to get yanked around by different reviewers who see and join in the discussion at different times. Sorry :) Having a rational/design doc in the header file would help a _lot_. > My current goal is to build an allocator for all the use cases of modules, > text, data, rodata, etc. Then the same allocator can be used by other > dynamic kernel text: bpf, ftrace, kprobe, bcachefs, etc. This must have been part of the previous discussion since you started with execmem_alloc(), but I did scan through that and I'm still not seeing the justification for why this needs to handle non-text allocations. If I had to hazard a guess it would be because of tglx wanting to solve tlb fragmentation for modules, but to me that doesn't sound like a good enough reason and it looks like we're ending up with a messy "try to be all things to all people" interface as a result :/ Do we _really_ need that? Mike was just saying at LSF that direct map fragmentation turned out to be a non issue for performance for non-text, so maybe we don't. Also - module_memory_fill_type(), module_memory_invalidate_type() - I know these are for BPF, but could you explain why we need them in the external interface here? Could they perhaps be small helpers in the bpf code that use something like jitalloc_update()? > OTOH, jit_alloc (as-is) solves a subset of the problem (only the text). > We can probably use it (with some updates) instead of the vmap_area > based allocator. But I guess we need to align the direction first. Let's see if we can get tglx to chime in again, since he was pretty vocal in the last discussion. It's also good practice to try to summarize all the relevant "whys" from the discussion that went into a feature and put that in at least the commit message - or even better, header file comments. Also, organization: the module code is a huge mess, let's _please_ do split this out and think about organization a bit more, not add to the mess :) Cheers, Kent
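The "kmap_local() variant" idea at the end can also be sketched (illustrative only, not from the thread): write through a short-lived writable alias of the backing page, so the executable mapping itself never becomes writable - essentially what x86's text_poke() does with its temporary mapping. Here vmap() stands in for the cheaper helper being asked for, and the sketch only handles writes within a single page:

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/cacheflush.h>

static int jit_write_via_alias(void *dst, const void *src, size_t len)
{
	struct page *page = vmalloc_to_page(dst);
	void *alias;

	if (WARN_ON(offset_in_page(dst) + len > PAGE_SIZE))
		return -EINVAL;		/* sketch: no cross-page writes */

	/* Fresh writable mapping of the same physical page. */
	alias = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
	if (!alias)
		return -ENOMEM;

	memcpy(alias + offset_in_page(dst), src, len);
	vunmap(alias);

	flush_icache_range((unsigned long)dst, (unsigned long)dst + len);
	return 0;
}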
On Fri, May 26, 2023 at 4:39 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > [...] > > But it should be an internal implementation detail, I don't think we > want the external interface tied to vmalloc - > > > These two APIs allow the core code work with all archs. They won't break > > sub-page allocations. (not all archs will have sub-page allocations) > > So yes, text_poke() doesn't work on all architectures, and we need a > fallback. > > But if the fallback needs to go the unprotect/protect route, I think we > need to put that in the helper, and not expose it. Part of what people > are wanting is to limit or eliminate pages that are RWX, so we > definitely shouldn't be creating new interfaces to flipping page > permissions: that should be pushed down to as low a level as possible. > > E.g., with my jitalloc_update(), a more complete version would look like > > void jitalloc_update(void *dst, void *src, size_t len) > { > if (text_poke_available) { > text_poke(...); > } else { > unprotect(); > memcpy(); > protect(); > } > } I think this doesn't work for sub page allocation? > > See? We provide a simpler, safer interface, and this also lets us handle > multiple threads racing to update/flip permissions on the same page in a > single place (e.g. with sticking a lock somewhere in the jitalloc > structures). [...] > > This must have been part of the previous discussion since you started > with execmem_alloc(), but I did scan through that and I'm still not > seeing the justification for why this needs to handle non-text > allocations. > > If I had to hazard a guess it would be because of tglx wanting to solve > tlb fragmentation for modules, but to me that doesn't sound like a good > enough reason and it looks like we're ending up with a messy "try to be > all things to all people" interface as a result :/ > > Do we _really_ need that? > > Mike was just saying at LSF that direct map fragmentation turned out to > be a non issue for performance for non-text, so maybe we don't. At the end of all this, we will have modules running from huge pages, data and text. It will give significant performance benefit when some key driver cannot be compiled into the kernel. > > Also - module_memory_fill_type(), module_memory_invalidate_type() - I > know these are for BPF, but could you explain why we need them in the > external interface here? Could they perhaps be small helpers in the bpf > code that use something like jitalloc_update()? These are used by all users, not just BPF. 1/3 uses them in kernel/module/main.c. I didn't use them in 3/3 as it is arch code, but I can use them instead of text_poke_* (and that is probably better code style). As I am writing the email, I think I should use it in ftrace.c (2/3) to avoid the __weak function. > > > OTOH, jit_alloc (as-is) solves a subset of the problem (only the text). > > We can probably use it (with some updates) instead of the vmap_area > > based allocator. But I guess we need to align the direction first. > > Let's see if we can get tglx to chime in again, since he was pretty > vocal in the last discussion. > > It's also good practice to try to summarize all the relevant "whys" from > the discussion that went into a feature and put that in at least the > commit message - or even better, header file comments. > > Also, organization: the module code is a huge mess, let's _please_ do > split this out and think about organization a bit more, not add to the > mess :) I don't really think module code is very messy at the moment. 
If necessary, we can put the module_alloc_type-related code in a separate file.

Thanks,
Song
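A sketch of what that arch-side wiring might look like (hypothetical, since patch 3/3 is not shown in this thread): text_poke_copy() and text_poke_set() are existing x86 primitives, and INT3 (0xcc) is what BPF already uses as its illegal-instruction filler on x86.

#include <asm/text-patching.h>

static void *x86_mod_text_fill(void *dst, const void *src, size_t len)
{
	return text_poke_copy(dst, src, len);
}

static void *x86_mod_text_invalidate(void *ptr, size_t len)
{
	return text_poke_set(ptr, INT3_INSN_OPCODE, len);
}

/* ...and in the MOD_TEXT allocator's mod_alloc_params:
 *	.fill       = x86_mod_text_fill,
 *	.invalidate = x86_mod_text_invalidate,
 * so module_memory_fill_type()/module_memory_invalidate_type() patch
 * text safely without common code needing a __weak arch hook.
 */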
On Fri, May 26, 2023 at 05:03:18PM -0700, Song Liu wrote: > On Fri, May 26, 2023 at 4:39 PM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > [...] > > > > > But it should be an internal implementation detail, I don't think we > > want the external interface tied to vmalloc - > > > > > These two APIs allow the core code work with all archs. They won't break > > > sub-page allocations. (not all archs will have sub-page allocations) > > > > So yes, text_poke() doesn't work on all architectures, and we need a > > fallback. > > > > But if the fallback needs to go the unprotect/protect route, I think we > > need to put that in the helper, and not expose it. Part of what people > > are wanting is to limit or eliminate pages that are RWX, so we > > definitely shouldn't be creating new interfaces to flipping page > > permissions: that should be pushed down to as low a level as possible. > > > > E.g., with my jitalloc_update(), a more complete version would look like > > > > void jitalloc_update(void *dst, void *src, size_t len) > > { > > if (text_poke_available) { > > text_poke(...); > > } else { > > unprotect(); > > memcpy(); > > protect(); > > } > > } > > I think this doesn't work for sub page allocation? Perhaps I elided too much - it does if you add a single lock. You can't do that if it's not a common helper. > At the end of all this, we will have modules running from huge pages, data > and text. It will give significant performance benefit when some key driver > cannot be compiled into the kernel. Yeah, I've seen the numbers for the perf impact of running as a module due to the extra TLB overhead - but Mike's recent data was showing that this doesn't matter nearly as much as data as it does for text. > > Also - module_memory_fill_type(), module_memory_invalidate_type() - I > > know these are for BPF, but could you explain why we need them in the > > external interface here? Could they perhaps be small helpers in the bpf > > code that use something like jitalloc_update()? > > These are used by all users, not just BPF. 1/3 uses them in > kernel/module/main.c. I didn't use them in 3/3 as it is arch code, but I can > use them instead of text_poke_* (and that is probably better code style). > As I am writing the email, I think I should use it in ftrace.c (2/3) to avoid > the __weak function. Ok. Could we make it clearer why they're needed and what they're for? I know bpf fills in invalid instructions initially; I didn't read through enough code to find the "why", and let's document that to save other people the same effort. And do they really need to be callbacks in mod_alloc_params...? Do we need the other things in mod_alloc_params/vmalloc_params? - your granularity field says it's for specifying PAGE/PMD size: we definitely do not want that. We've had way too much code along the lines of "implement hugepages for x", we need to stop doing that. It's an internal mm/ detail. - vmalloc_params: we need gfp_t and pgprot_t, but those should be normal arguments. start/end/vm_flags? They seem like historical module baggage, can we get rid of them? > > > OTOH, jit_alloc (as-is) solves a subset of the problem (only the text). > > > We can probably use it (with some updates) instead of the vmap_area > > > based allocator. But I guess we need to align the direction first. > > > > Let's see if we can get tglx to chime in again, since he was pretty > > vocal in the last discussion. 
> > > > It's also good practice to try to summarize all the relevant "whys" from > > the discussion that went into a feature and put that in at least the > > commit message - or even better, header file comments. > > > > Also, organization: the module code is a huge mess, let's _please_ do > > split this out and think about organization a bit more, not add to the > > mess :) > > I don't really think module code is very messy at the moment. If > necessary, we can put the module_alloc_type-related code in a > separate file. Hey, it's been organized since I last looked at it :) But I think this belongs in mm/, not module code.
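Spelled out, the "common helper with a single lock" would be something like the following sketch, sitting in Kent's mm/jitalloc.c (which already pulls in <linux/mutex.h> and <linux/set_memory.h>). CONFIG_ARCH_HAS_TEXT_POKE is a made-up gate; text_poke_copy() is x86's primitive:

static DEFINE_MUTEX(jit_update_lock);

void jitalloc_update(void *dst, void *src, size_t len)
{
#ifdef CONFIG_ARCH_HAS_TEXT_POKE	/* hypothetical */
	/* Patches via a temporary mapping; no permission flipping. */
	text_poke_copy(dst, src, len);
#else
	unsigned long start = (unsigned long)dst & PAGE_MASK;
	int npages = DIV_ROUND_UP(offset_in_page(dst) + len, PAGE_SIZE);

	/*
	 * The single lock is what makes this safe for sub-page
	 * allocations: two updates touching objects on the same page
	 * cannot interleave their unprotect/copy/protect windows.
	 */
	mutex_lock(&jit_update_lock);
	set_memory_rw(start, npages);
	memcpy(dst, src, len);
	set_memory_rox(start, npages);
	mutex_unlock(&jit_update_lock);
#endif
}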
On Fri, May 26, 2023 at 8:20 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > [...] > > > void jitalloc_update(void *dst, void *src, size_t len) > > > { > > > if (text_poke_available) { > > > text_poke(...); > > > } else { > > > unprotect(); > > > memcpy(); > > > protect(); > > > } > > > } > > > > I think this doesn't work for sub page allocation? > > Perhaps I elided too much - it does if you add a single lock. You can't > do that if it's not a common helper. Actually, sub page allocation is not the problem. The problem is with unprotect() and protect(), as they require global TLB flush. > > > At the end of all this, we will have modules running from huge pages, data > > and text. It will give significant performance benefit when some key driver > > cannot be compiled into the kernel. > > Yeah, I've seen the numbers for the perf impact of running as a module > due to the extra TLB overhead - but Mike's recent data was showing that > this doesn't matter nearly as much as data as it does for text. > > > > Also - module_memory_fill_type(), module_memory_invalidate_type() - I > > > know these are for BPF, but could you explain why we need them in the > > > external interface here? Could they perhaps be small helpers in the bpf > > > code that use something like jitalloc_update()? > > > > These are used by all users, not just BPF. 1/3 uses them in > > kernel/module/main.c. I didn't use them in 3/3 as it is arch code, but I can > > use them instead of text_poke_* (and that is probably better code style). > > As I am writing the email, I think I should use it in ftrace.c (2/3) to avoid > > the __weak function. > > Ok. Could we make it clearer why they're needed and what they're for? > > I know bpf fills in invalid instructions initially; I didn't read > through enough code to find the "why", and let's document that to save > other people the same effort. > > And do they really need to be callbacks in mod_alloc_params...? If we want to call them from common code, they will either be callback or some __weak functions. > > Do we need the other things in mod_alloc_params/vmalloc_params? > - your granularity field says it's for specifying PAGE/PMD size: we > definitely do not want that. We've had way too much code along the > lines of "implement hugepages for x", we need to stop doing that. > It's an internal mm/ detail. We don't need "granularity" yet. We will need them with the binpack allocator. > - vmalloc_params: we need gfp_t and pgprot_t, but those should be > normal arguments. start/end/vm_flags? They seem like historical > module baggage, can we get rid of them? All these fields are needed by some of the module_alloc()s for different archs. I am not an expert of all the archs, so I cannot say whether some of them are not needed. [...] > > I don't really think module code is very messy at the moment. If > > necessary, we can put module_alloc_type related code in a > > separate file. > > Hey, it's been organized since I last looked at it :) But I tihink this > belongs in mm/, not module code. I don't have a strong preference for this (at the moment). If we agree it belongs to mm/, we sure can move it. Thanks, Song
diff --git a/include/linux/module.h b/include/linux/module.h index 9e56763dff81..948b8132a742 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -752,6 +752,8 @@ static inline bool is_livepatch_module(struct module *mod) void set_module_sig_enforced(void); +void __init module_allocator_init(void); + #else /* !CONFIG_MODULES... */ static inline struct module *__module_address(unsigned long addr) @@ -855,6 +857,10 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr) return ptr; } +static inline void __init module_allocator_init(void) +{ +} + #endif /* CONFIG_MODULES */ #ifdef CONFIG_SYSFS diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h index 03be088fb439..59c7114a7b65 100644 --- a/include/linux/moduleloader.h +++ b/include/linux/moduleloader.h @@ -32,6 +32,81 @@ void *module_alloc(unsigned long size); /* Free memory returned from module_alloc. */ void module_memfree(void *module_region); +#ifdef CONFIG_MODULES + +/* For mod_alloc_params.flags */ +enum mod_alloc_params_flags { + MOD_ALLOC_FALLBACK = (1 << 0), /* Fallback to module_alloc() */ + MOD_ALLOC_KASAN_MODULE_SHADOW = (1 << 1), /* Calls kasan_alloc_module_shadow() */ + MOD_ALLOC_KASAN_RESET_TAG = (1 << 2), /* Calls kasan_reset_tag() */ + MOD_ALLOC_SET_MEMORY = (1 << 3), /* The allocator calls set_memory_ on + * memory before returning it to the + * caller, so that the caller do not need + * to call set_memory_* again. This does + * not work for MOD_RO_AFTER_INIT. + */ +}; + +#define MOD_MAX_ADDR_SPACES 2 + +/** + * struct vmalloc_params - Parameters to call __vmalloc_node_range() + * @start: Address space range start + * @end: Address space range end + * @gfp_mask: The gfp_t mask used for this range + * @pgprot: The page protection for this range + * @vm_flags The vm_flag used for this range + */ +struct vmalloc_params { + unsigned long start; + unsigned long end; + gfp_t gfp_mask; + pgprot_t pgprot; + unsigned long vm_flags; +}; + +/** + * struct mod_alloc_params - Parameters for module allocation type + * @flags: Properties in mod_alloc_params_flags + * @granularity: The allocation granularity (PAGE/PMD) in bytes + * @alignment: The allocation alignment requirement + * @vmp: Parameters used to call vmalloc + * @fill: Function to fill allocated space. If NULL, use memcpy() + * @invalidate: Function to invalidate memory space. + * + * If @granularity > @alignment the allocation can reuse free space in + * previously allocated pages. If they are the same, then fresh pages + * have to be allocated. 
+ */ +struct mod_alloc_params { + unsigned int flags; + unsigned int granularity; + unsigned int alignment; + struct vmalloc_params vmp[MOD_MAX_ADDR_SPACES]; + void * (*fill)(void *dst, const void *src, size_t len); + void * (*invalidate)(void *ptr, size_t len); +}; + +struct mod_type_allocator { + struct mod_alloc_params params; +}; + +struct mod_allocators { + struct mod_type_allocator *types[MOD_MEM_NUM_TYPES]; +}; + +void *module_alloc_type(size_t size, enum mod_mem_type type); +void module_memfree_type(void *ptr, enum mod_mem_type type); +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type); +void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type); +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type); +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type); +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type); +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type); +void module_alloc_type_init(struct mod_allocators *allocators); + +#endif /* CONFIG_MODULES */ + /* Determines if the section name is an init section (that is only used during * module loading). */ diff --git a/init/main.c b/init/main.c index af50044deed5..e05228cabde8 100644 --- a/init/main.c +++ b/init/main.c @@ -936,6 +936,7 @@ asmlinkage __visible void __init __no_sanitize_address __noreturn start_kernel(v sort_main_extable(); trap_init(); mm_core_init(); + module_allocator_init(); poking_init(); ftrace_init(); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index d3f0a4825fa6..e4ec4be866cc 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -12,6 +12,7 @@ #include <linux/mutex.h> #include <linux/btf_ids.h> #include <linux/rcupdate_wait.h> +#include <linux/moduleloader.h> enum bpf_struct_ops_state { BPF_STRUCT_OPS_STATE_INIT, @@ -512,7 +513,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, err = st_ops->validate(kdata); if (err) goto reset_unlock; - set_memory_rox((long)st_map->image, 1); + module_memory_protect(st_map->image, PAGE_SIZE, MOD_TEXT); + /* Let bpf_link handle registration & unregistration. * * Pair with smp_load_acquire() during lookup_elem(). @@ -521,7 +523,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, goto unlock; } - set_memory_rox((long)st_map->image, 1); + module_memory_protect(st_map->image, PAGE_SIZE, MOD_TEXT); err = st_ops->reg(kdata); if (likely(!err)) { /* This refcnt increment on the map here after @@ -544,8 +546,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, * there was a race in registering the struct_ops (under the same name) to * a sub-system through different struct_ops's maps. 
*/ - set_memory_nx((long)st_map->image, 1); - set_memory_rw((long)st_map->image, 1); + module_memory_unprotect(st_map->image, PAGE_SIZE, MOD_TEXT); reset_unlock: bpf_struct_ops_map_put_progs(st_map); @@ -907,4 +908,3 @@ int bpf_struct_ops_link_create(union bpf_attr *attr) kfree(link); return err; } - diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 7421487422d4..4c989a8fe8b8 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -860,7 +860,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins GFP_KERNEL); if (!pack) return NULL; - pack->ptr = module_alloc(BPF_PROG_PACK_SIZE); + pack->ptr = module_alloc_type(BPF_PROG_PACK_SIZE, MOD_TEXT); if (!pack->ptr) { kfree(pack); return NULL; @@ -869,8 +869,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE); list_add_tail(&pack->list, &pack_list); - set_vm_flush_reset_perms(pack->ptr); - set_memory_rox((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE); + module_memory_protect(pack->ptr, BPF_PROG_PACK_SIZE, MOD_TEXT); return pack; } @@ -884,11 +883,10 @@ void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns) mutex_lock(&pack_mutex); if (size > BPF_PROG_PACK_SIZE) { size = round_up(size, PAGE_SIZE); - ptr = module_alloc(size); + ptr = module_alloc_type(size, MOD_TEXT); if (ptr) { bpf_fill_ill_insns(ptr, size); - set_vm_flush_reset_perms(ptr); - set_memory_rox((unsigned long)ptr, size / PAGE_SIZE); + module_memory_protect(ptr, size, MOD_TEXT); } goto out; } @@ -922,7 +920,8 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr) mutex_lock(&pack_mutex); if (hdr->size > BPF_PROG_PACK_SIZE) { - module_memfree(hdr); + module_memfree_type(hdr, MOD_TEXT); + goto out; } @@ -946,7 +945,8 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr) if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0, BPF_PROG_CHUNK_COUNT, 0) == 0) { list_del(&pack->list); - module_memfree(pack->ptr); + module_memfree_type(pack->ptr, MOD_TEXT); + kfree(pack); } out: @@ -997,12 +997,12 @@ void bpf_jit_uncharge_modmem(u32 size) void *__weak bpf_jit_alloc_exec(unsigned long size) { - return module_alloc(size); + return module_alloc_type(size, MOD_TEXT); } void __weak bpf_jit_free_exec(void *addr) { - module_memfree(addr); + module_memfree_type(addr, MOD_TEXT); } struct bpf_binary_header * diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index ac021bc43a66..fd2d46c9a295 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -13,6 +13,7 @@ #include <linux/bpf_verifier.h> #include <linux/bpf_lsm.h> #include <linux/delay.h> +#include <linux/moduleloader.h> /* dummy _ops. The verifier will operate on target program's ops. 
*/ const struct bpf_verifier_ops bpf_extension_verifier_ops = { @@ -440,7 +441,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut if (err < 0) goto out; - set_memory_rox((long)im->image, 1); + module_memory_protect(im->image, PAGE_SIZE, MOD_TEXT); WARN_ON(tr->cur_image && tr->selector == 0); WARN_ON(!tr->cur_image && tr->selector); @@ -462,8 +463,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut tr->fops->trampoline = 0; /* reset im->image memory attr for arch_prepare_bpf_trampoline */ - set_memory_nx((long)im->image, 1); - set_memory_rw((long)im->image, 1); + module_memory_unprotect(im->image, PAGE_SIZE, MOD_TEXT); goto again; } #endif diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 00e177de91cc..daf47da3c96e 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -113,17 +113,17 @@ enum kprobe_slot_state { void __weak *alloc_insn_page(void) { /* - * Use module_alloc() so this page is within +/- 2GB of where the + * Use module_alloc_type() so this page is within +/- 2GB of where the * kernel image and loaded module images reside. This is required * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return module_alloc(PAGE_SIZE); + return module_alloc_type(PAGE_SIZE, MOD_TEXT); } static void free_insn_page(void *page) { - module_memfree(page); + module_memfree_type(page, MOD_TEXT); } struct kprobe_insn_cache kprobe_insn_slots = { diff --git a/kernel/module/internal.h b/kernel/module/internal.h index dc7b0160c480..b2e136326c4c 100644 --- a/kernel/module/internal.h +++ b/kernel/module/internal.h @@ -12,6 +12,7 @@ #include <linux/mutex.h> #include <linux/rculist.h> #include <linux/rcupdate.h> +#include <linux/moduleloader.h> #include <linux/mm.h> #ifndef ARCH_SHF_SMALL @@ -392,3 +393,5 @@ static inline int same_magic(const char *amagic, const char *bmagic, bool has_cr return strcmp(amagic, bmagic) == 0; } #endif /* CONFIG_MODVERSIONS */ + +extern struct mod_allocators module_allocators; diff --git a/kernel/module/main.c b/kernel/module/main.c index ea7d0c7f3e60..0f9183f1ca9f 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1203,11 +1203,11 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type) mod_mem_type_is_core_data(type); } -static void *module_memory_alloc(unsigned int size, enum mod_mem_type type) +static void *module_memory_alloc(size_t size, enum mod_mem_type type) { if (mod_mem_use_vmalloc(type)) return vzalloc(size); - return module_alloc(size); + return module_alloc_type(size, type); } static void module_memory_free(void *ptr, enum mod_mem_type type) @@ -1215,7 +1215,7 @@ static void module_memory_free(void *ptr, enum mod_mem_type type) if (mod_mem_use_vmalloc(type)) vfree(ptr); else - module_memfree(ptr); + module_memfree_type(ptr, type); } static void free_mod_mem(struct module *mod) @@ -1609,6 +1609,201 @@ void * __weak module_alloc(unsigned long size) NUMA_NO_NODE, __builtin_return_address(0)); } +struct mod_allocators module_allocators; + +static struct mod_type_allocator default_mod_type_allocator = { + .params = { + .flags = MOD_ALLOC_FALLBACK, + }, +}; + +void __init __weak module_alloc_type_init(struct mod_allocators *allocators) +{ + for_each_mod_mem_type(type) + allocators->types[type] = &default_mod_type_allocator; +} + +static void module_memory_enable_protection(void *ptr, size_t len, enum mod_mem_type type) +{ + int npages = DIV_ROUND_UP(len, PAGE_SIZE); + + switch (type) { + case MOD_TEXT: + case MOD_INIT_TEXT: + 
set_memory_rox((unsigned long)ptr, npages); + break; + case MOD_DATA: + case MOD_INIT_DATA: + set_memory_nx((unsigned long)ptr, npages); + break; + case MOD_RODATA: + set_memory_nx((unsigned long)ptr, npages); + set_memory_ro((unsigned long)ptr, npages); + break; + case MOD_RO_AFTER_INIT: + set_memory_ro((unsigned long)ptr, npages); + break; + default: + WARN_ONCE(true, "Unknown mod_mem_type: %d\n", type); + break; + } +} + +static void module_memory_disable_protection(void *ptr, size_t len, enum mod_mem_type type) +{ + int npages = DIV_ROUND_UP(len, PAGE_SIZE); + + switch (type) { + case MOD_TEXT: + case MOD_INIT_TEXT: + set_memory_nx((unsigned long)ptr, npages); + set_memory_rw((unsigned long)ptr, npages); + break; + case MOD_RODATA: + case MOD_RO_AFTER_INIT: + set_memory_rw((unsigned long)ptr, npages); + break; + case MOD_DATA: + case MOD_INIT_DATA: + break; + default: + WARN_ONCE(true, "Unknown mod_mem_type: %d\n", type); + break; + } +} + +void *module_alloc_type(size_t size, enum mod_mem_type type) +{ + struct mod_type_allocator *allocator; + struct mod_alloc_params *params; + void *ptr = NULL; + int i; + + if (WARN_ON_ONCE(type >= MOD_MEM_NUM_TYPES)) + return NULL; + + allocator = module_allocators.types[type]; + params = &allocator->params; + + if (params->flags & MOD_ALLOC_FALLBACK) + return module_alloc(size); + + for (i = 0; i < MOD_MAX_ADDR_SPACES; i++) { + struct vmalloc_params *vmp = ¶ms->vmp[i]; + + if (vmp->start == vmp->end) + continue; + + ptr = __vmalloc_node_range(size, params->alignment, vmp->start, vmp->end, + vmp->gfp_mask, vmp->pgprot, vmp->vm_flags, + NUMA_NO_NODE, __builtin_return_address(0)); + if (!ptr) + continue; + + if (params->flags & MOD_ALLOC_KASAN_MODULE_SHADOW) { + if (ptr && kasan_alloc_module_shadow(ptr, size, vmp->gfp_mask)) { + vfree(ptr); + return NULL; + } + } + + /* + * VM_FLUSH_RESET_PERMS is still needed here. This is + * because "size" is not available in module_memfree_type + * at the moment, so we cannot undo set_memory_rox in + * module_memfree_type. Once a better allocator is used, + * we can manually undo set_memory_rox, and thus remove + * VM_FLUSH_RESET_PERMS. + */ + set_vm_flush_reset_perms(ptr); + + if (params->flags & MOD_ALLOC_SET_MEMORY) + module_memory_enable_protection(ptr, size, type); + + if (params->flags & MOD_ALLOC_KASAN_RESET_TAG) + return kasan_reset_tag(ptr); + return ptr; + } + return NULL; +} + +void module_memfree_type(void *ptr, enum mod_mem_type type) +{ + module_memfree(ptr); +} + +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type) +{ + struct mod_type_allocator *allocator; + struct mod_alloc_params *params; + + allocator = module_allocators.types[type]; + params = &allocator->params; + + if (params->fill) + params->fill(dst, src, len); + else + memcpy(dst, src, len); +} + +void module_memory_invalidate_type(void *dst, size_t len, enum mod_mem_type type) +{ + struct mod_type_allocator *allocator; + struct mod_alloc_params *params; + + allocator = module_allocators.types[type]; + params = &allocator->params; + + if (params->invalidate) + params->invalidate(dst, len); + else + memset(dst, 0, len); +} + +/* + * Protect memory allocated by module_alloc_type(). Called by users of + * module_alloc_type. This is a no-op with MOD_ALLOC_SET_MEMORY. 
+ */ +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type) +{ + struct mod_alloc_params *params = &module_allocators.types[type]->params; + + if (params->flags & MOD_ALLOC_SET_MEMORY) + return; + module_memory_enable_protection(ptr, len, type); +} + +/* + * Unprotect memory allocated by module_alloc_type(). Called by users of + * module_alloc_type. This is a no-op with MOD_ALLOC_SET_MEMORY. + */ +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type) +{ + struct mod_alloc_params *params = &module_allocators.types[type]->params; + + if (params->flags & MOD_ALLOC_SET_MEMORY) + return; + module_memory_disable_protection(ptr, len, type); +} + +/* + * Should only be used by arch code in cases where text_poke like + * solution is not ready yet + */ +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type) +{ + module_memory_enable_protection(ptr, len, type); +} + +/* + * Should only be used by arch code in cases where text_poke like + * solution is not ready yet + */ +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type) +{ + module_memory_disable_protection(ptr, len, type); +} + bool __weak module_init_section(const char *name) { return strstarts(name, ".init"); @@ -2241,7 +2436,7 @@ static int move_module(struct module *mod, struct load_info *info) t = type; goto out_enomem; } - memset(ptr, 0, mod->mem[type].size); + module_memory_invalidate_type(ptr, mod->mem[type].size, type); mod->mem[type].base = ptr; } @@ -2269,7 +2464,8 @@ static int move_module(struct module *mod, struct load_info *info) ret = -ENOEXEC; goto out_enomem; } - memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size); + + module_memory_fill_type(dest, (void *)shdr->sh_addr, shdr->sh_size, type); } /* * Update the userspace copy's ELF section address to point to @@ -2471,9 +2667,9 @@ static void do_free_init(struct work_struct *w) llist_for_each_safe(pos, n, list) { initfree = container_of(pos, struct mod_initfree, node); - module_memfree(initfree->init_text); - module_memfree(initfree->init_data); - module_memfree(initfree->init_rodata); + module_memfree_type(initfree->init_text, MOD_INIT_TEXT); + module_memfree_type(initfree->init_data, MOD_INIT_DATA); + module_memfree_type(initfree->init_rodata, MOD_INIT_RODATA); kfree(initfree); } } @@ -3268,3 +3464,8 @@ static int module_debugfs_init(void) } module_init(module_debugfs_init); #endif + +void __init module_allocator_init(void) +{ + module_alloc_type_init(&module_allocators); +} diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c index a2b656b4e3d2..65ff1b09dc84 100644 --- a/kernel/module/strict_rwx.c +++ b/kernel/module/strict_rwx.c @@ -16,6 +16,10 @@ static void module_set_memory(const struct module *mod, enum mod_mem_type type, { const struct module_memory *mod_mem = &mod->mem[type]; + /* The allocator already called set_memory_*, skip here. */ + if (module_allocators.types[type]->params.flags & MOD_ALLOC_SET_MEMORY) + return; + set_vm_flush_reset_perms(mod_mem->base); set_memory((unsigned long)mod_mem->base, mod_mem->size >> PAGE_SHIFT); }
Introduce memory-type-aware module_alloc_type(), which provides a unified allocator for all archs. This work was discussed in [1].

Each arch can configure the allocator to do the following:

1. Specify module_vaddr and module_end
2. Randomize the module start address for KASLR
3. Call kasan_alloc_module_shadow()
4. Call kasan_reset_tag()
5. Set preferred and secondary module address ranges

enum mod_alloc_params_flags controls the behavior of module_alloc_type(). Specifically: MOD_ALLOC_FALLBACK lets module_alloc_type() fall back to the existing module_alloc(); MOD_ALLOC_SET_MEMORY makes module_alloc_type() protect the memory before returning it to the caller.

A module_allocator_init() call is added to start_kernel() to initialize module_alloc_type.

Signed-off-by: Song Liu <song@kernel.org>

[1] https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/

---
 include/linux/module.h       |   6 +
 include/linux/moduleloader.h |  75 ++++++++++
 init/main.c                  |   1 +
 kernel/bpf/bpf_struct_ops.c  |  10 +-
 kernel/bpf/core.c            |  20 ++--
 kernel/bpf/trampoline.c      |   6 +-
 kernel/kprobes.c             |   6 +-
 kernel/module/internal.h     |   3 +
 kernel/module/main.c         | 217 +++++++++++++++++++++++++++++--
 kernel/module/strict_rwx.c   |   4 +
 10 files changed, 319 insertions(+), 29 deletions(-)
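Put together, the caller-visible flow for a text allocation under this series looks roughly like the sketch below (hypothetical caller; publish_text() is a made-up name). The fill/invalidate steps fall back to memcpy()/memset() when the arch registers no callbacks:

static int publish_text(void *insns, size_t len)
{
	void *image = module_alloc_type(len, MOD_TEXT);

	if (!image)
		return -ENOMEM;

	/* Pre-fill with invalid instructions (arch callback or memset). */
	module_memory_invalidate_type(image, len, MOD_TEXT);

	/* Copy the program in (arch callback, e.g. text_poke-based). */
	module_memory_fill_type(image, insns, len, MOD_TEXT);

	/* No-op if the allocator was configured with MOD_ALLOC_SET_MEMORY. */
	module_memory_protect(image, len, MOD_TEXT);

	/* ... use the text, then: */
	module_memfree_type(image, MOD_TEXT);
	return 0;
}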