Message ID | 20250226121537.752241-1-dongml2@chinatelecom.cn (mailing list archive)
---|---
State      | Changes Requested
Series     | [bpf-next,v2] add function metadata support
Hi Menglong,
kernel test robot noticed the following build warnings:
[auto build test WARNING on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/add-function-metadata-support/20250226-202312
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20250226121537.752241-1-dongml2%40chinatelecom.cn
patch subject: [PATCH bpf-next v2] add function metadata support
config: i386-kismet-CONFIG_CALL_PADDING-CONFIG_FUNCTION_METADATA-0-0 (https://download.01.org/0day-ci/archive/20250227/202502272339.DtueAali-lkp@intel.com/config)
reproduce: (https://download.01.org/0day-ci/archive/20250227/202502272339.DtueAali-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502272339.DtueAali-lkp@intel.com/
kismet warnings: (new ones prefixed by >>)
>> kismet: WARNING: unmet direct dependencies detected for CALL_PADDING when selected by FUNCTION_METADATA
WARNING: unmet direct dependencies detected for CALL_PADDING
Depends on [n]: CC_HAS_ENTRY_PADDING [=y] && OBJTOOL [=n]
Selected by [y]:
- FUNCTION_METADATA [=y]
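The warning above is about Kconfig, not C: FUNCTION_METADATA selects CALL_PADDING, but on this i386 config CALL_PADDING's own dependency (OBJTOOL) is not met, so the select fires anyway and kconfig complains. As a rough illustration only (this is not the author's fix, just the usual pattern for avoiding "unmet direct dependencies" warnings from a select), the selecting symbol can mirror the dependencies of the symbol it selects:

    # Hypothetical sketch: mirror CALL_PADDING's dependencies so the
    # select can never fire while they are unmet.
    config FUNCTION_METADATA
            bool "Per-function metadata storage support"
            default y
            depends on CC_HAS_ENTRY_PADDING && OBJTOOL
            select CALL_PADDING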
Hi Menglong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/add-function-metadata-support/20250226-202312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250226121537.752241-1-dongml2%40chinatelecom.cn
patch subject: [PATCH bpf-next v2] add function metadata support
config: x86_64-randconfig-r112-20250227 (https://download.01.org/0day-ci/archive/20250228/202502280004.QmU2zIb5-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250228/202502280004.QmU2zIb5-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502280004.QmU2zIb5-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/trace/kfunc_md.c:12:23: sparse: sparse: symbol 'kfunc_mds' redeclared with different type (different address spaces):
   kernel/trace/kfunc_md.c:12:23: sparse:    struct kfunc_md [noderef] __rcu *[addressable] [toplevel] kfunc_mds
   kernel/trace/kfunc_md.c: note: in included file:
   include/linux/kfunc_md.h:16:24: sparse: note: previously declared as:
   include/linux/kfunc_md.h:16:24: sparse:    struct kfunc_md *extern [addressable] [toplevel] kfunc_mds
>> kernel/trace/kfunc_md.c:186:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct kfunc_md *md @@ got struct kfunc_md [noderef] __rcu * @@
   kernel/trace/kfunc_md.c:186:20: sparse:    expected struct kfunc_md *md
   kernel/trace/kfunc_md.c:186:20: sparse:    got struct kfunc_md [noderef] __rcu *

vim +186 kernel/trace/kfunc_md.c

   169	
   170	/* Get a exist metadata by the function address, and NULL will be returned
   171	 * if not exist.
   172	 *
   173	 * NOTE: rcu lock should be held during reading the metadata, and
   174	 * kfunc_md_lock should be held if writing happens.
   175	 */
   176	struct kfunc_md *kfunc_md_find(void *ip)
   177	{
   178		struct kfunc_md *md;
   179		u32 index;
   180	
   181		if (kfunc_md_arch_exist(ip)) {
   182			index = kfunc_md_get_index(ip);
   183			if (WARN_ON_ONCE(index >= kfunc_md_count))
   184				return NULL;
   185	
 > 186			md = &kfunc_mds[index];
   187			return md;
   188		}
   189		return NULL;
   190	}
   191	EXPORT_SYMBOL_GPL(kfunc_md_find);
   192	
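The sparse warnings above come from two related issues: kfunc_mds is defined with __rcu in kfunc_md.c but declared without it in include/linux/kfunc_md.h, and kfunc_md_find() then indexes the __rcu pointer directly. A minimal sketch of one way to make sparse happy (an assumption on my part, not necessarily the fix the author will choose): annotate the extern declaration the same way and strip the address space with rcu_dereference() before indexing.

    /* include/linux/kfunc_md.h: match the definition's annotation */
    extern struct kfunc_md __rcu *kfunc_mds;

    /* kernel/trace/kfunc_md.c */
    struct kfunc_md *kfunc_md_find(void *ip)
    {
            struct kfunc_md *mds;
            u32 index;

            if (!kfunc_md_arch_exist(ip))
                    return NULL;

            index = kfunc_md_get_index(ip);
            if (WARN_ON_ONCE(index >= kfunc_md_count))
                    return NULL;

            /* rcu_dereference() drops the __rcu address space for sparse and
             * documents that callers must be in an RCU read-side section.
             */
            mds = rcu_dereference(kfunc_mds);
            return &mds[index];
    }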
On Wed, Feb 26, 2025 at 08:15:37PM +0800, Menglong Dong wrote:

> In x86, we need 5-bytes to prepend a "mov %eax xxx" insn, which can hold
> a 4-bytes index. So we have following logic:
>
> 1. use the head 5-bytes if CFI_CLANG is not enabled
> 2. use the tail 5-bytes if MITIGATION_CALL_DEPTH_TRACKING is not enabled
> 3. compile the kernel with extra 5-bytes padding if
>    MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled.

3) would result in 16+5 bytes padding, what does that do for alignment?

Functions should be 16 byte aligned.

Also, did you make sure all the code in arch/x86/kernel/alternative.c
still works? Because adding extra padding in the CFI_CLANG case moves
where the CFI bytes are emitted and all the CFI rewriting code goes
sideways.
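For readers following along, here is my reading of the case-3 layout Peter is asking about, assuming the default CONFIG_FUNCTION_PADDING_BYTES=16 and the KFUNC_MD_*_OFFSET definitions from the patch's arch/x86/include/asm/ftrace.h hunk; a sketch, not an authoritative byte map:

    /*
     * CFI_CLANG && CALL_THUNKS: 5 extra prefix bytes are emitted, so each
     * function is preceded by 21 bytes of padding instead of 16:
     *
     *   ip - 21 .. ip - 17:  b8 <u32 index>   "mov $index, %eax" (new)
     *   ip - 16 .. ip - 10:  CFI/FineIBT hash bytes (head 7 bytes)
     *   ip -  9 .. ip -  1:  call-depth tracking accounting (tail 9 bytes)
     *   ip +  0:             function entry, no longer on a 16-byte
     *                        boundary -- the alignment issue raised above
     */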
Hi Menglong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/add-function-metadata-support/20250226-202312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250226121537.752241-1-dongml2%40chinatelecom.cn
patch subject: [PATCH bpf-next v2] add function metadata support
config: i386-randconfig-062-20250227 (https://download.01.org/0day-ci/archive/20250228/202502280123.BNaCYT2A-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250228/202502280123.BNaCYT2A-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502280123.BNaCYT2A-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/trace/kfunc_md.c:12:23: sparse: sparse: symbol 'kfunc_mds' redeclared with different type (different address spaces):
   kernel/trace/kfunc_md.c:12:23: sparse:    struct kfunc_md [noderef] __rcu *[addressable] [toplevel] kfunc_mds
   kernel/trace/kfunc_md.c: note: in included file:
   include/linux/kfunc_md.h:16:24: sparse: note: previously declared as:
   include/linux/kfunc_md.h:16:24: sparse:    struct kfunc_md *extern [addressable] [toplevel] kfunc_mds
   kernel/trace/kfunc_md.c:186:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct kfunc_md *md @@ got struct kfunc_md [noderef] __rcu * @@
   kernel/trace/kfunc_md.c:186:20: sparse:    expected struct kfunc_md *md
   kernel/trace/kfunc_md.c:186:20: sparse:    got struct kfunc_md [noderef] __rcu *
>> kernel/trace/kfunc_md.c:98:25: sparse: sparse: non size-preserving pointer to integer cast

vim +98 kernel/trace/kfunc_md.c

    10	
    11	static u32 kfunc_md_count = ENTRIES_PER_PAGE, kfunc_md_used;
  > 12	struct kfunc_md __rcu *kfunc_mds;
    13	EXPORT_SYMBOL_GPL(kfunc_mds);
    14	
    15	static DEFINE_MUTEX(kfunc_md_mutex);
    16	
    17	
    18	void kfunc_md_unlock(void)
    19	{
    20		mutex_unlock(&kfunc_md_mutex);
    21	}
    22	EXPORT_SYMBOL_GPL(kfunc_md_unlock);
    23	
    24	void kfunc_md_lock(void)
    25	{
    26		mutex_lock(&kfunc_md_mutex);
    27	}
    28	EXPORT_SYMBOL_GPL(kfunc_md_lock);
    29	
    30	static u32 kfunc_md_get_index(void *ip)
    31	{
    32		return *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);
    33	}
    34	
    35	static void kfunc_md_init(struct kfunc_md *mds, u32 start, u32 end)
    36	{
    37		u32 i;
    38	
    39		for (i = start; i < end; i++)
    40			mds[i].users = 0;
    41	}
    42	
    43	static int kfunc_md_page_order(void)
    44	{
    45		return fls(DIV_ROUND_UP(kfunc_md_count, ENTRIES_PER_PAGE)) - 1;
    46	}
    47	
    48	/* Get next usable function metadata. On success, return the usable
    49	 * kfunc_md and store the index of it to *index. If no usable kfunc_md is
    50	 * found in kfunc_mds, a larger array will be allocated.
    51	 */
    52	static struct kfunc_md *kfunc_md_get_next(u32 *index)
    53	{
    54		struct kfunc_md *new_mds, *mds;
    55		u32 i, order;
    56	
    57		mds = rcu_dereference(kfunc_mds);
    58		if (mds == NULL) {
    59			order = kfunc_md_page_order();
    60			new_mds = (void *)__get_free_pages(GFP_KERNEL, order);
    61			if (!new_mds)
    62				return NULL;
    63			kfunc_md_init(new_mds, 0, kfunc_md_count);
    64			/* The first time to initialize kfunc_mds, so it is not
    65			 * used anywhere yet, and we can update it directly.
    66			 */
    67			rcu_assign_pointer(kfunc_mds, new_mds);
    68			mds = new_mds;
    69		}
    70	
    71		if (likely(kfunc_md_used < kfunc_md_count)) {
    72			/* maybe we can manage the used function metadata entry
    73			 * with a bit map ?
    74			 */
    75			for (i = 0; i < kfunc_md_count; i++) {
    76				if (!mds[i].users) {
    77					kfunc_md_used++;
    78					*index = i;
    79					mds[i].users++;
    80					return mds + i;
    81				}
    82			}
    83		}
    84	
    85		order = kfunc_md_page_order();
    86		/* no available function metadata, so allocate a bigger function
    87		 * metadata array.
    88		 */
    89		new_mds = (void *)__get_free_pages(GFP_KERNEL, order + 1);
    90		if (!new_mds)
    91			return NULL;
    92	
    93		memcpy(new_mds, mds, kfunc_md_count * sizeof(*new_mds));
    94		kfunc_md_init(new_mds, kfunc_md_count, kfunc_md_count * 2);
    95	
    96		rcu_assign_pointer(kfunc_mds, new_mds);
    97		synchronize_rcu();
  > 98		free_pages((u64)mds, order);
    99	
   100		mds = new_mds + kfunc_md_count;
   101		*index = kfunc_md_count;
   102		kfunc_md_count <<= 1;
   103		kfunc_md_used++;
   104		mds->users++;
   105	
   106		return mds;
   107	}
   108	
Hi Menglong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/add-function-metadata-support/20250226-202312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250226121537.752241-1-dongml2%40chinatelecom.cn
patch subject: [PATCH bpf-next v2] add function metadata support
config: i386-buildonly-randconfig-002-20250227 (https://download.01.org/0day-ci/archive/20250228/202502280842.nI3PwNwz-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250228/202502280842.nI3PwNwz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502280842.nI3PwNwz-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/trace/kfunc_md.c: In function 'kfunc_md_get_next':
>> kernel/trace/kfunc_md.c:98:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      98 |         free_pages((u64)mds, order);
         |                    ^

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for CALL_PADDING
   Depends on [n]: CC_HAS_ENTRY_PADDING [=y] && OBJTOOL [=n]
   Selected by [y]:
   - FUNCTION_METADATA [=y]

vim +98 kernel/trace/kfunc_md.c

    47	
    48	/* Get next usable function metadata. On success, return the usable
    49	 * kfunc_md and store the index of it to *index. If no usable kfunc_md is
    50	 * found in kfunc_mds, a larger array will be allocated.
    51	 */
    52	static struct kfunc_md *kfunc_md_get_next(u32 *index)
    53	{
    54		struct kfunc_md *new_mds, *mds;
    55		u32 i, order;
    56	
    57		mds = rcu_dereference(kfunc_mds);
    58		if (mds == NULL) {
    59			order = kfunc_md_page_order();
    60			new_mds = (void *)__get_free_pages(GFP_KERNEL, order);
    61			if (!new_mds)
    62				return NULL;
    63			kfunc_md_init(new_mds, 0, kfunc_md_count);
    64			/* The first time to initialize kfunc_mds, so it is not
    65			 * used anywhere yet, and we can update it directly.
    66			 */
    67			rcu_assign_pointer(kfunc_mds, new_mds);
    68			mds = new_mds;
    69		}
    70	
    71		if (likely(kfunc_md_used < kfunc_md_count)) {
    72			/* maybe we can manage the used function metadata entry
    73			 * with a bit map ?
    74			 */
    75			for (i = 0; i < kfunc_md_count; i++) {
    76				if (!mds[i].users) {
    77					kfunc_md_used++;
    78					*index = i;
    79					mds[i].users++;
    80					return mds + i;
    81				}
    82			}
    83		}
    84	
    85		order = kfunc_md_page_order();
    86		/* no available function metadata, so allocate a bigger function
    87		 * metadata array.
    88		 */
    89		new_mds = (void *)__get_free_pages(GFP_KERNEL, order + 1);
    90		if (!new_mds)
    91			return NULL;
    92	
    93		memcpy(new_mds, mds, kfunc_md_count * sizeof(*new_mds));
    94		kfunc_md_init(new_mds, kfunc_md_count, kfunc_md_count * 2);
    95	
    96		rcu_assign_pointer(kfunc_mds, new_mds);
    97		synchronize_rcu();
  > 98		free_pages((u64)mds, order);
    99	
   100		mds = new_mds + kfunc_md_count;
   101		*index = kfunc_md_count;
   102		kfunc_md_count <<= 1;
   103		kfunc_md_used++;
   104		mds->users++;
   105	
   106		return mds;
   107	}
   108	
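Both reports flag the same cast: u64 is wider than a pointer on i386, so casting a pointer to u64 is not size-preserving and gcc warns about it. The conventional kernel spelling, shown below as an assumed fix rather than something taken from a later revision, casts through unsigned long, which is pointer-sized on every architecture:

    	rcu_assign_pointer(kfunc_mds, new_mds);
    	synchronize_rcu();
    	/* unsigned long is always pointer-sized, so the cast is
    	 * size-preserving on both 32-bit and 64-bit builds.
    	 */
    	free_pages((unsigned long)mds, order);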
On Fri, Feb 28, 2025 at 12:53 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 26, 2025 at 08:15:37PM +0800, Menglong Dong wrote:
>
> > In x86, we need 5-bytes to prepend a "mov %eax xxx" insn, which can hold
> > a 4-bytes index. So we have following logic:
> >
> > 1. use the head 5-bytes if CFI_CLANG is not enabled
> > 2. use the tail 5-bytes if MITIGATION_CALL_DEPTH_TRACKING is not enabled
> > 3. compile the kernel with extra 5-bytes padding if
> >    MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled.
>
> 3) would result in 16+5 bytes padding, what does that do for alignment?

Hi Peter, thank you for your reply~

Yeah, it will make the function not 16 byte aligned, and this is the
point that I hesitate about the most. In this link, I tested the
performance with 16+5 bytes padding, and it seems that the performance
is not impacted:

https://lore.kernel.org/bpf/CADxym3af+CU5Mx8myB8UowdXSc3wJOqWyH4oyq+eXKahXBTXyg@mail.gmail.com/

However, it may have other effects if the function is unaligned. I
don't know either. :/ Do you have any advice here? Such as, we'd better
make the padding 32 bytes instead in case 3 :/

> Functions should be 16 byte aligned.
>
> Also, did you make sure all the code in arch/x86/kernel/alternative.c
> still works? Because adding extra padding in the CFI_CLANG case moves
> where the CFI bytes are emitted and all the CFI rewriting code goes
> sideways.

I tested it a little by enabling CFI_CLANG and the extra 5-bytes
padding. It works fine, as mostly CFI_CLANG use
CONFIG_FUNCTION_PADDING_BYTES to find the tags. I'll
do more testing on CFI_CLANG to make sure everything goes
well.

Thanks!
Menglong Dong

>
On Fri, Feb 28, 2025 at 05:53:07PM +0800, Menglong Dong wrote:

> I tested it a little by enabling CFI_CLANG and the extra 5-bytes
> padding. It works fine, as mostly CFI_CLANG use
> CONFIG_FUNCTION_PADDING_BYTES to find the tags. I'll
> do more testing on CFI_CLANG to make sure everything goes
> well.

I don't think you understand; please read:

  arch/x86/kernel/alternative.c:__apply_fineibt()

and all the code involved with patching FineIBT. I think you'll find it
very broken if you change anything here.

Can you post an actual function preamble from a kernel with
CONFIG_FINEIBT=y with your changes on?

Ex.

$ objdump -wdr build/kernel/futex/core.o

Disassembly of section .text:

0000000000000000 <__cfi_futex_hash>:
   0:	b9 93 0c f9 ad       	mov    $0xadf90c93,%ecx

0000000000000005 <.Ltmp0>:
   5:	90	nop
   6:	90	nop
   7:	90	nop
   8:	90	nop
   9:	90	nop
   a:	90	nop
   b:	90	nop
   c:	90	nop
   d:	90	nop
   e:	90	nop
   f:	90	nop

0000000000000010 <futex_hash>:
  10:	f3 0f 1e fa          	endbr64
  14:	e8 00 00 00 00       	call   19 <futex_hash+0x9>	15: R_X86_64_PLT32	__fentry__-0x4
  19:	8b 47 10             	mov    0x10(%rdi),%eax

Any change to the layout here *WILL* break the FineIBT code.

If you want to test, make sure your build has FINEIBT=y and boot on an
Intel CPU that has CET-IBT (alderlake and later).
On Fri, Feb 28, 2025 at 11:26:46AM +0100, Peter Zijlstra wrote:
> On Fri, Feb 28, 2025 at 05:53:07PM +0800, Menglong Dong wrote:
>
> > I tested it a little by enabling CFI_CLANG and the extra 5-bytes
> > padding. It works fine, as mostly CFI_CLANG use
> > CONFIG_FUNCTION_PADDING_BYTES to find the tags. I'll
> > do more testing on CFI_CLANG to make sure everything goes
> > well.
>
> I don't think you understand; please read:
>
>   arch/x86/kernel/alternative.c:__apply_fineibt()
>
> and all the code involved with patching FineIBT. I think you'll find it
> very broken if you change anything here.
>
> Can you post an actual function preamble from a kernel with
> CONFIG_FINEIBT=y with your changes on?
>
> Ex.
>
> $ objdump -wdr build/kernel/futex/core.o
>
> Disassembly of section .text:
>
> 0000000000000000 <__cfi_futex_hash>:
>    0:	b9 93 0c f9 ad       	mov    $0xadf90c93,%ecx
>
> 0000000000000005 <.Ltmp0>:
>    5:	90	nop
>    6:	90	nop
>    7:	90	nop
>    8:	90	nop
>    9:	90	nop
>    a:	90	nop
>    b:	90	nop
>    c:	90	nop
>    d:	90	nop
>    e:	90	nop
>    f:	90	nop
>
> 0000000000000010 <futex_hash>:
>   10:	f3 0f 1e fa          	endbr64
>   14:	e8 00 00 00 00       	call   19 <futex_hash+0x9>	15: R_X86_64_PLT32	__fentry__-0x4
>   19:	8b 47 10             	mov    0x10(%rdi),%eax
>
> Any change to the layout here *WILL* break the FineIBT code.
>
> If you want to test, make sure your build has FINEIBT=y and boot on an
> Intel CPU that has CET-IBT (alderlake and later).

Oh, wait, not true, tigerlake also has IBT.
On Fri, Feb 28, 2025 at 6:26 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Feb 28, 2025 at 05:53:07PM +0800, Menglong Dong wrote:
>
> > I tested it a little by enabling CFI_CLANG and the extra 5-bytes
> > padding. It works fine, as mostly CFI_CLANG use
> > CONFIG_FUNCTION_PADDING_BYTES to find the tags. I'll
> > do more testing on CFI_CLANG to make sure everything goes
> > well.
>
> I don't think you understand; please read:
>
>   arch/x86/kernel/alternative.c:__apply_fineibt()
>
> and all the code involved with patching FineIBT. I think you'll find it
> very broken if you change anything here.
>
> Can you post an actual function preamble from a kernel with
> CONFIG_FINEIBT=y with your changes on?
>
> Ex.
>
> $ objdump -wdr build/kernel/futex/core.o
>
> Disassembly of section .text:
>
> 0000000000000000 <__cfi_futex_hash>:
>    0:	b9 93 0c f9 ad       	mov    $0xadf90c93,%ecx
>
> 0000000000000005 <.Ltmp0>:
>    5:	90	nop
>    6:	90	nop
>    7:	90	nop
>    8:	90	nop
>    9:	90	nop
>    a:	90	nop
>    b:	90	nop
>    c:	90	nop
>    d:	90	nop
>    e:	90	nop
>    f:	90	nop
>
> 0000000000000010 <futex_hash>:
>   10:	f3 0f 1e fa          	endbr64
>   14:	e8 00 00 00 00       	call   19 <futex_hash+0x9>	15: R_X86_64_PLT32	__fentry__-0x4
>   19:	8b 47 10             	mov    0x10(%rdi),%eax
>
> Any change to the layout here *WILL* break the FineIBT code.
>
> If you want to test, make sure your build has FINEIBT=y and boot on an
> Intel CPU that has CET-IBT (alderlake and later).

Yeah, I understand now. As I see the definition of CFI_PRE_PADDING, I
see that things are not as simple as I thought, as the layout can be
different with different configs. I'll dig deeper into this part.
Thanks a lot for clearing that up for me.

I'll test CFI_CLANG and FINEIBT together with this function later.

Thanks!
Menglong Dong
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c997b27b7da1..b4c2b5566a58 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1536,6 +1536,21 @@ config NODES_SHIFT
 	  Specify the maximum number of NUMA Nodes available on the target
 	  system.  Increases memory reserved to accommodate various tables.
 
+config FUNCTION_METADATA
+	bool "Per-function metadata storage support"
+	default y
+	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE if !FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
+	depends on !CFI_CLANG
+	help
+	  Support per-function metadata storage for kernel functions, and
+	  get the metadata of the function by its address with almost no
+	  overhead.
+
+	  The index of the metadata will be stored in the function padding,
+	  which will consume 4-bytes. If FUNCTION_ALIGNMENT_8B is enabled,
+	  extra 8-bytes function padding will be reserved during compiling.
+	  Otherwise, only extra 4-bytes function padding is needed.
+
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 358c68565bfd..d5a124c8ded2 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -140,12 +140,31 @@ endif
 
 CHECKFLAGS	+= -D__aarch64__
 
+ifeq ($(CONFIG_FUNCTION_METADATA),y)
+  ifeq ($(CONFIG_FUNCTION_ALIGNMENT_8B),y)
+    __padding_nops := 2
+  else
+    __padding_nops := 1
+  endif
+else
+  __padding_nops := 0
+endif
+
 ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS),y)
+  __padding_nops := $(shell echo $(__padding_nops) + 2 | bc)
   KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
-  CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2
+  CC_FLAGS_FTRACE := -fpatchable-function-entry=$(shell echo $(__padding_nops) + 2 | bc),$(__padding_nops)
 else ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
+  CC_FLAGS_FTRACE := -fpatchable-function-entry=$(shell echo $(__padding_nops) + 2 | bc),$(__padding_nops)
   KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
-  CC_FLAGS_FTRACE := -fpatchable-function-entry=2
+else ifeq ($(CONFIG_FUNCTION_METADATA),y)
+  CC_FLAGS_FTRACE += -fpatchable-function-entry=$(__padding_nops),$(__padding_nops)
+  ifneq ($(CONFIG_FUNCTION_TRACER),y)
+    KBUILD_CFLAGS += $(CC_FLAGS_FTRACE)
+    # some file need to remove this cflag even when CONFIG_FUNCTION_TRACER
+    # is not enabled, so we need to export it here
+    export CC_FLAGS_FTRACE
+  endif
 endif
 
 ifeq ($(CONFIG_KASAN_SW_TAGS), y)
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index bfe3ce9df197..aa3eaa91bf82 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -24,6 +24,16 @@
 #define FTRACE_PLT_IDX		0
 #define NR_FTRACE_PLTS		1
 
+#ifdef CONFIG_FUNCTION_METADATA
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+#define KFUNC_MD_DATA_OFFSET	(AARCH64_INSN_SIZE * 3)
+#else
+#define KFUNC_MD_DATA_OFFSET	AARCH64_INSN_SIZE
+#endif
+#define KFUNC_MD_INSN_SIZE	AARCH64_INSN_SIZE
+#define KFUNC_MD_INSN_OFFSET	KFUNC_MD_DATA_OFFSET
+#endif
+
 /*
  * Currently, gcc tends to save the link register after the local variables
  * on the stack. This causes the max stack tracer to report the function
@@ -216,6 +226,30 @@ static inline bool arch_syscall_match_sym_name(const char *sym,
 	 */
 	return !strcmp(sym + 8, name);
 }
+
+#ifdef CONFIG_FUNCTION_METADATA
+#include <asm/text-patching.h>
+
+static inline bool kfunc_md_arch_exist(void *ip)
+{
+	return !aarch64_insn_is_nop(*(u32 *)(ip - KFUNC_MD_INSN_OFFSET));
+}
+
+static inline void kfunc_md_arch_pretend(u8 *insn, u32 index)
+{
+	*(u32 *)insn = index;
+}
+
+static inline void kfunc_md_arch_nops(u8 *insn)
+{
+	*(u32 *)insn = aarch64_insn_gen_nop();
+}
+
+static inline int kfunc_md_arch_poke(void *ip, u8 *insn)
+{
+	return aarch64_insn_patch_text_nosync(ip, *(u32 *)insn);
+}
+#endif
 #endif /* ifndef __ASSEMBLY__ */
 
 #ifndef __ASSEMBLY__
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index 5a890714ee2e..d829651d895b 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -88,8 +88,10 @@ unsigned long ftrace_call_adjust(unsigned long addr)
 	 * to `BL <caller>`, which is at `addr + 4` bytes in either case.
 	 *
 	 */
-	if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
-		return addr + AARCH64_INSN_SIZE;
+	if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)) {
+		addr += AARCH64_INSN_SIZE;
+		goto out;
+	}
 
 	/*
 	 * When using patchable-function-entry with pre-function NOPs, addr is
@@ -139,6 +141,13 @@ unsigned long ftrace_call_adjust(unsigned long addr)
 	/* Skip the first NOP after function entry */
 	addr += AARCH64_INSN_SIZE;
 
+out:
+	if (IS_ENABLED(CONFIG_FUNCTION_METADATA)) {
+		if (IS_ENABLED(CONFIG_FUNCTION_ALIGNMENT_8B))
+			addr += 2 * AARCH64_INSN_SIZE;
+		else
+			addr += AARCH64_INSN_SIZE;
+	}
 	return addr;
 }
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fe2fa3a9a0fd..fa6e9c6f5cd5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2514,6 +2514,21 @@ config PREFIX_SYMBOLS
 	def_bool y
 	depends on CALL_PADDING && !CFI_CLANG
 
+config FUNCTION_METADATA
+	bool "Per-function metadata storage support"
+	default y
+	select CALL_PADDING
+	help
+	  Support per-function metadata storage for kernel functions, and
+	  get the metadata of the function by its address with almost no
+	  overhead.
+
+	  The index of the metadata will be stored in the function padding
+	  and consumes 5-bytes. The spare space of the padding is enough
+	  with CALL_PADDING and FUNCTION_ALIGNMENT_16B if CALL_THUNKS or
+	  CFI_CLANG not enabled. Otherwise, we need extra 5-bytes in the
+	  function padding, which will increases text ~1%.
+
 menuconfig CPU_MITIGATIONS
 	bool "Mitigations for CPU vulnerabilities"
 	default y
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5b773b34768d..2766c9d755d7 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -240,13 +240,18 @@ ifdef CONFIG_MITIGATION_SLS
 endif
 
 ifdef CONFIG_CALL_PADDING
-PADDING_CFLAGS := -fpatchable-function-entry=$(CONFIG_FUNCTION_PADDING_BYTES),$(CONFIG_FUNCTION_PADDING_BYTES)
-KBUILD_CFLAGS += $(PADDING_CFLAGS)
-export PADDING_CFLAGS
+  __padding_nops := $(CONFIG_FUNCTION_PADDING_BYTES)
+  ifneq ($(and $(CONFIG_FUNCTION_METADATA),$(CONFIG_CALL_THUNKS),$(CONFIG_CFI_CLANG)),)
+    __padding_nops := $(shell echo $(__padding_nops) + 5 | bc)
+  endif
+
+  PADDING_CFLAGS := -fpatchable-function-entry=$(__padding_nops),$(__padding_nops)
+  KBUILD_CFLAGS += $(PADDING_CFLAGS)
+  export PADDING_CFLAGS
 
-PADDING_RUSTFLAGS := -Zpatchable-function-entry=$(CONFIG_FUNCTION_PADDING_BYTES),$(CONFIG_FUNCTION_PADDING_BYTES)
-KBUILD_RUSTFLAGS += $(PADDING_RUSTFLAGS)
-export PADDING_RUSTFLAGS
+  PADDING_RUSTFLAGS := -Zpatchable-function-entry=$(__padding_nops),$(__padding_nops)
+  KBUILD_RUSTFLAGS += $(PADDING_RUSTFLAGS)
+  export PADDING_RUSTFLAGS
 endif
 
 KBUILD_LDFLAGS += -m elf_$(UTS_MACHINE)
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index f9cb4d07df58..cf96c990f0c1 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -4,6 +4,26 @@
 
 #include <asm/ptrace.h>
 
+#ifdef CONFIG_FUNCTION_METADATA
+#ifdef CONFIG_CFI_CLANG
+  #ifdef CONFIG_CALL_THUNKS
+    /* use the extra 5-bytes that we reserve */
+    #define KFUNC_MD_INSN_OFFSET	(CONFIG_FUNCTION_PADDING_BYTES + 5)
+    #define KFUNC_MD_DATA_OFFSET	(CONFIG_FUNCTION_PADDING_BYTES + 4)
+  #else
+    /* use the space that CALL_THUNKS suppose to use */
+    #define KFUNC_MD_INSN_OFFSET	(5)
+    #define KFUNC_MD_DATA_OFFSET	(4)
+  #endif
+#else
+  /* use the space that CFI_CLANG suppose to use */
+  #define KFUNC_MD_INSN_OFFSET	(CONFIG_FUNCTION_PADDING_BYTES)
+  #define KFUNC_MD_DATA_OFFSET	(CONFIG_FUNCTION_PADDING_BYTES - 1)
+#endif
+
+#define KFUNC_MD_INSN_SIZE	(5)
+#endif
+
 #ifdef CONFIG_FUNCTION_TRACER
 #ifndef CC_USING_FENTRY
 # error Compiler does not support fentry?
@@ -168,4 +188,36 @@ static inline bool arch_trace_is_compat_syscall(struct pt_regs *regs)
 #endif /* !COMPILE_OFFSETS */
 #endif /* !__ASSEMBLY__ */
 
+#if !defined(__ASSEMBLY__) && defined(CONFIG_FUNCTION_METADATA)
+#include <asm/text-patching.h>
+
+static inline bool kfunc_md_arch_exist(void *ip)
+{
+	return *(u8 *)(ip - KFUNC_MD_INSN_OFFSET) == 0xB8;
+}
+
+static inline void kfunc_md_arch_pretend(u8 *insn, u32 index)
+{
+	*insn = 0xB8;
+	*(u32 *)(insn + 1) = index;
+}
+
+static inline void kfunc_md_arch_nops(u8 *insn)
+{
+	*(insn++) = BYTES_NOP1;
+	*(insn++) = BYTES_NOP1;
+	*(insn++) = BYTES_NOP1;
+	*(insn++) = BYTES_NOP1;
+	*(insn++) = BYTES_NOP1;
+}
+
+static inline int kfunc_md_arch_poke(void *ip, u8 *insn)
+{
+	text_poke(ip, insn, KFUNC_MD_INSN_SIZE);
+	text_poke_sync();
+	return 0;
+}
+
+#endif
+
 #endif /* _ASM_X86_FTRACE_H */
diff --git a/include/linux/kfunc_md.h b/include/linux/kfunc_md.h
new file mode 100644
index 000000000000..df616f0fcb36
--- /dev/null
+++ b/include/linux/kfunc_md.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KFUNC_MD_H
+#define _LINUX_KFUNC_MD_H
+
+#include <linux/kernel.h>
+
+struct kfunc_md {
+	int users;
+	/* we can use this field later, make sure it is 8-bytes aligned
+	 * for now.
+	 */
+	int pad0;
+	void *func;
+};
+
+extern struct kfunc_md *kfunc_mds;
+
+struct kfunc_md *kfunc_md_find(void *ip);
+struct kfunc_md *kfunc_md_get(void *ip);
+void kfunc_md_put(struct kfunc_md *meta);
+void kfunc_md_put_by_ip(void *ip);
+void kfunc_md_lock(void);
+void kfunc_md_unlock(void);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index cef5377c25cd..79d63f0f2496 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_TRACE_CLOCK) += trace/
 obj-$(CONFIG_RING_BUFFER) += trace/
 obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_RETHOOK) += trace/
+obj-$(CONFIG_FUNCTION_METADATA) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 057cd975d014..9780ee3f8d8d 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_FTRACE_RECORD_RECURSION) += trace_recursion_record.o
 obj-$(CONFIG_FPROBE) += fprobe.o
 obj-$(CONFIG_RETHOOK) += rethook.o
 obj-$(CONFIG_FPROBE_EVENTS) += trace_fprobe.o
+obj-$(CONFIG_FUNCTION_METADATA) += kfunc_md.o
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
 obj-$(CONFIG_RV) += rv/
 
diff --git a/kernel/trace/kfunc_md.c b/kernel/trace/kfunc_md.c
new file mode 100644
index 000000000000..362fa78f13c8
--- /dev/null
+++ b/kernel/trace/kfunc_md.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/slab.h>
+#include <linux/memory.h>
+#include <linux/rcupdate.h>
+#include <linux/ftrace.h>
+#include <linux/kfunc_md.h>
+
+#define ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(struct kfunc_md))
+
+static u32 kfunc_md_count = ENTRIES_PER_PAGE, kfunc_md_used;
+struct kfunc_md __rcu *kfunc_mds;
+EXPORT_SYMBOL_GPL(kfunc_mds);
+
+static DEFINE_MUTEX(kfunc_md_mutex);
+
+
+void kfunc_md_unlock(void)
+{
+	mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_unlock);
+
+void kfunc_md_lock(void)
+{
+	mutex_lock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_lock);
+
+static u32 kfunc_md_get_index(void *ip)
+{
+	return *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);
+}
+
+static void kfunc_md_init(struct kfunc_md *mds, u32 start, u32 end)
+{
+	u32 i;
+
+	for (i = start; i < end; i++)
+		mds[i].users = 0;
+}
+
+static int kfunc_md_page_order(void)
+{
+	return fls(DIV_ROUND_UP(kfunc_md_count, ENTRIES_PER_PAGE)) - 1;
+}
+
+/* Get next usable function metadata. On success, return the usable
+ * kfunc_md and store the index of it to *index. If no usable kfunc_md is
+ * found in kfunc_mds, a larger array will be allocated.
+ */
+static struct kfunc_md *kfunc_md_get_next(u32 *index)
+{
+	struct kfunc_md *new_mds, *mds;
+	u32 i, order;
+
+	mds = rcu_dereference(kfunc_mds);
+	if (mds == NULL) {
+		order = kfunc_md_page_order();
+		new_mds = (void *)__get_free_pages(GFP_KERNEL, order);
+		if (!new_mds)
+			return NULL;
+		kfunc_md_init(new_mds, 0, kfunc_md_count);
+		/* The first time to initialize kfunc_mds, so it is not
+		 * used anywhere yet, and we can update it directly.
+		 */
+		rcu_assign_pointer(kfunc_mds, new_mds);
+		mds = new_mds;
+	}
+
+	if (likely(kfunc_md_used < kfunc_md_count)) {
+		/* maybe we can manage the used function metadata entry
+		 * with a bit map ?
+		 */
+		for (i = 0; i < kfunc_md_count; i++) {
+			if (!mds[i].users) {
+				kfunc_md_used++;
+				*index = i;
+				mds[i].users++;
+				return mds + i;
+			}
+		}
+	}
+
+	order = kfunc_md_page_order();
+	/* no available function metadata, so allocate a bigger function
+	 * metadata array.
+	 */
+	new_mds = (void *)__get_free_pages(GFP_KERNEL, order + 1);
+	if (!new_mds)
+		return NULL;
+
+	memcpy(new_mds, mds, kfunc_md_count * sizeof(*new_mds));
+	kfunc_md_init(new_mds, kfunc_md_count, kfunc_md_count * 2);
+
+	rcu_assign_pointer(kfunc_mds, new_mds);
+	synchronize_rcu();
+	free_pages((u64)mds, order);
+
+	mds = new_mds + kfunc_md_count;
+	*index = kfunc_md_count;
+	kfunc_md_count <<= 1;
+	kfunc_md_used++;
+	mds->users++;
+
+	return mds;
+}
+
+static int kfunc_md_text_poke(void *ip, void *insn, void *nop)
+{
+	void *target;
+	int ret = 0;
+	u8 *prog;
+
+	target = ip - KFUNC_MD_INSN_OFFSET;
+	mutex_lock(&text_mutex);
+	if (insn) {
+		if (!memcmp(target, insn, KFUNC_MD_INSN_SIZE))
+			goto out;
+
+		if (memcmp(target, nop, KFUNC_MD_INSN_SIZE)) {
+			ret = -EBUSY;
+			goto out;
+		}
+		prog = insn;
+	} else {
+		if (!memcmp(target, nop, KFUNC_MD_INSN_SIZE))
+			goto out;
+		prog = nop;
+	}
+
+	ret = kfunc_md_arch_poke(target, prog);
+out:
+	mutex_unlock(&text_mutex);
+	return ret;
+}
+
+static bool __kfunc_md_put(struct kfunc_md *md)
+{
+	u8 nop_insn[KFUNC_MD_INSN_SIZE];
+
+	if (WARN_ON_ONCE(md->users <= 0))
+		return false;
+
+	md->users--;
+	if (md->users > 0)
+		return false;
+
+	if (!kfunc_md_arch_exist(md->func))
+		return false;
+
+	kfunc_md_arch_nops(nop_insn);
+	/* release the metadata by recovering the function padding to NOPS */
+	kfunc_md_text_poke(md->func, NULL, nop_insn);
+	/* TODO: we need a way to shrink the array "kfunc_mds" */
+	kfunc_md_used--;
+
+	return true;
+}
+
+/* Decrease the reference of the md, release it if "md->users <= 0" */
+void kfunc_md_put(struct kfunc_md *md)
+{
+	mutex_lock(&kfunc_md_mutex);
+	__kfunc_md_put(md);
+	mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_put);
+
+/* Get a exist metadata by the function address, and NULL will be returned
+ * if not exist.
+ *
+ * NOTE: rcu lock should be held during reading the metadata, and
+ * kfunc_md_lock should be held if writing happens.
+ */
+struct kfunc_md *kfunc_md_find(void *ip)
+{
+	struct kfunc_md *md;
+	u32 index;
+
+	if (kfunc_md_arch_exist(ip)) {
+		index = kfunc_md_get_index(ip);
+		if (WARN_ON_ONCE(index >= kfunc_md_count))
+			return NULL;
+
+		md = &kfunc_mds[index];
+		return md;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kfunc_md_find);
+
+void kfunc_md_put_by_ip(void *ip)
+{
+	struct kfunc_md *md;
+
+	mutex_lock(&kfunc_md_mutex);
+	md = kfunc_md_find(ip);
+	if (md)
+		__kfunc_md_put(md);
+	mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_put_by_ip);
+
+/* Get a exist metadata by the function address, and create one if not
+ * exist. Reference of the metadata will increase 1.
+ *
+ * NOTE: always call this function with kfunc_md_lock held, and all
+ * updating to metadata should also hold the kfunc_md_lock.
+ */
+struct kfunc_md *kfunc_md_get(void *ip)
+{
+	u8 nop_insn[KFUNC_MD_INSN_SIZE], insn[KFUNC_MD_INSN_SIZE];
+	struct kfunc_md *md;
+	u32 index;
+
+	md = kfunc_md_find(ip);
+	if (md) {
+		md->users++;
+		return md;
+	}
+
+	md = kfunc_md_get_next(&index);
+	if (!md)
+		return NULL;
+
+	kfunc_md_arch_pretend(insn, index);
+	kfunc_md_arch_nops(nop_insn);
+
+	if (kfunc_md_text_poke(ip, insn, nop_insn)) {
+		kfunc_md_used--;
+		md->users = 0;
+		return NULL;
+	}
+	md->func = ip;
+
+	return md;
+}
+EXPORT_SYMBOL_GPL(kfunc_md_get);
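To make the intended calling convention of the new API concrete, here is a minimal usage sketch based on the locking rules in the patch's comments (kfunc_md_lock() around creation and updates, RCU around lookups). The helper names attach_metadata(), metadata_present() and detach_metadata() are made up for illustration and are not part of the patch:

    #include <linux/kfunc_md.h>
    #include <linux/rcupdate.h>

    /* Creation/update path: serialized by kfunc_md_lock(). */
    static int attach_metadata(void *func_addr)
    {
    	struct kfunc_md *md;

    	kfunc_md_lock();
    	md = kfunc_md_get(func_addr);	/* allocates and patches the padding on first use */
    	kfunc_md_unlock();

    	return md ? 0 : -ENOMEM;
    }

    /* Fast-path lookup, e.g. from a trampoline: readers only need RCU. */
    static bool metadata_present(void *func_addr)
    {
    	bool present;

    	rcu_read_lock();
    	present = kfunc_md_find(func_addr) != NULL;
    	rcu_read_unlock();

    	return present;
    }

    /* Teardown: drops the reference taken by kfunc_md_get(). */
    static void detach_metadata(void *func_addr)
    {
    	kfunc_md_put_by_ip(func_addr);
    }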
For now, there isn't a way to set and get per-function metadata with low
overhead, which is inconvenient in some situations. Take the BPF trampoline
for example: we need to create a trampoline for each kernel function, as we
have to store some information about the function in the trampoline, such
as the BPF progs, the function arg count, etc. Creating these trampolines
has a noticeable performance overhead and memory consumption.

With per-function metadata storage support, we can store this information
in the metadata and create a single global BPF trampoline for all kernel
functions. In the global trampoline, we get the information we need from
the function metadata through the ip (function address) with almost no
overhead.

Another beneficiary can be ftrace. For now, all the kernel functions that
are enabled by dynamic ftrace will be added to a filter hash if there is
more than one callback, and a hash lookup happens when the traced functions
are called, which has an impact on performance; see
__ftrace_ops_list_func() -> ftrace_ops_test(). With per-function metadata
support, we can store in the metadata whether the callback is enabled for
the kernel function.

This patch supports per-function metadata storage in the function padding;
the previous discussion can be found in [1]. Generally speaking, we have
two ways to implement this feature:

1. Create a function metadata array, and prepend an insn which can hold
   the index of the function metadata in the array, storing the insn in
   the function padding.
2. Allocate the function metadata with kmalloc(), and prepend an insn
   which holds the pointer to the metadata, storing the insn in the
   function padding.

Compared with way 2, way 1 consumes less space, but we need to do more
work on the global function metadata array. We implement this feature with
way 1.

We implement this feature in the following way for x86:

With CONFIG_CALL_PADDING enabled, there will be 16 bytes (or more) of
padding space before every kernel function, and some kernel features can
use it, such as MITIGATION_CALL_DEPTH_TRACKING, CFI_CLANG, FINEIBT, etc.
In my research, MITIGATION_CALL_DEPTH_TRACKING consumes the tail 9 bytes
of the function padding, and FINEIBT + CFI_CLANG consume the head 7 bytes.
So there is no space for us if MITIGATION_CALL_DEPTH_TRACKING and
CFI_CLANG are both enabled.

In x86, we need 5-bytes to prepend a "mov %eax xxx" insn, which can hold
a 4-bytes index. So we have following logic:

1. use the head 5-bytes if CFI_CLANG is not enabled
2. use the tail 5-bytes if MITIGATION_CALL_DEPTH_TRACKING is not enabled
3. compile the kernel with extra 5-bytes padding if
   MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled.

In the third case, we compile the kernel with a function padding of 21
bytes, which means the real function is not 16-byte aligned any more. In
[2], I tested the performance of the kernel with different padding, and it
seems that the extra 5 bytes don't have an impact on performance. However,
it's a big change to make kernel functions not 16-byte aligned, and I'm
not sure if there are any other influences. Another choice is to compile
the kernel with 32-byte alignment if there is no space available for us in
the function padding, but this will increase the text size by ~5%. (And
I'm not sure which method to use.)
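As a rough illustration of the x86 scheme described above (not code from the patch; the helper names are invented, and the offsets follow the patch's KFUNC_MD_DATA_OFFSET definitions): the index is carried in the 32-bit immediate of a "mov $imm32, %eax" written into the padding, and the lookup is a single load plus an array index.

    /* Write side: build the 5-byte "mov $index, %eax" that gets poked into
     * the function padding; 0xB8 is the opcode for mov imm32 to %eax.
     */
    static inline void encode_index(u8 *insn, u32 index)
    {
    	insn[0] = 0xB8;
    	*(u32 *)(insn + 1) = index;
    }

    /* Read side: the index sits a fixed number of bytes before the function
     * entry, so no hash table is involved on the lookup path.
     */
    static inline struct kfunc_md *ip_to_md(void *ip)
    {
    	u32 index = *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);

    	return &kfunc_mds[index];	/* NOP/bounds checks omitted in this sketch */
    }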
We implement this feature in the following way for arm64:

The function padding is already used by ftrace if
CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS is enabled, which stores the pointer
of the callback directly in the function padding and consumes 8 bytes,
since commit baaf553d3bc3 ("arm64: Implement
HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS") [3]. So we can directly store the
index in the function padding too, without prepending an insn.

With CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS enabled, the function is 8-byte
aligned, and we compile the kernel with an extra 8 bytes (2 NOPs) of
padding space. Otherwise, the function is 4-byte aligned, and only an
extra 4 bytes (1 NOP) is needed.

However, we have the same problem that Mark describes in [3]: we can't use
the function padding together with CFI_CLANG, as that can make clang
compile a wrong offset to the pre-function type hash. He said he was
working with others on this problem two years ago. Hi Mark, is there any
progress on this problem?

I tested this feature by setting metadata for all the kernel functions,
and it takes 0.7s for 70k+ functions, not bad :/

Maybe we should split this patch into 3 patches :/

Link: https://lore.kernel.org/bpf/CADxym3anLzM6cAkn_z71GDd_VeKiqqk1ts=xuiP7pr4PO6USPA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/bpf/CADxym3af+CU5Mx8myB8UowdXSc3wJOqWyH4oyq+eXKahXBTXyg@mail.gmail.com/ [2]
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
v2:
- add support for arm64
- split out arch relevant code
- refactor the commit log
---
 arch/arm64/Kconfig              |  15 ++
 arch/arm64/Makefile             |  23 ++-
 arch/arm64/include/asm/ftrace.h |  34 +++++
 arch/arm64/kernel/ftrace.c      |  13 +-
 arch/x86/Kconfig                |  15 ++
 arch/x86/Makefile               |  17 ++-
 arch/x86/include/asm/ftrace.h   |  52 +++++++
 include/linux/kfunc_md.h        |  25 ++++
 kernel/Makefile                 |   1 +
 kernel/trace/Makefile           |   1 +
 kernel/trace/kfunc_md.c         | 239 ++++++++++++++++++++++++++++++++
 11 files changed, 425 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/kfunc_md.h
 create mode 100644 kernel/trace/kfunc_md.c