
[bpf-next,v2] add function metadata support

Message ID: 20250226121537.752241-1-dongml2@chinatelecom.cn (mailing list archive)
State: Changes Requested
Series: [bpf-next,v2] add function metadata support

Commit Message

Menglong Dong Feb. 26, 2025, 12:15 p.m. UTC
Currently, there is no way to set and get per-function metadata with
low overhead, which is inconvenient in some situations. Take the BPF
trampoline for example: we need to create a trampoline for each kernel
function, as we have to store some per-function information in the
trampoline itself, such as the attached BPF progs, the function arg
count, etc. Creating all these trampolines costs both performance and
memory. With per-function metadata storage, we can keep this
information in the metadata instead and create one global BPF
trampoline for all kernel functions. In the global trampoline, we get
the information we need from the function metadata through the ip
(function address) with almost no overhead.
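
For illustration, the lookup from such a global trampoline boils down
to something like the following (a minimal sketch, not part of this
patch; the handler name is hypothetical, only kfunc_md_find() comes
from this series):

/* sketch: O(1) metadata lookup by function address */
static void global_trampoline_handler(unsigned long ip)
{
	struct kfunc_md *md;

	rcu_read_lock();
	md = kfunc_md_find((void *)ip);	/* index is read from the padding */
	if (md) {
		/* use the per-function data reachable through md here,
		 * e.g. the attached BPF progs and the arg count.
		 */
	}
	rcu_read_unlock();
}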

Another beneficiary can be ftrace. Currently, all kernel functions
enabled by dynamic ftrace are added to a filter hash if there is more
than one callback, and a hash lookup happens every time a traced
function is called, which impacts performance; see
__ftrace_ops_list_func() -> ftrace_ops_test(). With per-function
metadata support, we can instead record in the metadata whether the
callback is enabled for a given kernel function.
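
For illustration, the per-call check could then look roughly like this
(a sketch only; the "ftrace_enabled" field does not exist in this patch
and is purely hypothetical, only kfunc_md_find() is real):

/* sketch: replace the filter-hash lookup with a metadata flag test */
static bool kfunc_ftrace_enabled(unsigned long ip)
{
	struct kfunc_md *md;
	bool enabled = false;

	rcu_read_lock();
	md = kfunc_md_find((void *)ip);
	if (md)
		enabled = md->ftrace_enabled;	/* hypothetical field */
	rcu_read_unlock();

	return enabled;
}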

Support per-function metadata storage in the function padding; the
previous discussion can be found in [1]. Generally speaking, there are
two ways to implement this feature:

1. Create a function metadata array, and prepend an insn that holds the
index of the function's metadata in the array. Store the insn in the
function padding.

2. Allocate the function metadata with kmalloc(), and prepend an insn
that holds the pointer to the metadata. Store the insn in the function
padding.

Compared with way 2, way 1 consumes less padding space (a 4-byte index
rather than an 8-byte pointer), but it needs more work to manage the
global function metadata array. This patch implements way 1.

We implement this feature in the following way for x86:

With CONFIG_CALL_PADDING enabled, there are 16 bytes (or more) of
padding space before every kernel function, and some kernel features
can use it, such as MITIGATION_CALL_DEPTH_TRACKING, CFI_CLANG, FINEIBT,
etc.

From what I have found, MITIGATION_CALL_DEPTH_TRACKING consumes the
tail 9 bytes of the function padding, and FINEIBT + CFI_CLANG consume
the head 7 bytes. So there is no space left for us if
MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled.

On x86, we need 5 bytes to prepend a "mov $<index>, %eax" insn, which
can hold a 4-byte index. So we have the following logic:

1. use the head 5-bytes if CFI_CLANG is not enabled
2. use the tail 5-bytes if MITIGATION_CALL_DEPTH_TRACKING is not enabled
3. compile the kernel with extra 5-bytes padding if
   MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled.

In the third case, we compile the kernel with 21 bytes of function
padding, which means the function entry is no longer 16-byte aligned.
In [2], I tested the performance of the kernel with different padding
sizes, and the extra 5 bytes don't seem to have any impact on
performance. However, making kernel functions not 16-byte aligned is a
big change, and I'm not sure whether it has other effects. Another
choice is to compile the kernel with 32-byte function alignment if
there is no space available for us in the function padding, but that
increases the text size by ~5%. (I'm not sure which method to use.)
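
Whichever padding layout is used, the index itself is encoded as a
"mov $<index>, %eax" insn (the 0xB8 opcode followed by a 4-byte
immediate) written into the padding at ip - KFUNC_MD_INSN_OFFSET. A
minimal sketch, based on the x86 helpers added by this patch:

/* write the 5-byte insn into a buffer that is later text_poke()d into
 * the function padding; 0xB8 is "mov imm32, %eax"
 */
static inline void kfunc_md_arch_pretend(u8 *insn, u32 index)
{
	*insn = 0xB8;
	*(u32 *)(insn + 1) = index;
}

/* read the index back: the 4-byte immediate sits right after 0xB8 */
static u32 kfunc_md_get_index(void *ip)
{
	return *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);
}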

We implement this feature in the following way for arm64:

The function padding is already used by ftrace when
CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS is enabled: since commit
baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")
[3], the callback pointer is stored directly in the function padding,
which consumes 8 bytes. So we can store the index directly in the
function padding too, without prepending an insn. With
CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS enabled, functions are 8-byte
aligned, and we compile the kernel with 8 extra bytes (2 NOPs) of
padding space. Otherwise, functions are 4-byte aligned, and only 4
extra bytes (1 NOP) are needed.
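
So on arm64 the index is stored as plain data rather than as an insn; a
minimal sketch, based on the arm64 helpers added by this patch (the
padding slot holds a NOP while unused, so any non-NOP value means the
metadata exists):

static inline bool kfunc_md_arch_exist(void *ip)
{
	/* the slot still holds a NOP when no index has been stored */
	return !aarch64_insn_is_nop(*(u32 *)(ip - KFUNC_MD_INSN_OFFSET));
}

static inline void kfunc_md_arch_pretend(u8 *insn, u32 index)
{
	/* the 4-byte index is written into the padding slot as-is */
	*(u32 *)insn = index;
}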

However, we have the same problem that Mark described in [3]: we can't
use the function padding together with CFI_CLANG, as that can make
clang emit a wrong offset to the pre-function type hash. He said he was
working with others on this problem 2 years ago. Hi Mark, is there any
progress on this problem?

I tested this feature by setting metadata for all the kernel
functions, and it takes about 0.7s for 70k+ functions, not bad :/
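
Setting and clearing the metadata for one function boils down to the
exported API (a minimal sketch of the intended usage; the function name
attach_fn_metadata() is just an example, and what is stored behind the
md is up to the caller):

static int attach_fn_metadata(void *ip)
{
	struct kfunc_md *md;

	kfunc_md_lock();
	md = kfunc_md_get(ip);	/* find or create, takes a reference */
	kfunc_md_unlock();
	if (!md)
		return -ENOMEM;

	/* ... use the per-function storage reachable through md ... */

	kfunc_md_put(md);	/* drop the reference; the padding is
				 * restored to NOPs when it hits zero
				 */
	return 0;
}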

Maybe we should split this patch into 3 patches :/

Link: https://lore.kernel.org/bpf/CADxym3anLzM6cAkn_z71GDd_VeKiqqk1ts=xuiP7pr4PO6USPA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/bpf/CADxym3af+CU5Mx8myB8UowdXSc3wJOqWyH4oyq+eXKahXBTXyg@mail.gmail.com/ [2]
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
v2:
- add support for arm64
- split out arch relevant code
- refactor the commit log
---
 arch/arm64/Kconfig              |  15 ++
 arch/arm64/Makefile             |  23 ++-
 arch/arm64/include/asm/ftrace.h |  34 +++++
 arch/arm64/kernel/ftrace.c      |  13 +-
 arch/x86/Kconfig                |  15 ++
 arch/x86/Makefile               |  17 ++-
 arch/x86/include/asm/ftrace.h   |  52 +++++++
 include/linux/kfunc_md.h        |  25 ++++
 kernel/Makefile                 |   1 +
 kernel/trace/Makefile           |   1 +
 kernel/trace/kfunc_md.c         | 239 ++++++++++++++++++++++++++++++++
 11 files changed, 425 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/kfunc_md.h
 create mode 100644 kernel/trace/kfunc_md.c

Comments

kernel test robot Feb. 27, 2025, 3:18 p.m. UTC | #1
Hi Menglong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/add-function-metadata-support/20250226-202312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250226121537.752241-1-dongml2%40chinatelecom.cn
patch subject: [PATCH bpf-next v2] add function metadata support
config: i386-kismet-CONFIG_CALL_PADDING-CONFIG_FUNCTION_METADATA-0-0 (https://download.01.org/0day-ci/archive/20250227/202502272339.DtueAali-lkp@intel.com/config)
reproduce: (https://download.01.org/0day-ci/archive/20250227/202502272339.DtueAali-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502272339.DtueAali-lkp@intel.com/

kismet warnings: (new ones prefixed by >>)
>> kismet: WARNING: unmet direct dependencies detected for CALL_PADDING when selected by FUNCTION_METADATA
   WARNING: unmet direct dependencies detected for CALL_PADDING
     Depends on [n]: CC_HAS_ENTRY_PADDING [=y] && OBJTOOL [=n]
     Selected by [y]:
     - FUNCTION_METADATA [=y]
kernel test robot Feb. 27, 2025, 4:32 p.m. UTC | #2
Hi Menglong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/add-function-metadata-support/20250226-202312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250226121537.752241-1-dongml2%40chinatelecom.cn
patch subject: [PATCH bpf-next v2] add function metadata support
config: x86_64-randconfig-r112-20250227 (https://download.01.org/0day-ci/archive/20250228/202502280004.QmU2zIb5-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250228/202502280004.QmU2zIb5-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502280004.QmU2zIb5-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/trace/kfunc_md.c:12:23: sparse: sparse: symbol 'kfunc_mds' redeclared with different type (different address spaces):
   kernel/trace/kfunc_md.c:12:23: sparse:    struct kfunc_md [noderef] __rcu *[addressable] [toplevel] kfunc_mds
   kernel/trace/kfunc_md.c: note: in included file:
   include/linux/kfunc_md.h:16:24: sparse: note: previously declared as:
   include/linux/kfunc_md.h:16:24: sparse:    struct kfunc_md *extern [addressable] [toplevel] kfunc_mds
>> kernel/trace/kfunc_md.c:186:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct kfunc_md *md @@     got struct kfunc_md [noderef] __rcu * @@
   kernel/trace/kfunc_md.c:186:20: sparse:     expected struct kfunc_md *md
   kernel/trace/kfunc_md.c:186:20: sparse:     got struct kfunc_md [noderef] __rcu *

vim +186 kernel/trace/kfunc_md.c

   169	
   170	/* Get a exist metadata by the function address, and NULL will be returned
   171	 * if not exist.
   172	 *
   173	 * NOTE: rcu lock should be held during reading the metadata, and
   174	 * kfunc_md_lock should be held if writing happens.
   175	 */
   176	struct kfunc_md *kfunc_md_find(void *ip)
   177	{
   178		struct kfunc_md *md;
   179		u32 index;
   180	
   181		if (kfunc_md_arch_exist(ip)) {
   182			index = kfunc_md_get_index(ip);
   183			if (WARN_ON_ONCE(index >= kfunc_md_count))
   184				return NULL;
   185	
 > 186			md = &kfunc_mds[index];
   187			return md;
   188		}
   189		return NULL;
   190	}
   191	EXPORT_SYMBOL_GPL(kfunc_md_find);
   192
Peter Zijlstra Feb. 27, 2025, 4:53 p.m. UTC | #3
On Wed, Feb 26, 2025 at 08:15:37PM +0800, Menglong Dong wrote:

> In x86, we need 5-bytes to prepend a "mov %eax xxx" insn, which can hold
> a 4-bytes index. So we have following logic:
> 
> 1. use the head 5-bytes if CFI_CLANG is not enabled
> 2. use the tail 5-bytes if MITIGATION_CALL_DEPTH_TRACKING is not enabled
> 3. compile the kernel with extra 5-bytes padding if
>    MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled.

3) would result in 16+5 bytes padding, what does that do for alignment?

Functions should be 16 byte aligned.

Also, did you make sure all the code in arch/x86/kernel/alternative.c
still works? Because adding extra padding in the CFI_CLANG case moves
where the CFI bytes are emitted and all the CFI rewriting code goes
sideways.
kernel test robot Feb. 27, 2025, 5:48 p.m. UTC | #4
Hi Menglong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/add-function-metadata-support/20250226-202312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250226121537.752241-1-dongml2%40chinatelecom.cn
patch subject: [PATCH bpf-next v2] add function metadata support
config: i386-randconfig-062-20250227 (https://download.01.org/0day-ci/archive/20250228/202502280123.BNaCYT2A-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250228/202502280123.BNaCYT2A-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502280123.BNaCYT2A-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/trace/kfunc_md.c:12:23: sparse: sparse: symbol 'kfunc_mds' redeclared with different type (different address spaces):
   kernel/trace/kfunc_md.c:12:23: sparse:    struct kfunc_md [noderef] __rcu *[addressable] [toplevel] kfunc_mds
   kernel/trace/kfunc_md.c: note: in included file:
   include/linux/kfunc_md.h:16:24: sparse: note: previously declared as:
   include/linux/kfunc_md.h:16:24: sparse:    struct kfunc_md *extern [addressable] [toplevel] kfunc_mds
   kernel/trace/kfunc_md.c:186:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct kfunc_md *md @@     got struct kfunc_md [noderef] __rcu * @@
   kernel/trace/kfunc_md.c:186:20: sparse:     expected struct kfunc_md *md
   kernel/trace/kfunc_md.c:186:20: sparse:     got struct kfunc_md [noderef] __rcu *
>> kernel/trace/kfunc_md.c:98:25: sparse: sparse: non size-preserving pointer to integer cast

vim +98 kernel/trace/kfunc_md.c

    10	
    11	static u32 kfunc_md_count = ENTRIES_PER_PAGE, kfunc_md_used;
  > 12	struct kfunc_md __rcu *kfunc_mds;
    13	EXPORT_SYMBOL_GPL(kfunc_mds);
    14	
    15	static DEFINE_MUTEX(kfunc_md_mutex);
    16	
    17	
    18	void kfunc_md_unlock(void)
    19	{
    20		mutex_unlock(&kfunc_md_mutex);
    21	}
    22	EXPORT_SYMBOL_GPL(kfunc_md_unlock);
    23	
    24	void kfunc_md_lock(void)
    25	{
    26		mutex_lock(&kfunc_md_mutex);
    27	}
    28	EXPORT_SYMBOL_GPL(kfunc_md_lock);
    29	
    30	static u32 kfunc_md_get_index(void *ip)
    31	{
    32		return *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);
    33	}
    34	
    35	static void kfunc_md_init(struct kfunc_md *mds, u32 start, u32 end)
    36	{
    37		u32 i;
    38	
    39		for (i = start; i < end; i++)
    40			mds[i].users = 0;
    41	}
    42	
    43	static int kfunc_md_page_order(void)
    44	{
    45		return fls(DIV_ROUND_UP(kfunc_md_count, ENTRIES_PER_PAGE)) - 1;
    46	}
    47	
    48	/* Get next usable function metadata. On success, return the usable
    49	 * kfunc_md and store the index of it to *index. If no usable kfunc_md is
    50	 * found in kfunc_mds, a larger array will be allocated.
    51	 */
    52	static struct kfunc_md *kfunc_md_get_next(u32 *index)
    53	{
    54		struct kfunc_md *new_mds, *mds;
    55		u32 i, order;
    56	
    57		mds = rcu_dereference(kfunc_mds);
    58		if (mds == NULL) {
    59			order = kfunc_md_page_order();
    60			new_mds = (void *)__get_free_pages(GFP_KERNEL, order);
    61			if (!new_mds)
    62				return NULL;
    63			kfunc_md_init(new_mds, 0, kfunc_md_count);
    64			/* The first time to initialize kfunc_mds, so it is not
    65			 * used anywhere yet, and we can update it directly.
    66			 */
    67			rcu_assign_pointer(kfunc_mds, new_mds);
    68			mds = new_mds;
    69		}
    70	
    71		if (likely(kfunc_md_used < kfunc_md_count)) {
    72			/* maybe we can manage the used function metadata entry
    73			 * with a bit map ?
    74			 */
    75			for (i = 0; i < kfunc_md_count; i++) {
    76				if (!mds[i].users) {
    77					kfunc_md_used++;
    78					*index = i;
    79					mds[i].users++;
    80					return mds + i;
    81				}
    82			}
    83		}
    84	
    85		order = kfunc_md_page_order();
    86		/* no available function metadata, so allocate a bigger function
    87		 * metadata array.
    88		 */
    89		new_mds = (void *)__get_free_pages(GFP_KERNEL, order + 1);
    90		if (!new_mds)
    91			return NULL;
    92	
    93		memcpy(new_mds, mds, kfunc_md_count * sizeof(*new_mds));
    94		kfunc_md_init(new_mds, kfunc_md_count, kfunc_md_count * 2);
    95	
    96		rcu_assign_pointer(kfunc_mds, new_mds);
    97		synchronize_rcu();
  > 98		free_pages((u64)mds, order);
    99	
   100		mds = new_mds + kfunc_md_count;
   101		*index = kfunc_md_count;
   102		kfunc_md_count <<= 1;
   103		kfunc_md_used++;
   104		mds->users++;
   105	
   106		return mds;
   107	}
   108
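
For reference, the likely fixes for the warnings above are small (a
sketch only, not part of the posted patch): free_pages() takes an
unsigned long, and the __rcu annotation needs to be consistent between
the header and the C file.

/* kernel/trace/kfunc_md.c: avoid the pointer-to-int warning on 32-bit */
	free_pages((unsigned long)mds, order);

/* include/linux/kfunc_md.h: match the definition in kfunc_md.c */
extern struct kfunc_md __rcu *kfunc_mds;

/* kfunc_md_find(): dereference the __rcu pointer before indexing it */
	md = &rcu_dereference(kfunc_mds)[index];
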
kernel test robot Feb. 28, 2025, 12:43 a.m. UTC | #5
Hi Menglong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/add-function-metadata-support/20250226-202312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250226121537.752241-1-dongml2%40chinatelecom.cn
patch subject: [PATCH bpf-next v2] add function metadata support
config: i386-buildonly-randconfig-002-20250227 (https://download.01.org/0day-ci/archive/20250228/202502280842.nI3PwNwz-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250228/202502280842.nI3PwNwz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502280842.nI3PwNwz-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/trace/kfunc_md.c: In function 'kfunc_md_get_next':
>> kernel/trace/kfunc_md.c:98:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      98 |         free_pages((u64)mds, order);
         |                    ^

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for CALL_PADDING
   Depends on [n]: CC_HAS_ENTRY_PADDING [=y] && OBJTOOL [=n]
   Selected by [y]:
   - FUNCTION_METADATA [=y]


vim +98 kernel/trace/kfunc_md.c

    47	
    48	/* Get next usable function metadata. On success, return the usable
    49	 * kfunc_md and store the index of it to *index. If no usable kfunc_md is
    50	 * found in kfunc_mds, a larger array will be allocated.
    51	 */
    52	static struct kfunc_md *kfunc_md_get_next(u32 *index)
    53	{
    54		struct kfunc_md *new_mds, *mds;
    55		u32 i, order;
    56	
    57		mds = rcu_dereference(kfunc_mds);
    58		if (mds == NULL) {
    59			order = kfunc_md_page_order();
    60			new_mds = (void *)__get_free_pages(GFP_KERNEL, order);
    61			if (!new_mds)
    62				return NULL;
    63			kfunc_md_init(new_mds, 0, kfunc_md_count);
    64			/* The first time to initialize kfunc_mds, so it is not
    65			 * used anywhere yet, and we can update it directly.
    66			 */
    67			rcu_assign_pointer(kfunc_mds, new_mds);
    68			mds = new_mds;
    69		}
    70	
    71		if (likely(kfunc_md_used < kfunc_md_count)) {
    72			/* maybe we can manage the used function metadata entry
    73			 * with a bit map ?
    74			 */
    75			for (i = 0; i < kfunc_md_count; i++) {
    76				if (!mds[i].users) {
    77					kfunc_md_used++;
    78					*index = i;
    79					mds[i].users++;
    80					return mds + i;
    81				}
    82			}
    83		}
    84	
    85		order = kfunc_md_page_order();
    86		/* no available function metadata, so allocate a bigger function
    87		 * metadata array.
    88		 */
    89		new_mds = (void *)__get_free_pages(GFP_KERNEL, order + 1);
    90		if (!new_mds)
    91			return NULL;
    92	
    93		memcpy(new_mds, mds, kfunc_md_count * sizeof(*new_mds));
    94		kfunc_md_init(new_mds, kfunc_md_count, kfunc_md_count * 2);
    95	
    96		rcu_assign_pointer(kfunc_mds, new_mds);
    97		synchronize_rcu();
  > 98		free_pages((u64)mds, order);
    99	
   100		mds = new_mds + kfunc_md_count;
   101		*index = kfunc_md_count;
   102		kfunc_md_count <<= 1;
   103		kfunc_md_used++;
   104		mds->users++;
   105	
   106		return mds;
   107	}
   108
Menglong Dong Feb. 28, 2025, 9:53 a.m. UTC | #6
On Fri, Feb 28, 2025 at 12:53 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 26, 2025 at 08:15:37PM +0800, Menglong Dong wrote:
>
> > In x86, we need 5-bytes to prepend a "mov %eax xxx" insn, which can hold
> > a 4-bytes index. So we have following logic:
> >
> > 1. use the head 5-bytes if CFI_CLANG is not enabled
> > 2. use the tail 5-bytes if MITIGATION_CALL_DEPTH_TRACKING is not enabled
> > 3. compile the kernel with extra 5-bytes padding if
> >    MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled.
>
> 3) would result in 16+5 bytes padding, what does that do for alignment?

Hi Peter, thank you for your reply~

Yeah, it will make the function not 16-byte aligned, and this is the
point that I hesitate about the most.

In the link below, I tested the performance with 16+5 bytes of padding,
and it seems that the performance is not impacted:

https://lore.kernel.org/bpf/CADxym3af+CU5Mx8myB8UowdXSc3wJOqWyH4oyq+eXKahXBTXyg@mail.gmail.com/

However, it may have other effects if the function is unaligned;
I don't know either. :/

Do you have any advice here? For example, would it be better to make
the padding 32 bytes instead in case 3? :/

>
> Functions should be 16 byte aligned.
>
> Also, did you make sure all the code in arch/x86/kernel/alternative.c
> still works? Because adding extra padding in the CFI_CLANG case moves
> where the CFI bytes are emitted and all the CFI rewriting code goes
> sideways.

I tested it a little by enabling CFI_CLANG and the extra 5-bytes
padding. It works fine, as mostly CFI_CLANG use
CONFIG_FUNCTION_PADDING_BYTES to find the tags. I'll
do more testing on CFI_CLANG to make sure everything goes
well.

Thanks!
Menglong Dong

>
Peter Zijlstra Feb. 28, 2025, 10:26 a.m. UTC | #7
On Fri, Feb 28, 2025 at 05:53:07PM +0800, Menglong Dong wrote:

> I tested it a little by enabling CFI_CLANG and the extra 5-bytes
> padding. It works fine, as mostly CFI_CLANG use
> CONFIG_FUNCTION_PADDING_BYTES to find the tags. I'll
> do more testing on CFI_CLANG to make sure everything goes
> well.

I don't think you understand; please read:

arch/x86/kernel/alternative.c:__apply_fineibt() 

and all the code involved with patching FineIBT. I think you'll find it
very broken if you change anything here.

Can you post an actual function preamble from a kernel with
CONFIG_FINEIBT=y with your changes on?

Ex.

$ objdump -wdr build/kernel/futex/core.o

Disassembly of section .text:

0000000000000000 <__cfi_futex_hash>:
       0:       b9 93 0c f9 ad          mov    $0xadf90c93,%ecx

0000000000000005 <.Ltmp0>:
       5:       90                      nop
       6:       90                      nop
       7:       90                      nop
       8:       90                      nop
       9:       90                      nop
       a:       90                      nop
       b:       90                      nop
       c:       90                      nop
       d:       90                      nop
       e:       90                      nop
       f:       90                      nop

0000000000000010 <futex_hash>:
      10:       f3 0f 1e fa             endbr64
      14:       e8 00 00 00 00          call   19 <futex_hash+0x9>      15: R_X86_64_PLT32      __fentry__-0x4
      19:       8b 47 10                mov    0x10(%rdi),%eax


Any change to the layout here *WILL* break the FineIBT code.


If you want to test, make sure your build has FINEIBT=y and boot on an
Intel CPU that has CET-IBT (alderlake and later).
Peter Zijlstra Feb. 28, 2025, 10:29 a.m. UTC | #8
On Fri, Feb 28, 2025 at 11:26:46AM +0100, Peter Zijlstra wrote:
> On Fri, Feb 28, 2025 at 05:53:07PM +0800, Menglong Dong wrote:
> 
> > I tested it a little by enabling CFI_CLANG and the extra 5-bytes
> > padding. It works fine, as mostly CFI_CLANG use
> > CONFIG_FUNCTION_PADDING_BYTES to find the tags. I'll
> > do more testing on CFI_CLANG to make sure everything goes
> > well.
> 
> I don't think you understand; please read:
> 
> arch/x86/kernel/alternative.c:__apply_fineibt() 
> 
> and all the code involved with patching FineIBT. I think you'll find it
> very broken if you change anything here.
> 
> Can you post an actual function preamble from a kernel with
> CONFIG_FINEIBT=y with your changes on?
> 
> Ex.
> 
> $ objdump -wdr build/kernel/futex/core.o
> 
> Disassembly of section .text:
> 
> 0000000000000000 <__cfi_futex_hash>:
>        0:       b9 93 0c f9 ad          mov    $0xadf90c93,%ecx
> 
> 0000000000000005 <.Ltmp0>:
>        5:       90                      nop
>        6:       90                      nop
>        7:       90                      nop
>        8:       90                      nop
>        9:       90                      nop
>        a:       90                      nop
>        b:       90                      nop
>        c:       90                      nop
>        d:       90                      nop
>        e:       90                      nop
>        f:       90                      nop
> 
> 0000000000000010 <futex_hash>:
>       10:       f3 0f 1e fa             endbr64
>       14:       e8 00 00 00 00          call   19 <futex_hash+0x9>      15: R_X86_64_PLT32      __fentry__-0x4
>       19:       8b 47 10                mov    0x10(%rdi),%eax
> 
> 
> Any change to the layout here *WILL* break the FineIBT code.
> 
> 
> If you want to test, make sure your build has FINEIBT=y and boot on an
> Intel CPU that has CET-IBT (alderlake and later).

Oh, wait, not true, tigerlake also has IBT.
Menglong Dong Feb. 28, 2025, 1:01 p.m. UTC | #9
On Fri, Feb 28, 2025 at 6:26 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Feb 28, 2025 at 05:53:07PM +0800, Menglong Dong wrote:
>
> > I tested it a little by enabling CFI_CLANG and the extra 5-bytes
> > padding. It works fine, as mostly CFI_CLANG use
> > CONFIG_FUNCTION_PADDING_BYTES to find the tags. I'll
> > do more testing on CFI_CLANG to make sure everything goes
> > well.
>
> I don't think you understand; please read:
>
> arch/x86/kernel/alternative.c:__apply_fineibt()
>
> and all the code involved with patching FineIBT. I think you'll find it
> very broken if you change anything here.
>
> Can you post an actual function preamble from a kernel with
> CONFIG_FINEIBT=y with your changes on?
>
> Ex.
>
> $ objdump -wdr build/kernel/futex/core.o
>
> Disassembly of section .text:
>
> 0000000000000000 <__cfi_futex_hash>:
>        0:       b9 93 0c f9 ad          mov    $0xadf90c93,%ecx
>
> 0000000000000005 <.Ltmp0>:
>        5:       90                      nop
>        6:       90                      nop
>        7:       90                      nop
>        8:       90                      nop
>        9:       90                      nop
>        a:       90                      nop
>        b:       90                      nop
>        c:       90                      nop
>        d:       90                      nop
>        e:       90                      nop
>        f:       90                      nop
>
> 0000000000000010 <futex_hash>:
>       10:       f3 0f 1e fa             endbr64
>       14:       e8 00 00 00 00          call   19 <futex_hash+0x9>      15: R_X86_64_PLT32      __fentry__-0x4
>       19:       8b 47 10                mov    0x10(%rdi),%eax
>
>
> Any change to the layout here *WILL* break the FineIBT code.
>
>
> If you want to test, make sure your build has FINEIBT=y and boot on an
> Intel CPU that has CET-IBT (alderlake and later).

Yeah, I understand now. After looking at the definition of CFI_PRE_PADDING,
I see that things are not as simple as I thought, as the layout can differ
with different configs. I'll dig deeper into this part.

Thanks a lot for clearing that up for me. I'll test CFI_CLANG and
FINEIBT together with this feature later.

Thanks!
Menglong Dong

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c997b27b7da1..b4c2b5566a58 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1536,6 +1536,21 @@  config NODES_SHIFT
 	  Specify the maximum number of NUMA Nodes available on the target
 	  system.  Increases memory reserved to accommodate various tables.
 
+config FUNCTION_METADATA
+	bool "Per-function metadata storage support"
+	default y
+	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE if !FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
+	depends on !CFI_CLANG
+	help
+	  Support per-function metadata storage for kernel functions, and
+	  get the metadata of the function by its address with almost no
+	  overhead.
+
+	  The index of the metadata will be stored in the function padding,
+	  which consumes 4 bytes. If FUNCTION_ALIGNMENT_8B is enabled, an
+	  extra 8 bytes of function padding will be reserved at compile
+	  time. Otherwise, only an extra 4 bytes of function padding is needed.
+
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 358c68565bfd..d5a124c8ded2 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -140,12 +140,31 @@  endif
 
 CHECKFLAGS	+= -D__aarch64__
 
+ifeq ($(CONFIG_FUNCTION_METADATA),y)
+  ifeq ($(CONFIG_FUNCTION_ALIGNMENT_8B),y)
+  __padding_nops := 2
+  else
+  __padding_nops := 1
+  endif
+else
+  __padding_nops := 0
+endif
+
 ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS),y)
+  __padding_nops  := $(shell echo $(__padding_nops) + 2 | bc)
   KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
-  CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2
+  CC_FLAGS_FTRACE := -fpatchable-function-entry=$(shell echo $(__padding_nops) + 2 | bc),$(__padding_nops)
 else ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
+  CC_FLAGS_FTRACE := -fpatchable-function-entry=$(shell echo $(__padding_nops) + 2 | bc),$(__padding_nops)
   KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
-  CC_FLAGS_FTRACE := -fpatchable-function-entry=2
+else ifeq ($(CONFIG_FUNCTION_METADATA),y)
+  CC_FLAGS_FTRACE += -fpatchable-function-entry=$(__padding_nops),$(__padding_nops)
+  ifneq ($(CONFIG_FUNCTION_TRACER),y)
+    KBUILD_CFLAGS += $(CC_FLAGS_FTRACE)
+    # some files need to remove this cflag even when CONFIG_FUNCTION_TRACER
+    # is not enabled, so we need to export it here
+    export CC_FLAGS_FTRACE
+  endif
 endif
 
 ifeq ($(CONFIG_KASAN_SW_TAGS), y)
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index bfe3ce9df197..aa3eaa91bf82 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -24,6 +24,16 @@ 
 #define FTRACE_PLT_IDX		0
 #define NR_FTRACE_PLTS		1
 
+#ifdef CONFIG_FUNCTION_METADATA
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+#define KFUNC_MD_DATA_OFFSET	(AARCH64_INSN_SIZE * 3)
+#else
+#define KFUNC_MD_DATA_OFFSET	AARCH64_INSN_SIZE
+#endif
+#define KFUNC_MD_INSN_SIZE	AARCH64_INSN_SIZE
+#define KFUNC_MD_INSN_OFFSET	KFUNC_MD_DATA_OFFSET
+#endif
+
 /*
  * Currently, gcc tends to save the link register after the local variables
  * on the stack. This causes the max stack tracer to report the function
@@ -216,6 +226,30 @@  static inline bool arch_syscall_match_sym_name(const char *sym,
 	 */
 	return !strcmp(sym + 8, name);
 }
+
+#ifdef CONFIG_FUNCTION_METADATA
+#include <asm/text-patching.h>
+
+static inline bool kfunc_md_arch_exist(void *ip)
+{
+	return !aarch64_insn_is_nop(*(u32 *)(ip - KFUNC_MD_INSN_OFFSET));
+}
+
+static inline void kfunc_md_arch_pretend(u8 *insn, u32 index)
+{
+	*(u32 *)insn = index;
+}
+
+static inline void kfunc_md_arch_nops(u8 *insn)
+{
+	*(u32 *)insn = aarch64_insn_gen_nop();
+}
+
+static inline int kfunc_md_arch_poke(void *ip, u8 *insn)
+{
+	return aarch64_insn_patch_text_nosync(ip, *(u32 *)insn);
+}
+#endif
 #endif /* ifndef __ASSEMBLY__ */
 
 #ifndef __ASSEMBLY__
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index 5a890714ee2e..d829651d895b 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -88,8 +88,10 @@  unsigned long ftrace_call_adjust(unsigned long addr)
 	 * to `BL <caller>`, which is at `addr + 4` bytes in either case.
 	 *
 	 */
-	if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
-		return addr + AARCH64_INSN_SIZE;
+	if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)) {
+		addr += AARCH64_INSN_SIZE;
+		goto out;
+	}
 
 	/*
 	 * When using patchable-function-entry with pre-function NOPs, addr is
@@ -139,6 +141,13 @@  unsigned long ftrace_call_adjust(unsigned long addr)
 
 	/* Skip the first NOP after function entry */
 	addr += AARCH64_INSN_SIZE;
+out:
+	if (IS_ENABLED(CONFIG_FUNCTION_METADATA)) {
+		if (IS_ENABLED(CONFIG_FUNCTION_ALIGNMENT_8B))
+			addr += 2 * AARCH64_INSN_SIZE;
+		else
+			addr += AARCH64_INSN_SIZE;
+	}
 
 	return addr;
 }
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fe2fa3a9a0fd..fa6e9c6f5cd5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2514,6 +2514,21 @@  config PREFIX_SYMBOLS
 	def_bool y
 	depends on CALL_PADDING && !CFI_CLANG
 
+config FUNCTION_METADATA
+	bool "Per-function metadata storage support"
+	default y
+	select CALL_PADDING
+	help
+	  Support per-function metadata storage for kernel functions, and
+	  get the metadata of the function by its address with almost no
+	  overhead.
+
+	  The index of the metadata will be stored in the function padding
+	  and consumes 5 bytes. The spare space in the padding is enough
+	  with CALL_PADDING and FUNCTION_ALIGNMENT_16B if CALL_THUNKS or
+	  CFI_CLANG is not enabled. Otherwise, we need an extra 5 bytes in
+	  the function padding, which increases the text size by ~1%.
+
 menuconfig CPU_MITIGATIONS
 	bool "Mitigations for CPU vulnerabilities"
 	default y
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5b773b34768d..2766c9d755d7 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -240,13 +240,18 @@  ifdef CONFIG_MITIGATION_SLS
 endif
 
 ifdef CONFIG_CALL_PADDING
-PADDING_CFLAGS := -fpatchable-function-entry=$(CONFIG_FUNCTION_PADDING_BYTES),$(CONFIG_FUNCTION_PADDING_BYTES)
-KBUILD_CFLAGS += $(PADDING_CFLAGS)
-export PADDING_CFLAGS
+  __padding_nops := $(CONFIG_FUNCTION_PADDING_BYTES)
+  ifneq ($(and $(CONFIG_FUNCTION_METADATA),$(CONFIG_CALL_THUNKS),$(CONFIG_CFI_CLANG)),)
+    __padding_nops := $(shell echo $(__padding_nops) + 5 | bc)
+  endif
+
+  PADDING_CFLAGS := -fpatchable-function-entry=$(__padding_nops),$(__padding_nops)
+  KBUILD_CFLAGS += $(PADDING_CFLAGS)
+  export PADDING_CFLAGS
 
-PADDING_RUSTFLAGS := -Zpatchable-function-entry=$(CONFIG_FUNCTION_PADDING_BYTES),$(CONFIG_FUNCTION_PADDING_BYTES)
-KBUILD_RUSTFLAGS += $(PADDING_RUSTFLAGS)
-export PADDING_RUSTFLAGS
+  PADDING_RUSTFLAGS := -Zpatchable-function-entry=$(__padding_nops),$(__padding_nops)
+  KBUILD_RUSTFLAGS += $(PADDING_RUSTFLAGS)
+  export PADDING_RUSTFLAGS
 endif
 
 KBUILD_LDFLAGS += -m elf_$(UTS_MACHINE)
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index f9cb4d07df58..cf96c990f0c1 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -4,6 +4,26 @@ 
 
 #include <asm/ptrace.h>
 
+#ifdef CONFIG_FUNCTION_METADATA
+#ifdef CONFIG_CFI_CLANG
+  #ifdef CONFIG_CALL_THUNKS
+    /* use the extra 5-bytes that we reserve */
+    #define KFUNC_MD_INSN_OFFSET	(CONFIG_FUNCTION_PADDING_BYTES + 5)
+    #define KFUNC_MD_DATA_OFFSET	(CONFIG_FUNCTION_PADDING_BYTES + 4)
+  #else
+    /* use the space that CALL_THUNKS suppose to use */
+    #define KFUNC_MD_INSN_OFFSET	(5)
+    #define KFUNC_MD_DATA_OFFSET	(4)
+  #endif
+#else
+  /* use the space that CFI_CLANG suppose to use */
+  #define KFUNC_MD_INSN_OFFSET		(CONFIG_FUNCTION_PADDING_BYTES)
+  #define KFUNC_MD_DATA_OFFSET		(CONFIG_FUNCTION_PADDING_BYTES - 1)
+#endif
+
+#define KFUNC_MD_INSN_SIZE		(5)
+#endif
+
 #ifdef CONFIG_FUNCTION_TRACER
 #ifndef CC_USING_FENTRY
 # error Compiler does not support fentry?
@@ -168,4 +188,36 @@  static inline bool arch_trace_is_compat_syscall(struct pt_regs *regs)
 #endif /* !COMPILE_OFFSETS */
 #endif /* !__ASSEMBLY__ */
 
+#if !defined(__ASSEMBLY__) && defined(CONFIG_FUNCTION_METADATA)
+#include <asm/text-patching.h>
+
+static inline bool kfunc_md_arch_exist(void *ip)
+{
+	return *(u8 *)(ip - KFUNC_MD_INSN_OFFSET) == 0xB8;
+}
+
+static inline void kfunc_md_arch_pretend(u8 *insn, u32 index)
+{
+	*insn = 0xB8;
+	*(u32 *)(insn + 1) = index;
+}
+
+static inline void kfunc_md_arch_nops(u8 *insn)
+{
+	*(insn++) = BYTES_NOP1;
+	*(insn++) = BYTES_NOP1;
+	*(insn++) = BYTES_NOP1;
+	*(insn++) = BYTES_NOP1;
+	*(insn++) = BYTES_NOP1;
+}
+
+static inline int kfunc_md_arch_poke(void *ip, u8 *insn)
+{
+	text_poke(ip, insn, KFUNC_MD_INSN_SIZE);
+	text_poke_sync();
+	return 0;
+}
+
+#endif
+
 #endif /* _ASM_X86_FTRACE_H */
diff --git a/include/linux/kfunc_md.h b/include/linux/kfunc_md.h
new file mode 100644
index 000000000000..df616f0fcb36
--- /dev/null
+++ b/include/linux/kfunc_md.h
@@ -0,0 +1,25 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KFUNC_MD_H
+#define _LINUX_KFUNC_MD_H
+
+#include <linux/kernel.h>
+
+struct kfunc_md {
+	int users;
+	/* reserved for future use; it also keeps the struct 8-byte
+	 * aligned for now.
+	 */
+	int pad0;
+	void *func;
+};
+
+extern struct kfunc_md *kfunc_mds;
+
+struct kfunc_md *kfunc_md_find(void *ip);
+struct kfunc_md *kfunc_md_get(void *ip);
+void kfunc_md_put(struct kfunc_md *meta);
+void kfunc_md_put_by_ip(void *ip);
+void kfunc_md_lock(void);
+void kfunc_md_unlock(void);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index cef5377c25cd..79d63f0f2496 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,6 +109,7 @@  obj-$(CONFIG_TRACE_CLOCK) += trace/
 obj-$(CONFIG_RING_BUFFER) += trace/
 obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_RETHOOK) += trace/
+obj-$(CONFIG_FUNCTION_METADATA) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 057cd975d014..9780ee3f8d8d 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -106,6 +106,7 @@  obj-$(CONFIG_FTRACE_RECORD_RECURSION) += trace_recursion_record.o
 obj-$(CONFIG_FPROBE) += fprobe.o
 obj-$(CONFIG_RETHOOK) += rethook.o
 obj-$(CONFIG_FPROBE_EVENTS) += trace_fprobe.o
+obj-$(CONFIG_FUNCTION_METADATA) += kfunc_md.o
 
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
 obj-$(CONFIG_RV) += rv/
diff --git a/kernel/trace/kfunc_md.c b/kernel/trace/kfunc_md.c
new file mode 100644
index 000000000000..362fa78f13c8
--- /dev/null
+++ b/kernel/trace/kfunc_md.c
@@ -0,0 +1,239 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/slab.h>
+#include <linux/memory.h>
+#include <linux/rcupdate.h>
+#include <linux/ftrace.h>
+#include <linux/kfunc_md.h>
+
+#define ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(struct kfunc_md))
+
+static u32 kfunc_md_count = ENTRIES_PER_PAGE, kfunc_md_used;
+struct kfunc_md __rcu *kfunc_mds;
+EXPORT_SYMBOL_GPL(kfunc_mds);
+
+static DEFINE_MUTEX(kfunc_md_mutex);
+
+
+void kfunc_md_unlock(void)
+{
+	mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_unlock);
+
+void kfunc_md_lock(void)
+{
+	mutex_lock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_lock);
+
+static u32 kfunc_md_get_index(void *ip)
+{
+	return *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);
+}
+
+static void kfunc_md_init(struct kfunc_md *mds, u32 start, u32 end)
+{
+	u32 i;
+
+	for (i = start; i < end; i++)
+		mds[i].users = 0;
+}
+
+static int kfunc_md_page_order(void)
+{
+	return fls(DIV_ROUND_UP(kfunc_md_count, ENTRIES_PER_PAGE)) - 1;
+}
+
+/* Get next usable function metadata. On success, return the usable
+ * kfunc_md and store the index of it to *index. If no usable kfunc_md is
+ * found in kfunc_mds, a larger array will be allocated.
+ */
+static struct kfunc_md *kfunc_md_get_next(u32 *index)
+{
+	struct kfunc_md *new_mds, *mds;
+	u32 i, order;
+
+	mds = rcu_dereference(kfunc_mds);
+	if (mds == NULL) {
+		order = kfunc_md_page_order();
+		new_mds = (void *)__get_free_pages(GFP_KERNEL, order);
+		if (!new_mds)
+			return NULL;
+		kfunc_md_init(new_mds, 0, kfunc_md_count);
+		/* The first time to initialize kfunc_mds, so it is not
+		 * used anywhere yet, and we can update it directly.
+		 */
+		rcu_assign_pointer(kfunc_mds, new_mds);
+		mds = new_mds;
+	}
+
+	if (likely(kfunc_md_used < kfunc_md_count)) {
+		/* maybe we can manage the used function metadata entry
+		 * with a bit map ?
+		 */
+		for (i = 0; i < kfunc_md_count; i++) {
+			if (!mds[i].users) {
+				kfunc_md_used++;
+				*index = i;
+				mds[i].users++;
+				return mds + i;
+			}
+		}
+	}
+
+	order = kfunc_md_page_order();
+	/* no available function metadata, so allocate a bigger function
+	 * metadata array.
+	 */
+	new_mds = (void *)__get_free_pages(GFP_KERNEL, order + 1);
+	if (!new_mds)
+		return NULL;
+
+	memcpy(new_mds, mds, kfunc_md_count * sizeof(*new_mds));
+	kfunc_md_init(new_mds, kfunc_md_count, kfunc_md_count * 2);
+
+	rcu_assign_pointer(kfunc_mds, new_mds);
+	synchronize_rcu();
+	free_pages((u64)mds, order);
+
+	mds = new_mds + kfunc_md_count;
+	*index = kfunc_md_count;
+	kfunc_md_count <<= 1;
+	kfunc_md_used++;
+	mds->users++;
+
+	return mds;
+}
+
+static int kfunc_md_text_poke(void *ip, void *insn, void *nop)
+{
+	void *target;
+	int ret = 0;
+	u8 *prog;
+
+	target = ip - KFUNC_MD_INSN_OFFSET;
+	mutex_lock(&text_mutex);
+	if (insn) {
+		if (!memcmp(target, insn, KFUNC_MD_INSN_SIZE))
+			goto out;
+
+		if (memcmp(target, nop, KFUNC_MD_INSN_SIZE)) {
+			ret = -EBUSY;
+			goto out;
+		}
+		prog = insn;
+	} else {
+		if (!memcmp(target, nop, KFUNC_MD_INSN_SIZE))
+			goto out;
+		prog = nop;
+	}
+
+	ret = kfunc_md_arch_poke(target, prog);
+out:
+	mutex_unlock(&text_mutex);
+	return ret;
+}
+
+static bool __kfunc_md_put(struct kfunc_md *md)
+{
+	u8 nop_insn[KFUNC_MD_INSN_SIZE];
+
+	if (WARN_ON_ONCE(md->users <= 0))
+		return false;
+
+	md->users--;
+	if (md->users > 0)
+		return false;
+
+	if (!kfunc_md_arch_exist(md->func))
+		return false;
+
+	kfunc_md_arch_nops(nop_insn);
+	/* release the metadata by recovering the function padding to NOPS */
+	kfunc_md_text_poke(md->func, NULL, nop_insn);
+	/* TODO: we need a way to shrink the array "kfunc_mds" */
+	kfunc_md_used--;
+
+	return true;
+}
+
+/* Decrease the reference of the md, release it if "md->users <= 0" */
+void kfunc_md_put(struct kfunc_md *md)
+{
+	mutex_lock(&kfunc_md_mutex);
+	__kfunc_md_put(md);
+	mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_put);
+
+/* Get an existing metadata entry by the function address; NULL will be
+ * returned if it does not exist.
+ *
+ * NOTE: rcu lock should be held during reading the metadata, and
+ * kfunc_md_lock should be held if writing happens.
+ */
+struct kfunc_md *kfunc_md_find(void *ip)
+{
+	struct kfunc_md *md;
+	u32 index;
+
+	if (kfunc_md_arch_exist(ip)) {
+		index = kfunc_md_get_index(ip);
+		if (WARN_ON_ONCE(index >= kfunc_md_count))
+			return NULL;
+
+		md = &kfunc_mds[index];
+		return md;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(kfunc_md_find);
+
+void kfunc_md_put_by_ip(void *ip)
+{
+	struct kfunc_md *md;
+
+	mutex_lock(&kfunc_md_mutex);
+	md = kfunc_md_find(ip);
+	if (md)
+		__kfunc_md_put(md);
+	mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_put_by_ip);
+
+/* Get an existing metadata entry by the function address, or create one
+ * if it does not exist. The reference count of the metadata is increased by 1.
+ *
+ * NOTE: always call this function with kfunc_md_lock held, and all
+ * updating to metadata should also hold the kfunc_md_lock.
+ */
+struct kfunc_md *kfunc_md_get(void *ip)
+{
+	u8 nop_insn[KFUNC_MD_INSN_SIZE], insn[KFUNC_MD_INSN_SIZE];
+	struct kfunc_md *md;
+	u32 index;
+
+	md = kfunc_md_find(ip);
+	if (md) {
+		md->users++;
+		return md;
+	}
+
+	md = kfunc_md_get_next(&index);
+	if (!md)
+		return NULL;
+
+	kfunc_md_arch_pretend(insn, index);
+	kfunc_md_arch_nops(nop_insn);
+
+	if (kfunc_md_text_poke(ip, insn, nop_insn)) {
+		kfunc_md_used--;
+		md->users = 0;
+		return NULL;
+	}
+	md->func = ip;
+
+	return md;
+}
+EXPORT_SYMBOL_GPL(kfunc_md_get);