diff mbox series

[RFC,bpf-next,v2,2/4] bpf, x64: Fix tailcall hierarchy

Message ID 20231011152725.95895-3-hffilwlqm@gmail.com (mailing list archive)
State RFC
Delegated to: BPF
Headers show
Series bpf, x64: Fix tailcall hierarchy | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-0 success Logs for ShellCheck
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 9 this patch: 9
netdev/cc_maintainers warning 17 maintainers not CCed: song@kernel.org mingo@redhat.com netdev@vger.kernel.org yonghong.song@linux.dev jolsa@kernel.org tglx@linutronix.de x86@kernel.org kpsingh@kernel.org dsahern@kernel.org john.fastabend@gmail.com bp@alien8.de davem@davemloft.net dave.hansen@linux.intel.com sdf@google.com hpa@zytor.com haoluo@google.com martin.lau@linux.dev
netdev/build_clang success Errors and warnings before: 1364 this patch: 1364
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1365 this patch: 1365
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Leon Hwang Oct. 11, 2023, 3:27 p.m. UTC
From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
handling in JIT"), the tailcall on x64 works better than before.

From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
for x64 JIT"), tailcall is able to run in BPF subprograms on x64.

How about:

1. More than 1 subprograms are called in a bpf program.
2. The tailcalls in the subprograms call the bpf program.

Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.

As we know, in tail call context, the tail_call_cnt propagates by stack
and rax register between BPF subprograms. So, propagating tail_call_cnt
pointer by stack and rax register makes tail_call_cnt as like a global
variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
hierarchy cases.

Before jumping to other bpf prog, load tail_call_cnt from the pointer
and then compare with MAX_TAIL_CALL_CNT. Finally, increment
tail_call_cnt by its pointer.

But, where does tail_call_cnt store?

It stores on the stack of bpf prog's caller, like

    |  STACK  |
    |         |
    |   rip   |
 +->|   tcc   |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |         |
 |  |         |
 |  |         |
 +--| tcc_ptr |
    |   rbx   |
    +---------+ RSP

And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
instruction on BPF JIT epilogue").

Why not back-propagate tail_call_cnt?

It's because it's vulnerable to back-propagate it. It's unable to work
well with the following case.

int prog1();
int prog2();

prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
same time? The answer is NO. Otherwise, there will be a register to be
polluted, which will make kernel crash.

Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 16 deletions(-)

Comments

Fijalkowski, Maciej Dec. 5, 2023, 11:03 p.m. UTC | #1
On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
> handling in JIT"), the tailcall on x64 works better than before.
> 
> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
> 
> How about:
> 
> 1. More than 1 subprograms are called in a bpf program.
> 2. The tailcalls in the subprograms call the bpf program.
> 
> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
> 
> As we know, in tail call context, the tail_call_cnt propagates by stack
> and rax register between BPF subprograms. So, propagating tail_call_cnt
> pointer by stack and rax register makes tail_call_cnt as like a global
> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
> hierarchy cases.
> 
> Before jumping to other bpf prog, load tail_call_cnt from the pointer
> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
> tail_call_cnt by its pointer.
> 
> But, where does tail_call_cnt store?
> 
> It stores on the stack of bpf prog's caller, like
> 
>     |  STACK  |
>     |         |
>     |   rip   |
>  +->|   tcc   |
>  |  |   rip   |
>  |  |   rbp   |
>  |  +---------+ RBP
>  |  |         |
>  |  |         |
>  |  |         |
>  +--| tcc_ptr |
>     |   rbx   |
>     +---------+ RSP
> 
> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
> instruction on BPF JIT epilogue").
> 
> Why not back-propagate tail_call_cnt?
> 
> It's because it's vulnerable to back-propagate it. It's unable to work
> well with the following case.
> 
> int prog1();
> int prog2();
> 
> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
> same time? The answer is NO. Otherwise, there will be a register to be
> polluted, which will make kernel crash.

Sorry but I keep on reading this explanation and I'm lost what is being
fixed here.

You want to limit the total amount of tail calls that entry prog can do to
MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is
to protect us from overflowing kernel stack and endless loops. As long a
single call chain doesn't go over 8kB program is fine. Verifier has a
limit of 256 subprogs from what I see.

Can you elaborate a bit more about the kernel crash you mention in the
last paragraph?

I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has
changed in the meantime to 33 and we should adjust the max allowed stack
depth of subprogs? I believe this was brought up at LPC?

> 
> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
>  1 file changed, 24 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index c2a0465d37da4..36631129cc800 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -256,7 +256,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE		5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
>  
>  static void push_r12(u8 **pprog)
>  {
> @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	EMIT_ENDBR();
>  	emit_nops(&prog, X86_PATCH_SIZE);
>  	if (!ebpf_from_cbpf) {
> -		if (tail_call_reachable && !is_subprog)
> +		if (tail_call_reachable && !is_subprog) {
>  			/* When it's the entry of the whole tailcall context,
>  			 * zeroing rax means initialising tail_call_cnt.
>  			 */
> -			EMIT2(0x31, 0xC0); /* xor eax, eax */
> -		else
> -			/* Keep the same instruction layout. */
> -			EMIT2(0x66, 0x90); /* nop2 */
> +			EMIT2(0x31, 0xC0);       /* xor eax, eax */
> +			EMIT1(0x50);             /* push rax */
> +			/* Make rax as ptr that points to tail_call_cnt. */
> +			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> +			EMIT1_off32(0xE8, 2);    /* call main prog */
> +			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> +			EMIT1(0xC3);             /* ret */
> +		} else {
> +			/* Keep the same instruction size. */
> +			emit_nops(&prog, 13);
> +		}
>  	}
>  	/* Exception callback receives FP as third parameter */
>  	if (is_exception_cb) {
> @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>  	if (tail_call_reachable)
> +		/* Here, rax is tail_call_cnt_ptr. */
>  		EMIT1(0x50);         /* push rax */
>  	*pprog = prog;
>  }
> @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  					u32 stack_depth, u8 *ip,
>  					struct jit_context *ctx)
>  {
> -	int tcc_off = -4 - round_up(stack_depth, 8);
> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
>  
> @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>  
>  	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>  	EMIT2(X86_JAE, offset);                   /* jae out */
> -	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
> +	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
>  
>  	/* prog = array->ptrs[index]; */
>  	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  		pop_callee_regs(&prog, callee_regs_used);
>  	}
>  
> +	/* pop tail_call_cnt_ptr */
>  	EMIT1(0x58);                              /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
> @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  				      bool *callee_regs_used, u32 stack_depth,
>  				      struct jit_context *ctx)
>  {
> -	int tcc_off = -4 - round_up(stack_depth, 8);
> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
>  
> @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>  
>  	offset = ctx->tail_call_direct_label - (prog + 2 - start);
>  	EMIT2(X86_JAE, offset);                       /* jae out */
> -	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
> +	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
>  
>  	poke->tailcall_bypass = ip + (prog - start);
>  	poke->adj_off = X86_TAIL_CALL_OFFSET;
> @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  		pop_callee_regs(&prog, callee_regs_used);
>  	}
>  
> +	/* pop tail_call_cnt_ptr */
>  	EMIT1(0x58);                                  /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
> -- 
> 2.41.0
> 
>
Leon Hwang Dec. 6, 2023, 6:51 a.m. UTC | #2
On 6/12/23 07:03, Maciej Fijalkowski wrote:
> On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
>> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
>> handling in JIT"), the tailcall on x64 works better than before.
>>
>> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
>> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
>>
>> How about:
>>
>> 1. More than 1 subprograms are called in a bpf program.
>> 2. The tailcalls in the subprograms call the bpf program.
>>
>> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
>> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
>>
>> As we know, in tail call context, the tail_call_cnt propagates by stack
>> and rax register between BPF subprograms. So, propagating tail_call_cnt
>> pointer by stack and rax register makes tail_call_cnt as like a global
>> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
>> hierarchy cases.
>>
>> Before jumping to other bpf prog, load tail_call_cnt from the pointer
>> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
>> tail_call_cnt by its pointer.
>>
>> But, where does tail_call_cnt store?
>>
>> It stores on the stack of bpf prog's caller, like
>>
>>     |  STACK  |
>>     |         |
>>     |   rip   |
>>  +->|   tcc   |
>>  |  |   rip   |
>>  |  |   rbp   |
>>  |  +---------+ RBP
>>  |  |         |
>>  |  |         |
>>  |  |         |
>>  +--| tcc_ptr |
>>     |   rbx   |
>>     +---------+ RSP
>>
>> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
>> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
>> instruction on BPF JIT epilogue").
>>
>> Why not back-propagate tail_call_cnt?
>>
>> It's because it's vulnerable to back-propagate it. It's unable to work
>> well with the following case.
>>
>> int prog1();
>> int prog2();
>>
>> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
>> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
>> same time? The answer is NO. Otherwise, there will be a register to be
>> polluted, which will make kernel crash.
> 
> Sorry but I keep on reading this explanation and I'm lost what is being
> fixed here> 
> You want to limit the total amount of tail calls that entry prog can do to
> MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
> rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is
> to protect us from overflowing kernel stack and endless loops. As long a
> single call chain doesn't go over 8kB program is fine. Verifier has a
> limit of 256 subprogs from what I see.

I try to resolve the following-like cases here.

          +--------- tailcall --+
          |                     |
          V       --> subprog1 -+
 entry bpf prog <
          A       --> subprog2 -+
          |                     |
          +--------- tailcall --+

Without this fixing, the CPU will be stalled because of too many tailcalls.

> 
> Can you elaborate a bit more about the kernel crash you mention in the
> last paragraph?

We have progs, prog1, prog2, prog3 and prog4, and the scenario:

case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3
case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3
case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3
                         --> subprog4 --tailcall-> prog4

How does prog2 back-propagate tail_call_cnt to prog1?

Possible way 1:
When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to
back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do
not handle the value in RCX register, like case2.

Possible way 2:
Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0
at epilogue when current prog has tailcall?
No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt,
because prog3 has no tailcall and won't populate RCX register with tail_call_cnt.

However, I don't like the back-propagating way. Then, I "burn" my brain to figure
out pointer propagating ways.

RFC PATCH v1 way:
Propagate tail_call_cnt and its pointer together. Then, the pointer is used to
check MAX_TAIL_CALL_CNT and increment tail_call_cnt.

    |  STACK  |
    +---------+ RBP
    |         |
    |         |
    |         |
 +--| tcc_ptr |
 +->|   tcc   |
    |   rbx   |
    +---------+ RSP

RFC PATCH v2 way (current patchset):
Propagate tail_call_cnt pointer only. Then, the pointer is used to check
MAX_TAIL_CALL_CNT and increment tail_call_cnt.

    |  STACK  |
    |         |
    |   rip   |
 +->|   tcc   |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |         |
 |  |         |
 |  |         |
 +--| tcc_ptr |
    |   rbx   |
    +---------+ RSP

> 
> I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has
> changed in the meantime to 33 and we should adjust the max allowed stack
> depth of subprogs? I believe this was brought up at LPC?

There's following code snippet in verifier.c:

	/* protect against potential stack overflow that might happen when
	 * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
	 * depth for such case down to 256 so that the worst case scenario
	 * would result in 8k stack size (32 which is tailcall limit * 256 =
	 * 8k).

But, MAX_TAIL_CALL_CNT is 33.

This was not brought up at LPC 2022&2023. I don't know whether this was
brought up at previous LPCs.

Thanks,
Leon
Fijalkowski, Maciej Dec. 11, 2023, 6:02 p.m. UTC | #3
On Wed, Dec 06, 2023 at 02:51:28PM +0800, Leon Hwang wrote:
> 
> 
> On 6/12/23 07:03, Maciej Fijalkowski wrote:
> > On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
> >> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
> >> handling in JIT"), the tailcall on x64 works better than before.
> >>
> >> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
> >> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
> >>
> >> How about:
> >>
> >> 1. More than 1 subprograms are called in a bpf program.
> >> 2. The tailcalls in the subprograms call the bpf program.
> >>
> >> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
> >> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
> >>
> >> As we know, in tail call context, the tail_call_cnt propagates by stack
> >> and rax register between BPF subprograms. So, propagating tail_call_cnt
> >> pointer by stack and rax register makes tail_call_cnt as like a global
> >> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
> >> hierarchy cases.
> >>
> >> Before jumping to other bpf prog, load tail_call_cnt from the pointer
> >> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
> >> tail_call_cnt by its pointer.
> >>
> >> But, where does tail_call_cnt store?
> >>
> >> It stores on the stack of bpf prog's caller, like
> >>
> >>     |  STACK  |
> >>     |         |
> >>     |   rip   |
> >>  +->|   tcc   |
> >>  |  |   rip   |
> >>  |  |   rbp   |
> >>  |  +---------+ RBP
> >>  |  |         |
> >>  |  |         |
> >>  |  |         |
> >>  +--| tcc_ptr |
> >>     |   rbx   |
> >>     +---------+ RSP
> >>
> >> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
> >> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
> >> instruction on BPF JIT epilogue").
> >>
> >> Why not back-propagate tail_call_cnt?
> >>
> >> It's because it's vulnerable to back-propagate it. It's unable to work
> >> well with the following case.
> >>
> >> int prog1();
> >> int prog2();
> >>
> >> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
> >> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
> >> same time? The answer is NO. Otherwise, there will be a register to be
> >> polluted, which will make kernel crash.
> > 
> > Sorry but I keep on reading this explanation and I'm lost what is being
> > fixed here> 
> > You want to limit the total amount of tail calls that entry prog can do to
> > MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
> > rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is
> > to protect us from overflowing kernel stack and endless loops. As long a
> > single call chain doesn't go over 8kB program is fine. Verifier has a
> > limit of 256 subprogs from what I see.
> 
> I try to resolve the following-like cases here.
> 
>           +--------- tailcall --+
>           |                     |
>           V       --> subprog1 -+
>  entry bpf prog <
>           A       --> subprog2 -+
>           |                     |
>           +--------- tailcall --+
> 
> Without this fixing, the CPU will be stalled because of too many tailcalls.

You still didn't explain why CPU might stall :/ if you stumble upon any
kernel splats it's good habit to include them. So, for future readers of
this thread, I reduced one of your examples down to:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_legacy.h"

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 1);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
        bpf_tail_call_static(skb, &jmp_table, 0);
        return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
        subprog_tail(skb);
        return 0;
}

char __license[] SEC("license") = "GPL";

and indeed CPU got stuck:

[10623.892978] RIP: 0010:bpf_prog_6abcef0aca85f2fd_subprog_tail+0x49/0x59
[10623.899606] Code: 39 56 24 76 30 8b 85 fc ff ff ff 83 f8 21 73 25 83 c0 01 89 85 fc ff ff ff 48 8b 8c d6 10 01 00 00 48 85 c9 74 0f 41 5d 5b 58 <48> 8b 49 30 48 83 c1 0b ff e1 cc 41 5d 5b c9 c3 cc cc cc cc cc cc
[10623.918646] RSP: 0018:ffffc900202b78e0 EFLAGS: 00000286
[10623.923952] RAX: 0000002100000000 RBX: ffff8881088aa100 RCX: ffffc9000d319000
[10623.931194] RDX: 0000000000000000 RSI: ffff88811e6d2e00 RDI: ffff8881088aa100
[10623.938439] RBP: ffffc900202b78e0 R08: ffffc900202b7dd4 R09: 0000000000000000
[10623.945686] R10: 736391e000000000 R11: 0000000000000000 R12: 0000000000000001
[10623.952927] R13: ffffc9000d38d048 R14: ffffc900202b7dd4 R15: ffffffff81ae756f
[10623.960174]  ? bpf_test_run+0xef/0x320
[10623.963988]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10623.969826]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10623.975662]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10623.981500]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10623.987334]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10623.993174]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10623.999016]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.004854]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.010688]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.016529]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.022371]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.028209]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.034043]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.043211]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.052344]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.061453]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.070404]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.079200]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.087841]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.096333]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.104680]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.112886]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.121014]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.129048]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.136949]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.144716]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.152479]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.160209]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.167905]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.175565]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
[10624.183173]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.190738]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
[10624.198262]  bpf_test_run+0x186/0x320
[10624.203677]  ? kmem_cache_alloc+0x14d/0x2c0
[10624.209615]  bpf_prog_test_run_skb+0x35c/0x710
[10624.215810]  __sys_bpf+0xf3a/0x2d80
[10624.221068]  ? do_mmap+0x409/0x5e0
[10624.226217]  __x64_sys_bpf+0x1a/0x30
[10624.231693]  do_syscall_64+0x2e/0xa0
[10624.236988]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[10624.243550] RIP: 0033:0x7fd76171e69d
[10624.248576] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 63 a7 0f 00 f7 d8 64 89 01 48
[10624.270613] RSP: 002b:00007ffe694f8d68 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[10624.279894] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd76171e69d
[10624.288769] RDX: 0000000000000050 RSI: 00007ffe694f8db0 RDI: 000000000000000a
[10624.297634] RBP: 00007ffe694f8d80 R08: 00007ffe694f8d37 R09: 00007ffe694f8db0
[10624.306455] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe694f9348
[10624.315260] R13: 000055bb8423f6d9 R14: 000055bb8496ac90 R15: 00007fd761925040

Looking at assembly code:

entry:
ffffffffc0012c60 <load3+0x12c60>:
ffffffffc0012c60:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0012c65:       31 c0                   xor    %eax,%eax
ffffffffc0012c67:       55                      push   %rbp
ffffffffc0012c68:       48 89 e5                mov    %rsp,%rbp
ffffffffc0012c6b:       50                      push   %rax
ffffffffc0012c6c:       48 8b 85 f8 ff ff ff    mov    -0x8(%rbp),%rax
ffffffffc0012c73:       e8 30 00 00 00          call   0xffffffffc0012ca8
ffffffffc0012c78:       31 c0                   xor    %eax,%eax
ffffffffc0012c7a:       c9                      leave
ffffffffc0012c7b:       c3                      ret

subprog_tail:
ffffffffc0012ca8 <load3+0x12ca8>:
ffffffffc0012ca8:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0012cad:       66 90                   xchg   %ax,%ax
ffffffffc0012caf:       55                      push   %rbp
ffffffffc0012cb0:       48 89 e5                mov    %rsp,%rbp
ffffffffc0012cb3:       50                      push   %rax
ffffffffc0012cb4:       53                      push   %rbx
ffffffffc0012cb5:       41 55                   push   %r13
ffffffffc0012cb7:       48 89 fb                mov    %rdi,%rbx
ffffffffc0012cba:       49 bd 00 7c 73 0c 81    movabs $0xffff88810c737c00,%r13
ffffffffc0012cc1:       88 ff ff
ffffffffc0012cc4:       48 89 df                mov    %rbx,%rdi
ffffffffc0012cc7:       4c 89 ee                mov    %r13,%rsi
ffffffffc0012cca:       31 d2                   xor    %edx,%edx
ffffffffc0012ccc:       8b 85 fc ff ff ff       mov    -0x4(%rbp),%eax
ffffffffc0012cd2:       83 f8 21                cmp    $0x21,%eax
ffffffffc0012cd5:       73 17                   jae    0xffffffffc0012cee
ffffffffc0012cd7:       83 c0 01                add    $0x1,%eax
ffffffffc0012cda:       89 85 fc ff ff ff       mov    %eax,-0x4(%rbp)
ffffffffc0012ce0:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0012ce5:       41 5d                   pop    %r13
ffffffffc0012ce7:       5b                      pop    %rbx
ffffffffc0012ce8:       58                      pop    %rax
ffffffffc0012ce9:       e9 7d ff ff ff          jmp    0xffffffffc0012c6b
ffffffffc0012cee:       41 5d                   pop    %r13
ffffffffc0012cf0:       5b                      pop    %rbx
ffffffffc0012cf1:       c9                      leave
ffffffffc0012cf2:       c3                      ret

I am trying to understand what are writing to %rax when entry is target of
tailcall. rbp comes from subprog_tail and we popped all of the previously
pushed regs. So what is exactly causing this hang?

(I thought I'd rather respond just to let you know I'm on it)

> 
> > 
> > Can you elaborate a bit more about the kernel crash you mention in the
> > last paragraph?
> 
> We have progs, prog1, prog2, prog3 and prog4, and the scenario:
> 
> case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3
> case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3
> case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3
>                          --> subprog4 --tailcall-> prog4
> 
> How does prog2 back-propagate tail_call_cnt to prog1?
> 
> Possible way 1:
> When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to
> back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do
> not handle the value in RCX register, like case2.
> 
> Possible way 2:
> Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0
> at epilogue when current prog has tailcall?
> No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt,
> because prog3 has no tailcall and won't populate RCX register with tail_call_cnt.

I don't get it. Let's start with small example and do the baby steps.

> 
> However, I don't like the back-propagating way. Then, I "burn" my brain to figure
> out pointer propagating ways.

I know that code can damage programmer's brain, that's why we need to
strive for elaborative and easy to understand commit messages so that
later on we will be able to pick up what is going on here.

> 
> RFC PATCH v1 way:
> Propagate tail_call_cnt and its pointer together. Then, the pointer is used to
> check MAX_TAIL_CALL_CNT and increment tail_call_cnt.
> 
>     |  STACK  |
>     +---------+ RBP
>     |         |
>     |         |
>     |         |
>  +--| tcc_ptr |
>  +->|   tcc   |
>     |   rbx   |
>     +---------+ RSP
> 
> RFC PATCH v2 way (current patchset):
> Propagate tail_call_cnt pointer only. Then, the pointer is used to check
> MAX_TAIL_CALL_CNT and increment tail_call_cnt.
> 
>     |  STACK  |
>     |         |
>     |   rip   |
>  +->|   tcc   |
>  |  |   rip   |
>  |  |   rbp   |
>  |  +---------+ RBP
>  |  |         |
>  |  |         |
>  |  |         |
>  +--| tcc_ptr |
>     |   rbx   |
>     +---------+ RSP
> 
> > 
> > I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has
> > changed in the meantime to 33 and we should adjust the max allowed stack
> > depth of subprogs? I believe this was brought up at LPC?
> 
> There's following code snippet in verifier.c:
> 
> 	/* protect against potential stack overflow that might happen when
> 	 * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
> 	 * depth for such case down to 256 so that the worst case scenario
> 	 * would result in 8k stack size (32 which is tailcall limit * 256 =
> 	 * 8k).
> 
> But, MAX_TAIL_CALL_CNT is 33.
> 
> This was not brought up at LPC 2022&2023. I don't know whether this was
> brought up at previous LPCs.

Was referring to this:
https://lpc.events/event/17/contributions/1595/attachments/1230/2506/LPC2023_slides.pdf

but let us focus on issue you're bringing up here in this patch. I brought
this up thinking JIT code was fine, now it's clear that things go south
when entry prog is tailcall target, which we didn't test.

> 
> Thanks,
> Leon
>
Leon Hwang Dec. 13, 2023, 2:48 a.m. UTC | #4
On 12/12/23 02:02, Maciej Fijalkowski wrote:
> On Wed, Dec 06, 2023 at 02:51:28PM +0800, Leon Hwang wrote:
>>
>>
>> On 6/12/23 07:03, Maciej Fijalkowski wrote:
>>> On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
>>>> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
>>>> handling in JIT"), the tailcall on x64 works better than before.
>>>>
>>>> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
>>>> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
>>>>
>>>> How about:
>>>>
>>>> 1. More than 1 subprograms are called in a bpf program.
>>>> 2. The tailcalls in the subprograms call the bpf program.
>>>>
>>>> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
>>>> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
>>>>
>>>> As we know, in tail call context, the tail_call_cnt propagates by stack
>>>> and rax register between BPF subprograms. So, propagating tail_call_cnt
>>>> pointer by stack and rax register makes tail_call_cnt as like a global
>>>> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
>>>> hierarchy cases.
>>>>
>>>> Before jumping to other bpf prog, load tail_call_cnt from the pointer
>>>> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
>>>> tail_call_cnt by its pointer.
>>>>
>>>> But, where does tail_call_cnt store?
>>>>
>>>> It stores on the stack of bpf prog's caller, like
>>>>
>>>>     |  STACK  |
>>>>     |         |
>>>>     |   rip   |
>>>>  +->|   tcc   |
>>>>  |  |   rip   |
>>>>  |  |   rbp   |
>>>>  |  +---------+ RBP
>>>>  |  |         |
>>>>  |  |         |
>>>>  |  |         |
>>>>  +--| tcc_ptr |
>>>>     |   rbx   |
>>>>     +---------+ RSP
>>>>
>>>> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
>>>> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
>>>> instruction on BPF JIT epilogue").
>>>>
>>>> Why not back-propagate tail_call_cnt?
>>>>
>>>> It's because it's vulnerable to back-propagate it. It's unable to work
>>>> well with the following case.
>>>>
>>>> int prog1();
>>>> int prog2();
>>>>
>>>> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
>>>> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
>>>> same time? The answer is NO. Otherwise, there will be a register to be
>>>> polluted, which will make kernel crash.
>>>
>>> Sorry but I keep on reading this explanation and I'm lost what is being
>>> fixed here> 
>>> You want to limit the total amount of tail calls that entry prog can do to
>>> MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is
>>> rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is
>>> to protect us from overflowing kernel stack and endless loops. As long a
>>> single call chain doesn't go over 8kB program is fine. Verifier has a
>>> limit of 256 subprogs from what I see.
>>
>> I try to resolve the following-like cases here.
>>
>>           +--------- tailcall --+
>>           |                     |
>>           V       --> subprog1 -+
>>  entry bpf prog <
>>           A       --> subprog2 -+
>>           |                     |
>>           +--------- tailcall --+
>>
>> Without this fixing, the CPU will be stalled because of too many tailcalls.
> 
> You still didn't explain why CPU might stall :/ if you stumble upon any
> kernel splats it's good habit to include them. So, for future readers of
> this thread, I reduced one of your examples down to:
> 
Thanks for your help.

Here is a GitHub CI history of this patchset:

https://github.com/kernel-patches/bpf/pull/5807/checks

In this CI results, the test_progs failed to run on aarch64 and s390x
because of "rcu: INFO: rcu_sched self-detected stall on CPU".

But why does CPU stall?

It's because there are too many tailcalls to run on the bpf prog. How many
tailcalls? Why MAX_TAIL_CALL_CNT limit does not work for those cases?

Let's have a look at a simple case:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_legacy.h"

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	volatile int ret = 1;

	count++;
	subprog_tail(skb); /* subprog call1 */
	subprog_tail(skb); /* subprog call2 */

	return ret;
}

char __license[] SEC("license") = "GPL";

And the entry bpf prog is populated to the 0th slot of jmp_table. Then,
what happens when entry bpf prog runs?

At the very first time when subprog_tail() is called, subprog_tail() does
tailcall the entry bpf prog. Then, subprog_taill() is called at second time
at the position subprog call1, and it tailcalls the entry bpf prog again.

Then, again and again. At the very first time when MAX_TAIL_CALL_CNT limit
works, subprog_tail() has been called for 34 times at the position subprog
call1. And at this time, the tail_call_cnt is 33 in subprog_tail().

Next, the 34th subprog_tail() returns to entry() because of
MAX_TAIL_CALL_CNT limit.

In entry(), the 34th entry(), at the time after the 34th subprog_tail() at
the position subprog call1 finishes and before the 1st subprog_tail() at
the position subprog call2 calls in entry(), what's the value of
tail_call_cnt in entry()? It's 33.

As we know, tail_all_cnt is pushed on the stack of entry(), and propagates
to subprog_tail() by %rax from stack.

Then, at the time when subprog_tail() at the position subprog call2 is
called for its first time, tail_call_cnt 33 propagates to subprog_tail()
by %rax. And the tailcall in subprog_tail() is aborted because of
tail_call_cnt >= MAX_TAIL_CALL_CNT too.

Then, subprog_tail() at the position subprog call2 ends, and the 34th
entry() ends. And it returns to the 33rd subprog_tail() called from the
position subprog call1. But wait, at this time, what's the value of
tail_call_cnt under the stack of subprog_tail()? It's 33.

Then, in the 33rd entry(), at the time after the 33th subprog_tail() at
the position subprog call1 finishes and before the 2nd subprog_tail() at
the position subprog call2 calls, what's the value of tail_call_cnt
in current entry()? It's **32**. Why not 33?

Before stepping into subprog_tail() at the position subprog call2 in 33rd
entry(), like stopping the time, let's have a look at the stack memory:

  |  STACK  |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  +---------+ RSP  <-- current rsp
  |   rip   | STACK of 34rd entry()
  |   rbp   | reuse the STACK of 33rd subprog_tail() at the position
  |   ret   |                                        subprog call1
  |   tcc   | its value is 33
  +---------+ rsp
  |   rip   | STACK of 1st subprog_tail() at the position subprog call2
  |   rbp   |
  |   tcc   | its value is 33
  +---------+ rsp

Why not 33? It's because tail_call_cnt does not back-propagate from
subprog_tail() to entry().

Then, while stepping into subprog_tail() at the position subprog call2 in 33rd
entry():

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 32; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Then, while pausing after tailcalling in 2nd subprog_tail() at the position
subprog call2:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   tcc   | its value is 33; STACK of subprog_tail() at the position
  +---------+ RSP  <-- current rsp                        subprog call2

Note: what happens to tail_call_cnt:
	/*
	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
	 *	goto out;
	 */
It's to check >= MAX_TAIL_CALL_CNT first and then increment tail_call_cnt.

So, current tailcall is allowed to run.

Then, entry() is tailcalled. And the stack memory status is:

  |  STACK  |
  +---------+
  |   ret   | STACK of 33rd entry()
  |   tcc   | its value is 32
  |   rip   |
  |   rbp   |
  +---------+ RBP  <-- current rbp
  |   ret   | STACK of 35th entry(); reuse STACK of subprog_tail() at the
  |   tcc   | its value is 33                   the position subprog call2
  +---------+ RSP  <-- current rsp

So, the tailcalls in the 35th entry() will be aborted.

And, ..., again and again.  :(

And, I hope you have understood the reason why MAX_TAIL_CALL_CNT limit
does not work for those cases.

And, how many tailcalls are there if CPU does not stall?

Let's count it. 1+1+1+...+1=?

Let's build a math model to calculate the total tailcall count. Sorry, I'm
poor at math.

How about top-down view? Does it look like hierarchy layer and layer?

I think it is a hierarchy layer model with 2+4+8+...+2**33 tailcalls. As a
result, if CPU does not stall, there will be 2**34 - 2 = 17,179,869,182
tailcalls. That's the guy which makes CPU stalled.

What about there are N subprog_tail() in entry()? If CPU does not stall
because of too many tailcalls, there will be almost N**34 tailcalls.

> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
> #include "bpf_legacy.h"
> 
> struct {
>         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>         __uint(max_entries, 1);
>         __uint(key_size, sizeof(__u32));
>         __uint(value_size, sizeof(__u32));
> } jmp_table SEC(".maps");
> 
> static __noinline
> int subprog_tail(struct __sk_buff *skb)
> {
>         bpf_tail_call_static(skb, &jmp_table, 0);
>         return 0;
> }
> 
> SEC("tc")
> int entry(struct __sk_buff *skb)
> {
>         subprog_tail(skb);
>         return 0;
> }
> 
> char __license[] SEC("license") = "GPL";
> 
> and indeed CPU got stuck:
> 
> [10623.892978] RIP: 0010:bpf_prog_6abcef0aca85f2fd_subprog_tail+0x49/0x59
> [10623.899606] Code: 39 56 24 76 30 8b 85 fc ff ff ff 83 f8 21 73 25 83 c0 01 89 85 fc ff ff ff 48 8b 8c d6 10 01 00 00 48 85 c9 74 0f 41 5d 5b 58 <48> 8b 49 30 48 83 c1 0b ff e1 cc 41 5d 5b c9 c3 cc cc cc cc cc cc
> [10623.918646] RSP: 0018:ffffc900202b78e0 EFLAGS: 00000286
> [10623.923952] RAX: 0000002100000000 RBX: ffff8881088aa100 RCX: ffffc9000d319000
> [10623.931194] RDX: 0000000000000000 RSI: ffff88811e6d2e00 RDI: ffff8881088aa100
> [10623.938439] RBP: ffffc900202b78e0 R08: ffffc900202b7dd4 R09: 0000000000000000
> [10623.945686] R10: 736391e000000000 R11: 0000000000000000 R12: 0000000000000001
> [10623.952927] R13: ffffc9000d38d048 R14: ffffc900202b7dd4 R15: ffffffff81ae756f
> [10623.960174]  ? bpf_test_run+0xef/0x320
> [10623.963988]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.969826]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.975662]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.981500]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10623.987334]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10623.993174]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10623.999016]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.004854]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.010688]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.016529]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.022371]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.028209]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.034043]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.043211]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.052344]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.061453]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.070404]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.079200]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.087841]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.096333]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.104680]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.112886]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.121014]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.129048]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.136949]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.144716]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.152479]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.160209]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.167905]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.175565]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77
> [10624.183173]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.190738]  bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77
> [10624.198262]  bpf_test_run+0x186/0x320
> [10624.203677]  ? kmem_cache_alloc+0x14d/0x2c0
> [10624.209615]  bpf_prog_test_run_skb+0x35c/0x710
> [10624.215810]  __sys_bpf+0xf3a/0x2d80
> [10624.221068]  ? do_mmap+0x409/0x5e0
> [10624.226217]  __x64_sys_bpf+0x1a/0x30
> [10624.231693]  do_syscall_64+0x2e/0xa0
> [10624.236988]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [10624.243550] RIP: 0033:0x7fd76171e69d
> [10624.248576] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 63 a7 0f 00 f7 d8 64 89 01 48
> [10624.270613] RSP: 002b:00007ffe694f8d68 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
> [10624.279894] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd76171e69d
> [10624.288769] RDX: 0000000000000050 RSI: 00007ffe694f8db0 RDI: 000000000000000a
> [10624.297634] RBP: 00007ffe694f8d80 R08: 00007ffe694f8d37 R09: 00007ffe694f8db0
> [10624.306455] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe694f9348
> [10624.315260] R13: 000055bb8423f6d9 R14: 000055bb8496ac90 R15: 00007fd761925040
> 
> Looking at assembly code:
> 
> entry:
> ffffffffc0012c60 <load3+0x12c60>:
> ffffffffc0012c60:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0012c65:       31 c0                   xor    %eax,%eax
> ffffffffc0012c67:       55                      push   %rbp
> ffffffffc0012c68:       48 89 e5                mov    %rsp,%rbp
> ffffffffc0012c6b:       50                      push   %rax
> ffffffffc0012c6c:       48 8b 85 f8 ff ff ff    mov    -0x8(%rbp),%rax
> ffffffffc0012c73:       e8 30 00 00 00          call   0xffffffffc0012ca8
> ffffffffc0012c78:       31 c0                   xor    %eax,%eax
> ffffffffc0012c7a:       c9                      leave
> ffffffffc0012c7b:       c3                      ret
> 
> subprog_tail:
> ffffffffc0012ca8 <load3+0x12ca8>:
> ffffffffc0012ca8:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0012cad:       66 90                   xchg   %ax,%ax
> ffffffffc0012caf:       55                      push   %rbp
> ffffffffc0012cb0:       48 89 e5                mov    %rsp,%rbp
> ffffffffc0012cb3:       50                      push   %rax
> ffffffffc0012cb4:       53                      push   %rbx
> ffffffffc0012cb5:       41 55                   push   %r13
> ffffffffc0012cb7:       48 89 fb                mov    %rdi,%rbx
> ffffffffc0012cba:       49 bd 00 7c 73 0c 81    movabs $0xffff88810c737c00,%r13
> ffffffffc0012cc1:       88 ff ff
> ffffffffc0012cc4:       48 89 df                mov    %rbx,%rdi
> ffffffffc0012cc7:       4c 89 ee                mov    %r13,%rsi
> ffffffffc0012cca:       31 d2                   xor    %edx,%edx
> ffffffffc0012ccc:       8b 85 fc ff ff ff       mov    -0x4(%rbp),%eax
> ffffffffc0012cd2:       83 f8 21                cmp    $0x21,%eax
> ffffffffc0012cd5:       73 17                   jae    0xffffffffc0012cee
> ffffffffc0012cd7:       83 c0 01                add    $0x1,%eax
> ffffffffc0012cda:       89 85 fc ff ff ff       mov    %eax,-0x4(%rbp)
> ffffffffc0012ce0:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0012ce5:       41 5d                   pop    %r13
> ffffffffc0012ce7:       5b                      pop    %rbx
> ffffffffc0012ce8:       58                      pop    %rax
> ffffffffc0012ce9:       e9 7d ff ff ff          jmp    0xffffffffc0012c6b
> ffffffffc0012cee:       41 5d                   pop    %r13
> ffffffffc0012cf0:       5b                      pop    %rbx
> ffffffffc0012cf1:       c9                      leave
> ffffffffc0012cf2:       c3                      ret
> 
> I am trying to understand what are writing to %rax when entry is target of
> tailcall. rbp comes from subprog_tail and we popped all of the previously
> pushed regs. So what is exactly causing this hang?
> 

Oh, your example is wierd to make CPU stuck. I don't know why, either.

> (I thought I'd rather respond just to let you know I'm on it)
> 

Thanks for your reminding.

>>
>>>
>>> Can you elaborate a bit more about the kernel crash you mention in the
>>> last paragraph?
>>
>> We have progs, prog1, prog2, prog3 and prog4, and the scenario:
>>
>> case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3
>> case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3
>> case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3
>>                          --> subprog4 --tailcall-> prog4
>>
>> How does prog2 back-propagate tail_call_cnt to prog1?
>>
>> Possible way 1:
>> When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to
>> back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do
>> not handle the value in RCX register, like case2.
>>
>> Possible way 2:
>> Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0
>> at epilogue when current prog has tailcall?
>> No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt,
>> because prog3 has no tailcall and won't populate RCX register with tail_call_cnt.
> 
> I don't get it. Let's start with small example and do the baby steps.
> 

I lost the details of back-propagating solutions which I tried hard to
convince myself the solutions were not good enough as my expectation.

Sorry about that.

>>
>> However, I don't like the back-propagating way. Then, I "burn" my brain to figure
>> out pointer propagating ways.
> 
> I know that code can damage programmer's brain, that's why we need to
> strive for elaborative and easy to understand commit messages so that
> later on we will be able to pick up what is going on here.
> 

Yeah, it's not easy to understand the code of tailcall on arch, like x86.

I'm trying hard to explain this patch.

>>
>> RFC PATCH v1 way:
>> Propagate tail_call_cnt and its pointer together. Then, the pointer is used to
>> check MAX_TAIL_CALL_CNT and increment tail_call_cnt.
>>
>>     |  STACK  |
>>     +---------+ RBP
>>     |         |
>>     |         |
>>     |         |
>>  +--| tcc_ptr |
>>  +->|   tcc   |
>>     |   rbx   |
>>     +---------+ RSP
>>
>> RFC PATCH v2 way (current patchset):
>> Propagate tail_call_cnt pointer only. Then, the pointer is used to check
>> MAX_TAIL_CALL_CNT and increment tail_call_cnt.
>>
>>     |  STACK  |
>>     |         |
>>     |   rip   |
>>  +->|   tcc   |
>>  |  |   rip   |
>>  |  |   rbp   |
>>  |  +---------+ RBP
>>  |  |         |
>>  |  |         |
>>  |  |         |
>>  +--| tcc_ptr |
>>     |   rbx   |
>>     +---------+ RSP
>>

It's time to describe the details of this patch. And take a look at how it
handles the above simple case.

This patch is to keep tail_call_cnt on the stack of entry bpf prog's caller,
and then propagate tail_call_cnt pointer like the way of the previous
tail_call_cnt.

So, at the epilogue of entry bpf prog, before pushing %rbp, it has to
initialise tail_call_cnt and to push it to stack by using %rax. Then, make
%rax as the pointer that points to tail_call_cnt, by "mov rax, rsp". Then,
call the main part of the entry bpf prog by "call 2". **This is the
exceptional point.** With this "call", %rip is pushed to stack. And at the
end of the entry bpf prog runtime, the %rip is popped from stack; then, pop
tail_call_cnt from stack too; and finally "ret" again. The popping
tail_call_cnt and "ret" is the 2 in "call 2".

So, when "call 2", %rax is the tail_call_cnt pointer. And then,
tail_call_cnt pointer will be pushed to stack. This is the way to push %rax
when entry bpf prog is tailcalled, too.

So, when a subprog is called, tail_call_cnt pointer is popped from stack to
%rax.

And when a tailcall happens, load tail_call_cnt pointer from stack to %rax
by "mov rax, qword ptr [rbp - tcc_ptr_off]", and compare tail_call_cnt with
MAX_TAIL_CALL_CNT by "cmp dword ptr [rax], MAX_TAIL_CALL_CNT", and then
increment tail_call_cnt by "add dword ptr [rax], 1". Finally, when pop %rax,
it's to pop tail_call_cnt pointer from stack to %rax.

When the epilogue of entry() runs, the stack of entry() should be like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 0
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
    |   rbx   | saved regs
    +---------+ RSP

Then, when subprog_tail() is called for its very first time, its stack
should be like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 0
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP

Then, when subprog_tail() tailcalls entry():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 1
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
    +---------+ RSP

Then, when entry() calls subprog_tail():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 1
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP

Then, when subprog_tail() tailcalls entry():

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 2
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ RBP
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
    +---------+ RSP

Then, again and again. At the very first time when MAX_TAIL_CALL_CNT limit
works, subprog_tail() has been called for 34 times at the position subprog
call1. And at this time, the stack should be like:

    |  STACK  | STACK of entry()'s caller
    |         |
    |   rip   |
 +->|   tcc   | its value is 33
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry()
 +--| tcc_ptr |
 |  |   rbx   | saved regs
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |   ret   | STACK of entry(), reuse STACK of subprog_tail()
 +--| tcc_ptr |
 |  |   rip   |
 |  |   rbp   |
 |  +---------+ rbp
 |  |    *    |
 |  |    *    |
 |  |    *    |
 |  +---------+ RBP
 +--| tcc_ptr | STACK of subprog_tail()
    +---------+ RSP

After this time, the tailcalls in the future will be aborted because
tail_call_cnt has been 33, which reaches its MAX_TAIL_CALL_CNT limit.

That's the way how this patch works.

It's really nice if you reach here. I hope you have a clear idea to
understand this patch by reading above replys.

>>>
>>> I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has
>>> changed in the meantime to 33 and we should adjust the max allowed stack
>>> depth of subprogs? I believe this was brought up at LPC?
>>
>> There's following code snippet in verifier.c:
>>
>> 	/* protect against potential stack overflow that might happen when
>> 	 * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
>> 	 * depth for such case down to 256 so that the worst case scenario
>> 	 * would result in 8k stack size (32 which is tailcall limit * 256 =
>> 	 * 8k).
>>
>> But, MAX_TAIL_CALL_CNT is 33.
>>
>> This was not brought up at LPC 2022&2023. I don't know whether this was
>> brought up at previous LPCs.
> 
> Was referring to this:
> https://lpc.events/event/17/contributions/1595/attachments/1230/2506/LPC2023_slides.pdf
> 

That's interesting. I tried to overflow the stack of one bpf prog, whose
max stack size limits to 8k, with fexit. I failed because the stack size
of thread is not so that small. But, exceptionally, it was able to compose
a tailcall infinite loop with fexit, which broke the tail_call_cnt
propagating chain because it did not pass through the fexit's trampoline.

Thanks,
Leon

> but let us focus on issue you're bringing up here in this patch. I brought
> this up thinking JIT code was fine, now it's clear that things go south
> when entry prog is tailcall target, which we didn't test.
> 
>>
>> Thanks,
>> Leon
>>
Fijalkowski, Maciej Dec. 21, 2023, 12:02 p.m. UTC | #5
On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
> handling in JIT"), the tailcall on x64 works better than before.
> 
> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
> 
> How about:
> 
> 1. More than 1 subprograms are called in a bpf program.
> 2. The tailcalls in the subprograms call the bpf program.
> 
> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
> 
> As we know, in tail call context, the tail_call_cnt propagates by stack
> and rax register between BPF subprograms. So, propagating tail_call_cnt
> pointer by stack and rax register makes tail_call_cnt as like a global
> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
> hierarchy cases.
> 
> Before jumping to other bpf prog, load tail_call_cnt from the pointer
> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
> tail_call_cnt by its pointer.
> 
> But, where does tail_call_cnt store?
> 
> It stores on the stack of bpf prog's caller, like
> 
>     |  STACK  |
>     |         |
>     |   rip   |
>  +->|   tcc   |
>  |  |   rip   |
>  |  |   rbp   |
>  |  +---------+ RBP
>  |  |         |
>  |  |         |
>  |  |         |
>  +--| tcc_ptr |
>     |   rbx   |
>     +---------+ RSP
> 
> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
> instruction on BPF JIT epilogue").
> 
> Why not back-propagate tail_call_cnt?
> 
> It's because it's vulnerable to back-propagate it. It's unable to work
> well with the following case.
> 
> int prog1();
> int prog2();
> 
> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
> same time? The answer is NO. Otherwise, there will be a register to be
> polluted, which will make kernel crash.
> 
> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
>  1 file changed, 24 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index c2a0465d37da4..36631129cc800 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -256,7 +256,7 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE		5
>  /* Number of bytes that will be skipped on tailcall */
> -#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
> +#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
>  
>  static void push_r12(u8 **pprog)
>  {
> @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	EMIT_ENDBR();
>  	emit_nops(&prog, X86_PATCH_SIZE);
>  	if (!ebpf_from_cbpf) {
> -		if (tail_call_reachable && !is_subprog)
> +		if (tail_call_reachable && !is_subprog) {
>  			/* When it's the entry of the whole tailcall context,
>  			 * zeroing rax means initialising tail_call_cnt.
>  			 */
> -			EMIT2(0x31, 0xC0); /* xor eax, eax */
> -		else
> -			/* Keep the same instruction layout. */
> -			EMIT2(0x66, 0x90); /* nop2 */
> +			EMIT2(0x31, 0xC0);       /* xor eax, eax */
> +			EMIT1(0x50);             /* push rax */
> +			/* Make rax as ptr that points to tail_call_cnt. */
> +			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
> +			EMIT1_off32(0xE8, 2);    /* call main prog */
> +			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
> +			EMIT1(0xC3);             /* ret */
> +		} else {
> +			/* Keep the same instruction size. */
> +			emit_nops(&prog, 13);
> +		}

At first sight it seemed to me too invasive but after trying out few other
approaches in the end it is elegant.

I wanted to avoid a bit puzzling call insn in the prologue with a following
prologue layout (this will be based on entry prog from tailcall_bpf2bpf3.c that
was the first one to break):

ffffffffc0012cb4:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0012cb9:       55                      push   %rbp
ffffffffc0012cba:       48 89 e5                mov    %rsp,%rbp
ffffffffc0012cbd:       48 83 ec 10             sub    $0x10,%rsp
ffffffffc0012cc1:       48 89 65 f8             mov    %rsp,-0x8(%rbp)
ffffffffc0012cc5:       48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
ffffffffc0012ccc:       00
ffffffffc0012ccd:       48 8b 45 f8             mov    -0x8(%rbp),%rax
ffffffffc0012cd1:       50                      push   %rax
ffffffffc0012cd2:       48 81 ec 80 00 00 00    sub    $0x80,%rsp

So we would have hidden 16 bytes on stack at the *beginning* of entry stack
frame. First thing right after rbp would be tcc pointer so referring to it
wouldn't require us to take into account stack depth. But then if we
follow with rest of insns:

ffffffffc0012cd9:       31 f6                   xor    %esi,%esi
ffffffffc0012cdb:       48 89 75 f8             mov    %rsi,-0x8(%rbp)  // BUG, overwrite of tcc ptr
ffffffffc0012cdf:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
ffffffffc0012ce3:       48 89 75 e8             mov    %rsi,-0x18(%rbp)
ffffffffc0012ce7:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
ffffffffc0012ceb:       48 89 75 d8             mov    %rsi,-0x28(%rbp)
ffffffffc0012cef:       48 89 75 d0             mov    %rsi,-0x30(%rbp)
ffffffffc0012cf3:       48 89 75 c8             mov    %rsi,-0x38(%rbp)
ffffffffc0012cf7:       48 89 75 c0             mov    %rsi,-0x40(%rbp)
ffffffffc0012cfb:       48 89 75 b8             mov    %rsi,-0x48(%rbp)
ffffffffc0012cff:       48 89 75 b0             mov    %rsi,-0x50(%rbp)
ffffffffc0012d03:       48 89 75 a8             mov    %rsi,-0x58(%rbp)
ffffffffc0012d07:       48 89 75 a0             mov    %rsi,-0x60(%rbp)
ffffffffc0012d0b:       48 89 75 98             mov    %rsi,-0x68(%rbp)
ffffffffc0012d0f:       48 89 75 90             mov    %rsi,-0x70(%rbp)
ffffffffc0012d13:       48 89 75 88             mov    %rsi,-0x78(%rbp)
ffffffffc0012d17:       48 89 75 80             mov    %rsi,-0x80(%rbp)
ffffffffc0012d1b:       48 0f b6 75 ff          movzbq -0x1(%rbp),%rsi
ffffffffc0012d20:       40 88 75 ff             mov    %sil,-0x1(%rbp)
ffffffffc0012d24:       48 8b 85 f8 ff ff ff    mov    -0x8(%rbp),%rax
ffffffffc0012d2b:       e8 30 00 00 00          call   0xffffffffc0012d60
ffffffffc0012d30:       c9                      leave
ffffffffc0012d31:       c3                      ret

So even though it would seem more obvious while looking at prologue what
is the intent behind it, this would require us to patch the instructions
that make us of R10/stack, which in the end would be way more invasive.

After all, for x86 JIT code:
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

but it is a must to have a better commit message here.

Thanks!

>  	}
>  	/* Exception callback receives FP as third parameter */
>  	if (is_exception_cb) {
> @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>  	if (tail_call_reachable)
> +		/* Here, rax is tail_call_cnt_ptr. */
>  		EMIT1(0x50);         /* push rax */
>  	*pprog = prog;
>  }
> @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  					u32 stack_depth, u8 *ip,
>  					struct jit_context *ctx)
>  {
> -	int tcc_off = -4 - round_up(stack_depth, 8);
> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
>  
> @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>  
>  	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>  	EMIT2(X86_JAE, offset);                   /* jae out */
> -	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
> +	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
>  
>  	/* prog = array->ptrs[index]; */
>  	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>  		pop_callee_regs(&prog, callee_regs_used);
>  	}
>  
> +	/* pop tail_call_cnt_ptr */
>  	EMIT1(0x58);                              /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
> @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  				      bool *callee_regs_used, u32 stack_depth,
>  				      struct jit_context *ctx)
>  {
> -	int tcc_off = -4 - round_up(stack_depth, 8);
> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
>  
> @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>  
>  	offset = ctx->tail_call_direct_label - (prog + 2 - start);
>  	EMIT2(X86_JAE, offset);                       /* jae out */
> -	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
> +	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
>  
>  	poke->tailcall_bypass = ip + (prog - start);
>  	poke->adj_off = X86_TAIL_CALL_OFFSET;
> @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>  		pop_callee_regs(&prog, callee_regs_used);
>  	}
>  
> +	/* pop tail_call_cnt_ptr */
>  	EMIT1(0x58);                                  /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
> -- 
> 2.41.0
> 
>
Leon Hwang Dec. 21, 2023, 2:56 p.m. UTC | #6
On 2023/12/21 20:02, Maciej Fijalkowski wrote:
> On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote:
>> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall
>> handling in JIT"), the tailcall on x64 works better than before.
>>
>> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms
>> for x64 JIT"), tailcall is able to run in BPF subprograms on x64.
>>
>> How about:
>>
>> 1. More than 1 subprograms are called in a bpf program.
>> 2. The tailcalls in the subprograms call the bpf program.
>>
>> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy
>> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case.
>>
>> As we know, in tail call context, the tail_call_cnt propagates by stack
>> and rax register between BPF subprograms. So, propagating tail_call_cnt
>> pointer by stack and rax register makes tail_call_cnt as like a global
>> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall
>> hierarchy cases.
>>
>> Before jumping to other bpf prog, load tail_call_cnt from the pointer
>> and then compare with MAX_TAIL_CALL_CNT. Finally, increment
>> tail_call_cnt by its pointer.
>>
>> But, where does tail_call_cnt store?
>>
>> It stores on the stack of bpf prog's caller, like
>>
>>     |  STACK  |
>>     |         |
>>     |   rip   |
>>  +->|   tcc   |
>>  |  |   rip   |
>>  |  |   rbp   |
>>  |  +---------+ RBP
>>  |  |         |
>>  |  |         |
>>  |  |         |
>>  +--| tcc_ptr |
>>     |   rbx   |
>>     +---------+ RSP
>>
>> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf
>> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx"
>> instruction on BPF JIT epilogue").
>>
>> Why not back-propagate tail_call_cnt?
>>
>> It's because it's vulnerable to back-propagate it. It's unable to work
>> well with the following case.
>>
>> int prog1();
>> int prog2();
>>
>> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate
>> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the
>> same time? The answer is NO. Otherwise, there will be a register to be
>> polluted, which will make kernel crash.
>>
>> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
>> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
>> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
>> ---
>>  arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++---------------
>>  1 file changed, 24 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index c2a0465d37da4..36631129cc800 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -256,7 +256,7 @@ struct jit_context {
>>  /* Number of bytes emit_patch() needs to generate instructions */
>>  #define X86_PATCH_SIZE		5
>>  /* Number of bytes that will be skipped on tailcall */
>> -#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
>> +#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
>>  
>>  static void push_r12(u8 **pprog)
>>  {
>> @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>  	EMIT_ENDBR();
>>  	emit_nops(&prog, X86_PATCH_SIZE);
>>  	if (!ebpf_from_cbpf) {
>> -		if (tail_call_reachable && !is_subprog)
>> +		if (tail_call_reachable && !is_subprog) {
>>  			/* When it's the entry of the whole tailcall context,
>>  			 * zeroing rax means initialising tail_call_cnt.
>>  			 */
>> -			EMIT2(0x31, 0xC0); /* xor eax, eax */
>> -		else
>> -			/* Keep the same instruction layout. */
>> -			EMIT2(0x66, 0x90); /* nop2 */
>> +			EMIT2(0x31, 0xC0);       /* xor eax, eax */
>> +			EMIT1(0x50);             /* push rax */
>> +			/* Make rax as ptr that points to tail_call_cnt. */
>> +			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
>> +			EMIT1_off32(0xE8, 2);    /* call main prog */
>> +			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
>> +			EMIT1(0xC3);             /* ret */
>> +		} else {
>> +			/* Keep the same instruction size. */
>> +			emit_nops(&prog, 13);
>> +		}
> 
> At first sight it seemed to me too invasive but after trying out few other
> approaches in the end it is elegant.
> 
> I wanted to avoid a bit puzzling call insn in the prologue with a following
> prologue layout (this will be based on entry prog from tailcall_bpf2bpf3.c that
> was the first one to break):
> 
> ffffffffc0012cb4:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0012cb9:       55                      push   %rbp
> ffffffffc0012cba:       48 89 e5                mov    %rsp,%rbp
> ffffffffc0012cbd:       48 83 ec 10             sub    $0x10,%rsp
> ffffffffc0012cc1:       48 89 65 f8             mov    %rsp,-0x8(%rbp)
> ffffffffc0012cc5:       48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
> ffffffffc0012ccc:       00
> ffffffffc0012ccd:       48 8b 45 f8             mov    -0x8(%rbp),%rax
> ffffffffc0012cd1:       50                      push   %rax
> ffffffffc0012cd2:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
> 
> So we would have hidden 16 bytes on stack at the *beginning* of entry stack
> frame. First thing right after rbp would be tcc pointer so referring to it
> wouldn't require us to take into account stack depth. But then if we
> follow with rest of insns:
> 
> ffffffffc0012cd9:       31 f6                   xor    %esi,%esi
> ffffffffc0012cdb:       48 89 75 f8             mov    %rsi,-0x8(%rbp)  // BUG, overwrite of tcc ptr
> ffffffffc0012cdf:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
> ffffffffc0012ce3:       48 89 75 e8             mov    %rsi,-0x18(%rbp)
> ffffffffc0012ce7:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
> ffffffffc0012ceb:       48 89 75 d8             mov    %rsi,-0x28(%rbp)
> ffffffffc0012cef:       48 89 75 d0             mov    %rsi,-0x30(%rbp)
> ffffffffc0012cf3:       48 89 75 c8             mov    %rsi,-0x38(%rbp)
> ffffffffc0012cf7:       48 89 75 c0             mov    %rsi,-0x40(%rbp)
> ffffffffc0012cfb:       48 89 75 b8             mov    %rsi,-0x48(%rbp)
> ffffffffc0012cff:       48 89 75 b0             mov    %rsi,-0x50(%rbp)
> ffffffffc0012d03:       48 89 75 a8             mov    %rsi,-0x58(%rbp)
> ffffffffc0012d07:       48 89 75 a0             mov    %rsi,-0x60(%rbp)
> ffffffffc0012d0b:       48 89 75 98             mov    %rsi,-0x68(%rbp)
> ffffffffc0012d0f:       48 89 75 90             mov    %rsi,-0x70(%rbp)
> ffffffffc0012d13:       48 89 75 88             mov    %rsi,-0x78(%rbp)
> ffffffffc0012d17:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> ffffffffc0012d1b:       48 0f b6 75 ff          movzbq -0x1(%rbp),%rsi
> ffffffffc0012d20:       40 88 75 ff             mov    %sil,-0x1(%rbp)
> ffffffffc0012d24:       48 8b 85 f8 ff ff ff    mov    -0x8(%rbp),%rax
> ffffffffc0012d2b:       e8 30 00 00 00          call   0xffffffffc0012d60
> ffffffffc0012d30:       c9                      leave
> ffffffffc0012d31:       c3                      ret
> 
> So even though it would seem more obvious while looking at prologue what
> is the intent behind it, this would require us to patch the instructions
> that make us of R10/stack, which in the end would be way more invasive.
> 
> After all, for x86 JIT code:
> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Thanks for your review.

> 
> but it is a must to have a better commit message here.
> 

I'll write a better commit message here.

Thanks,
Leon

> Thanks!
> 
>>  	}
>>  	/* Exception callback receives FP as third parameter */
>>  	if (is_exception_cb) {
>> @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>>  	if (stack_depth)
>>  		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
>>  	if (tail_call_reachable)
>> +		/* Here, rax is tail_call_cnt_ptr. */
>>  		EMIT1(0x50);         /* push rax */
>>  	*pprog = prog;
>>  }
>> @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>  					u32 stack_depth, u8 *ip,
>>  					struct jit_context *ctx)
>>  {
>> -	int tcc_off = -4 - round_up(stack_depth, 8);
>> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>>  	u8 *prog = *pprog, *start = *pprog;
>>  	int offset;
>>  
>> @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>>  	 *	goto out;
>>  	 */
>> -	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
>> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
>> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
>> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>>  
>>  	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>>  	EMIT2(X86_JAE, offset);                   /* jae out */
>> -	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
>> -	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
>> +	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
>>  
>>  	/* prog = array->ptrs[index]; */
>>  	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
>> @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
>>  		pop_callee_regs(&prog, callee_regs_used);
>>  	}
>>  
>> +	/* pop tail_call_cnt_ptr */
>>  	EMIT1(0x58);                              /* pop rax */
>>  	if (stack_depth)
>>  		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
>> @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>>  				      bool *callee_regs_used, u32 stack_depth,
>>  				      struct jit_context *ctx)
>>  {
>> -	int tcc_off = -4 - round_up(stack_depth, 8);
>> +	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
>>  	u8 *prog = *pprog, *start = *pprog;
>>  	int offset;
>>  
>> @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>>  	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
>>  	 *	goto out;
>>  	 */
>> -	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
>> -	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
>> +	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
>> +	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
>>  
>>  	offset = ctx->tail_call_direct_label - (prog + 2 - start);
>>  	EMIT2(X86_JAE, offset);                       /* jae out */
>> -	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
>> -	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
>> +	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
>>  
>>  	poke->tailcall_bypass = ip + (prog - start);
>>  	poke->adj_off = X86_TAIL_CALL_OFFSET;
>> @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
>>  		pop_callee_regs(&prog, callee_regs_used);
>>  	}
>>  
>> +	/* pop tail_call_cnt_ptr */
>>  	EMIT1(0x58);                                  /* pop rax */
>>  	if (stack_depth)
>>  		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
>> -- 
>> 2.41.0
>>
>>
Alexei Starovoitov Jan. 4, 2024, 6:23 a.m. UTC | #7
On Thu, Dec 21, 2023 at 6:56 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
>
> >
> > but it is a must to have a better commit message here.
> >
>
> I'll write a better commit message here.

Please respin with updated commit messages and drop RFC tag.
Keep Reviewed-by that you received already.
diff mbox series

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index c2a0465d37da4..36631129cc800 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -256,7 +256,7 @@  struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(22 + ENDBR_INSN_SIZE)
 
 static void push_r12(u8 **pprog)
 {
@@ -340,14 +340,21 @@  static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	EMIT_ENDBR();
 	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
-		if (tail_call_reachable && !is_subprog)
+		if (tail_call_reachable && !is_subprog) {
 			/* When it's the entry of the whole tailcall context,
 			 * zeroing rax means initialising tail_call_cnt.
 			 */
-			EMIT2(0x31, 0xC0); /* xor eax, eax */
-		else
-			/* Keep the same instruction layout. */
-			EMIT2(0x66, 0x90); /* nop2 */
+			EMIT2(0x31, 0xC0);       /* xor eax, eax */
+			EMIT1(0x50);             /* push rax */
+			/* Make rax as ptr that points to tail_call_cnt. */
+			EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */
+			EMIT1_off32(0xE8, 2);    /* call main prog */
+			EMIT1(0x59);             /* pop rcx, get rid of tail_call_cnt */
+			EMIT1(0xC3);             /* ret */
+		} else {
+			/* Keep the same instruction size. */
+			emit_nops(&prog, 13);
+		}
 	}
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
@@ -373,6 +380,7 @@  static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
 	if (tail_call_reachable)
+		/* Here, rax is tail_call_cnt_ptr. */
 		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }
@@ -528,7 +536,7 @@  static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 					u32 stack_depth, u8 *ip,
 					struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
+	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
 
@@ -553,13 +561,12 @@  static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
+	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */
+	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);     /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
 
 	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                   /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT3(0x83, 0x00, 0x01);                  /* add dword ptr [rax], 1 */
 
 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
@@ -581,6 +588,7 @@  static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}
 
+	/* pop tail_call_cnt_ptr */
 	EMIT1(0x58);                              /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
@@ -609,7 +617,7 @@  static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 				      bool *callee_regs_used, u32 stack_depth,
 				      struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
+	int tcc_ptr_off = -8 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
 
@@ -617,13 +625,12 @@  static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
+	EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off);   /* mov rax, qword ptr [rbp - tcc_ptr_off] */
+	EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT);         /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */
 
 	offset = ctx->tail_call_direct_label - (prog + 2 - start);
 	EMIT2(X86_JAE, offset);                       /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT3(0x83, 0x00, 0x01);                      /* add dword ptr [rax], 1 */
 
 	poke->tailcall_bypass = ip + (prog - start);
 	poke->adj_off = X86_TAIL_CALL_OFFSET;
@@ -640,6 +647,7 @@  static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}
 
+	/* pop tail_call_cnt_ptr */
 	EMIT1(0x58);                                  /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));