Message ID | 20231011152725.95895-3-hffilwlqm@gmail.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | BPF |
Headers | show |
Series | bpf, x64: Fix tailcall hierarchy | expand |
On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote: > From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall > handling in JIT"), the tailcall on x64 works better than before. > > From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms > for x64 JIT"), tailcall is able to run in BPF subprograms on x64. > > How about: > > 1. More than 1 subprograms are called in a bpf program. > 2. The tailcalls in the subprograms call the bpf program. > > Because of missing tail_call_cnt back-propagation, a tailcall hierarchy > comes up. And MAX_TAIL_CALL_CNT limit does not work for this case. > > As we know, in tail call context, the tail_call_cnt propagates by stack > and rax register between BPF subprograms. So, propagating tail_call_cnt > pointer by stack and rax register makes tail_call_cnt as like a global > variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall > hierarchy cases. > > Before jumping to other bpf prog, load tail_call_cnt from the pointer > and then compare with MAX_TAIL_CALL_CNT. Finally, increment > tail_call_cnt by its pointer. > > But, where does tail_call_cnt store? > > It stores on the stack of bpf prog's caller, like > > | STACK | > | | > | rip | > +->| tcc | > | | rip | > | | rbp | > | +---------+ RBP > | | | > | | | > | | | > +--| tcc_ptr | > | rbx | > +---------+ RSP > > And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf > prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx" > instruction on BPF JIT epilogue"). > > Why not back-propagate tail_call_cnt? > > It's because it's vulnerable to back-propagate it. It's unable to work > well with the following case. > > int prog1(); > int prog2(); > > prog1 is tail caller, and prog2 is tail callee. If we do back-propagate > tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the > same time? The answer is NO. Otherwise, there will be a register to be > polluted, which will make kernel crash. Sorry but I keep on reading this explanation and I'm lost what is being fixed here. You want to limit the total amount of tail calls that entry prog can do to MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is to protect us from overflowing kernel stack and endless loops. As long a single call chain doesn't go over 8kB program is fine. Verifier has a limit of 256 subprogs from what I see. Can you elaborate a bit more about the kernel crash you mention in the last paragraph? I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has changed in the meantime to 33 and we should adjust the max allowed stack depth of subprogs? I believe this was brought up at LPC? > > Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") > Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") > Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> > --- > arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++--------------- > 1 file changed, 24 insertions(+), 16 deletions(-) > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > index c2a0465d37da4..36631129cc800 100644 > --- a/arch/x86/net/bpf_jit_comp.c > +++ b/arch/x86/net/bpf_jit_comp.c > @@ -256,7 +256,7 @@ struct jit_context { > /* Number of bytes emit_patch() needs to generate instructions */ > #define X86_PATCH_SIZE 5 > /* Number of bytes that will be skipped on tailcall */ > -#define X86_TAIL_CALL_OFFSET (11 + ENDBR_INSN_SIZE) > +#define X86_TAIL_CALL_OFFSET (22 + ENDBR_INSN_SIZE) > > static void push_r12(u8 **pprog) > { > @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > EMIT_ENDBR(); > emit_nops(&prog, X86_PATCH_SIZE); > if (!ebpf_from_cbpf) { > - if (tail_call_reachable && !is_subprog) > + if (tail_call_reachable && !is_subprog) { > /* When it's the entry of the whole tailcall context, > * zeroing rax means initialising tail_call_cnt. > */ > - EMIT2(0x31, 0xC0); /* xor eax, eax */ > - else > - /* Keep the same instruction layout. */ > - EMIT2(0x66, 0x90); /* nop2 */ > + EMIT2(0x31, 0xC0); /* xor eax, eax */ > + EMIT1(0x50); /* push rax */ > + /* Make rax as ptr that points to tail_call_cnt. */ > + EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */ > + EMIT1_off32(0xE8, 2); /* call main prog */ > + EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */ > + EMIT1(0xC3); /* ret */ > + } else { > + /* Keep the same instruction size. */ > + emit_nops(&prog, 13); > + } > } > /* Exception callback receives FP as third parameter */ > if (is_exception_cb) { > @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > if (stack_depth) > EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8)); > if (tail_call_reachable) > + /* Here, rax is tail_call_cnt_ptr. */ > EMIT1(0x50); /* push rax */ > *pprog = prog; > } > @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, > u32 stack_depth, u8 *ip, > struct jit_context *ctx) > { > - int tcc_off = -4 - round_up(stack_depth, 8); > + int tcc_ptr_off = -8 - round_up(stack_depth, 8); > u8 *prog = *pprog, *start = *pprog; > int offset; > > @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, > * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) > * goto out; > */ > - EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ > - EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ > + EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */ > + EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT); /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */ > > offset = ctx->tail_call_indirect_label - (prog + 2 - start); > EMIT2(X86_JAE, offset); /* jae out */ > - EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ > - EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ > + EMIT3(0x83, 0x00, 0x01); /* add dword ptr [rax], 1 */ > > /* prog = array->ptrs[index]; */ > EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6, /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */ > @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, > pop_callee_regs(&prog, callee_regs_used); > } > > + /* pop tail_call_cnt_ptr */ > EMIT1(0x58); /* pop rax */ > if (stack_depth) > EMIT3_off32(0x48, 0x81, 0xC4, /* add rsp, sd */ > @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, > bool *callee_regs_used, u32 stack_depth, > struct jit_context *ctx) > { > - int tcc_off = -4 - round_up(stack_depth, 8); > + int tcc_ptr_off = -8 - round_up(stack_depth, 8); > u8 *prog = *pprog, *start = *pprog; > int offset; > > @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, > * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) > * goto out; > */ > - EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ > - EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ > + EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */ > + EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT); /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */ > > offset = ctx->tail_call_direct_label - (prog + 2 - start); > EMIT2(X86_JAE, offset); /* jae out */ > - EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ > - EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ > + EMIT3(0x83, 0x00, 0x01); /* add dword ptr [rax], 1 */ > > poke->tailcall_bypass = ip + (prog - start); > poke->adj_off = X86_TAIL_CALL_OFFSET; > @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, > pop_callee_regs(&prog, callee_regs_used); > } > > + /* pop tail_call_cnt_ptr */ > EMIT1(0x58); /* pop rax */ > if (stack_depth) > EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8)); > -- > 2.41.0 > >
On 6/12/23 07:03, Maciej Fijalkowski wrote: > On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote: >> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall >> handling in JIT"), the tailcall on x64 works better than before. >> >> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms >> for x64 JIT"), tailcall is able to run in BPF subprograms on x64. >> >> How about: >> >> 1. More than 1 subprograms are called in a bpf program. >> 2. The tailcalls in the subprograms call the bpf program. >> >> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy >> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case. >> >> As we know, in tail call context, the tail_call_cnt propagates by stack >> and rax register between BPF subprograms. So, propagating tail_call_cnt >> pointer by stack and rax register makes tail_call_cnt as like a global >> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall >> hierarchy cases. >> >> Before jumping to other bpf prog, load tail_call_cnt from the pointer >> and then compare with MAX_TAIL_CALL_CNT. Finally, increment >> tail_call_cnt by its pointer. >> >> But, where does tail_call_cnt store? >> >> It stores on the stack of bpf prog's caller, like >> >> | STACK | >> | | >> | rip | >> +->| tcc | >> | | rip | >> | | rbp | >> | +---------+ RBP >> | | | >> | | | >> | | | >> +--| tcc_ptr | >> | rbx | >> +---------+ RSP >> >> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf >> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx" >> instruction on BPF JIT epilogue"). >> >> Why not back-propagate tail_call_cnt? >> >> It's because it's vulnerable to back-propagate it. It's unable to work >> well with the following case. >> >> int prog1(); >> int prog2(); >> >> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate >> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the >> same time? The answer is NO. Otherwise, there will be a register to be >> polluted, which will make kernel crash. > > Sorry but I keep on reading this explanation and I'm lost what is being > fixed here> > You want to limit the total amount of tail calls that entry prog can do to > MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is > rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is > to protect us from overflowing kernel stack and endless loops. As long a > single call chain doesn't go over 8kB program is fine. Verifier has a > limit of 256 subprogs from what I see. I try to resolve the following-like cases here. +--------- tailcall --+ | | V --> subprog1 -+ entry bpf prog < A --> subprog2 -+ | | +--------- tailcall --+ Without this fixing, the CPU will be stalled because of too many tailcalls. > > Can you elaborate a bit more about the kernel crash you mention in the > last paragraph? We have progs, prog1, prog2, prog3 and prog4, and the scenario: case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3 case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3 case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3 --> subprog4 --tailcall-> prog4 How does prog2 back-propagate tail_call_cnt to prog1? Possible way 1: When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do not handle the value in RCX register, like case2. Possible way 2: Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0 at epilogue when current prog has tailcall? No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt, because prog3 has no tailcall and won't populate RCX register with tail_call_cnt. However, I don't like the back-propagating way. Then, I "burn" my brain to figure out pointer propagating ways. RFC PATCH v1 way: Propagate tail_call_cnt and its pointer together. Then, the pointer is used to check MAX_TAIL_CALL_CNT and increment tail_call_cnt. | STACK | +---------+ RBP | | | | | | +--| tcc_ptr | +->| tcc | | rbx | +---------+ RSP RFC PATCH v2 way (current patchset): Propagate tail_call_cnt pointer only. Then, the pointer is used to check MAX_TAIL_CALL_CNT and increment tail_call_cnt. | STACK | | | | rip | +->| tcc | | | rip | | | rbp | | +---------+ RBP | | | | | | | | | +--| tcc_ptr | | rbx | +---------+ RSP > > I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has > changed in the meantime to 33 and we should adjust the max allowed stack > depth of subprogs? I believe this was brought up at LPC? There's following code snippet in verifier.c: /* protect against potential stack overflow that might happen when * bpf2bpf calls get combined with tailcalls. Limit the caller's stack * depth for such case down to 256 so that the worst case scenario * would result in 8k stack size (32 which is tailcall limit * 256 = * 8k). But, MAX_TAIL_CALL_CNT is 33. This was not brought up at LPC 2022&2023. I don't know whether this was brought up at previous LPCs. Thanks, Leon
On Wed, Dec 06, 2023 at 02:51:28PM +0800, Leon Hwang wrote: > > > On 6/12/23 07:03, Maciej Fijalkowski wrote: > > On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote: > >> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall > >> handling in JIT"), the tailcall on x64 works better than before. > >> > >> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms > >> for x64 JIT"), tailcall is able to run in BPF subprograms on x64. > >> > >> How about: > >> > >> 1. More than 1 subprograms are called in a bpf program. > >> 2. The tailcalls in the subprograms call the bpf program. > >> > >> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy > >> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case. > >> > >> As we know, in tail call context, the tail_call_cnt propagates by stack > >> and rax register between BPF subprograms. So, propagating tail_call_cnt > >> pointer by stack and rax register makes tail_call_cnt as like a global > >> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall > >> hierarchy cases. > >> > >> Before jumping to other bpf prog, load tail_call_cnt from the pointer > >> and then compare with MAX_TAIL_CALL_CNT. Finally, increment > >> tail_call_cnt by its pointer. > >> > >> But, where does tail_call_cnt store? > >> > >> It stores on the stack of bpf prog's caller, like > >> > >> | STACK | > >> | | > >> | rip | > >> +->| tcc | > >> | | rip | > >> | | rbp | > >> | +---------+ RBP > >> | | | > >> | | | > >> | | | > >> +--| tcc_ptr | > >> | rbx | > >> +---------+ RSP > >> > >> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf > >> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx" > >> instruction on BPF JIT epilogue"). > >> > >> Why not back-propagate tail_call_cnt? > >> > >> It's because it's vulnerable to back-propagate it. It's unable to work > >> well with the following case. > >> > >> int prog1(); > >> int prog2(); > >> > >> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate > >> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the > >> same time? The answer is NO. Otherwise, there will be a register to be > >> polluted, which will make kernel crash. > > > > Sorry but I keep on reading this explanation and I'm lost what is being > > fixed here> > > You want to limit the total amount of tail calls that entry prog can do to > > MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is > > rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is > > to protect us from overflowing kernel stack and endless loops. As long a > > single call chain doesn't go over 8kB program is fine. Verifier has a > > limit of 256 subprogs from what I see. > > I try to resolve the following-like cases here. > > +--------- tailcall --+ > | | > V --> subprog1 -+ > entry bpf prog < > A --> subprog2 -+ > | | > +--------- tailcall --+ > > Without this fixing, the CPU will be stalled because of too many tailcalls. You still didn't explain why CPU might stall :/ if you stumble upon any kernel splats it's good habit to include them. So, for future readers of this thread, I reduced one of your examples down to: #include <linux/bpf.h> #include <bpf/bpf_helpers.h> #include "bpf_legacy.h" struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 1); __uint(key_size, sizeof(__u32)); __uint(value_size, sizeof(__u32)); } jmp_table SEC(".maps"); static __noinline int subprog_tail(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } SEC("tc") int entry(struct __sk_buff *skb) { subprog_tail(skb); return 0; } char __license[] SEC("license") = "GPL"; and indeed CPU got stuck: [10623.892978] RIP: 0010:bpf_prog_6abcef0aca85f2fd_subprog_tail+0x49/0x59 [10623.899606] Code: 39 56 24 76 30 8b 85 fc ff ff ff 83 f8 21 73 25 83 c0 01 89 85 fc ff ff ff 48 8b 8c d6 10 01 00 00 48 85 c9 74 0f 41 5d 5b 58 <48> 8b 49 30 48 83 c1 0b ff e1 cc 41 5d 5b c9 c3 cc cc cc cc cc cc [10623.918646] RSP: 0018:ffffc900202b78e0 EFLAGS: 00000286 [10623.923952] RAX: 0000002100000000 RBX: ffff8881088aa100 RCX: ffffc9000d319000 [10623.931194] RDX: 0000000000000000 RSI: ffff88811e6d2e00 RDI: ffff8881088aa100 [10623.938439] RBP: ffffc900202b78e0 R08: ffffc900202b7dd4 R09: 0000000000000000 [10623.945686] R10: 736391e000000000 R11: 0000000000000000 R12: 0000000000000001 [10623.952927] R13: ffffc9000d38d048 R14: ffffc900202b7dd4 R15: ffffffff81ae756f [10623.960174] ? bpf_test_run+0xef/0x320 [10623.963988] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10623.969826] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10623.975662] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10623.981500] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10623.987334] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10623.993174] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10623.999016] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.004854] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.010688] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.016529] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.022371] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.028209] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.034043] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.043211] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.052344] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.061453] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.070404] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.079200] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.087841] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.096333] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.104680] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.112886] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.121014] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.129048] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.136949] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.144716] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.152479] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.160209] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.167905] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.175565] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 [10624.183173] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.190738] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 [10624.198262] bpf_test_run+0x186/0x320 [10624.203677] ? kmem_cache_alloc+0x14d/0x2c0 [10624.209615] bpf_prog_test_run_skb+0x35c/0x710 [10624.215810] __sys_bpf+0xf3a/0x2d80 [10624.221068] ? do_mmap+0x409/0x5e0 [10624.226217] __x64_sys_bpf+0x1a/0x30 [10624.231693] do_syscall_64+0x2e/0xa0 [10624.236988] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [10624.243550] RIP: 0033:0x7fd76171e69d [10624.248576] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 63 a7 0f 00 f7 d8 64 89 01 48 [10624.270613] RSP: 002b:00007ffe694f8d68 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [10624.279894] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd76171e69d [10624.288769] RDX: 0000000000000050 RSI: 00007ffe694f8db0 RDI: 000000000000000a [10624.297634] RBP: 00007ffe694f8d80 R08: 00007ffe694f8d37 R09: 00007ffe694f8db0 [10624.306455] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe694f9348 [10624.315260] R13: 000055bb8423f6d9 R14: 000055bb8496ac90 R15: 00007fd761925040 Looking at assembly code: entry: ffffffffc0012c60 <load3+0x12c60>: ffffffffc0012c60: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0012c65: 31 c0 xor %eax,%eax ffffffffc0012c67: 55 push %rbp ffffffffc0012c68: 48 89 e5 mov %rsp,%rbp ffffffffc0012c6b: 50 push %rax ffffffffc0012c6c: 48 8b 85 f8 ff ff ff mov -0x8(%rbp),%rax ffffffffc0012c73: e8 30 00 00 00 call 0xffffffffc0012ca8 ffffffffc0012c78: 31 c0 xor %eax,%eax ffffffffc0012c7a: c9 leave ffffffffc0012c7b: c3 ret subprog_tail: ffffffffc0012ca8 <load3+0x12ca8>: ffffffffc0012ca8: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0012cad: 66 90 xchg %ax,%ax ffffffffc0012caf: 55 push %rbp ffffffffc0012cb0: 48 89 e5 mov %rsp,%rbp ffffffffc0012cb3: 50 push %rax ffffffffc0012cb4: 53 push %rbx ffffffffc0012cb5: 41 55 push %r13 ffffffffc0012cb7: 48 89 fb mov %rdi,%rbx ffffffffc0012cba: 49 bd 00 7c 73 0c 81 movabs $0xffff88810c737c00,%r13 ffffffffc0012cc1: 88 ff ff ffffffffc0012cc4: 48 89 df mov %rbx,%rdi ffffffffc0012cc7: 4c 89 ee mov %r13,%rsi ffffffffc0012cca: 31 d2 xor %edx,%edx ffffffffc0012ccc: 8b 85 fc ff ff ff mov -0x4(%rbp),%eax ffffffffc0012cd2: 83 f8 21 cmp $0x21,%eax ffffffffc0012cd5: 73 17 jae 0xffffffffc0012cee ffffffffc0012cd7: 83 c0 01 add $0x1,%eax ffffffffc0012cda: 89 85 fc ff ff ff mov %eax,-0x4(%rbp) ffffffffc0012ce0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0012ce5: 41 5d pop %r13 ffffffffc0012ce7: 5b pop %rbx ffffffffc0012ce8: 58 pop %rax ffffffffc0012ce9: e9 7d ff ff ff jmp 0xffffffffc0012c6b ffffffffc0012cee: 41 5d pop %r13 ffffffffc0012cf0: 5b pop %rbx ffffffffc0012cf1: c9 leave ffffffffc0012cf2: c3 ret I am trying to understand what are writing to %rax when entry is target of tailcall. rbp comes from subprog_tail and we popped all of the previously pushed regs. So what is exactly causing this hang? (I thought I'd rather respond just to let you know I'm on it) > > > > > Can you elaborate a bit more about the kernel crash you mention in the > > last paragraph? > > We have progs, prog1, prog2, prog3 and prog4, and the scenario: > > case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3 > case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3 > case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3 > --> subprog4 --tailcall-> prog4 > > How does prog2 back-propagate tail_call_cnt to prog1? > > Possible way 1: > When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to > back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do > not handle the value in RCX register, like case2. > > Possible way 2: > Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0 > at epilogue when current prog has tailcall? > No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt, > because prog3 has no tailcall and won't populate RCX register with tail_call_cnt. I don't get it. Let's start with small example and do the baby steps. > > However, I don't like the back-propagating way. Then, I "burn" my brain to figure > out pointer propagating ways. I know that code can damage programmer's brain, that's why we need to strive for elaborative and easy to understand commit messages so that later on we will be able to pick up what is going on here. > > RFC PATCH v1 way: > Propagate tail_call_cnt and its pointer together. Then, the pointer is used to > check MAX_TAIL_CALL_CNT and increment tail_call_cnt. > > | STACK | > +---------+ RBP > | | > | | > | | > +--| tcc_ptr | > +->| tcc | > | rbx | > +---------+ RSP > > RFC PATCH v2 way (current patchset): > Propagate tail_call_cnt pointer only. Then, the pointer is used to check > MAX_TAIL_CALL_CNT and increment tail_call_cnt. > > | STACK | > | | > | rip | > +->| tcc | > | | rip | > | | rbp | > | +---------+ RBP > | | | > | | | > | | | > +--| tcc_ptr | > | rbx | > +---------+ RSP > > > > > I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has > > changed in the meantime to 33 and we should adjust the max allowed stack > > depth of subprogs? I believe this was brought up at LPC? > > There's following code snippet in verifier.c: > > /* protect against potential stack overflow that might happen when > * bpf2bpf calls get combined with tailcalls. Limit the caller's stack > * depth for such case down to 256 so that the worst case scenario > * would result in 8k stack size (32 which is tailcall limit * 256 = > * 8k). > > But, MAX_TAIL_CALL_CNT is 33. > > This was not brought up at LPC 2022&2023. I don't know whether this was > brought up at previous LPCs. Was referring to this: https://lpc.events/event/17/contributions/1595/attachments/1230/2506/LPC2023_slides.pdf but let us focus on issue you're bringing up here in this patch. I brought this up thinking JIT code was fine, now it's clear that things go south when entry prog is tailcall target, which we didn't test. > > Thanks, > Leon >
On 12/12/23 02:02, Maciej Fijalkowski wrote: > On Wed, Dec 06, 2023 at 02:51:28PM +0800, Leon Hwang wrote: >> >> >> On 6/12/23 07:03, Maciej Fijalkowski wrote: >>> On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote: >>>> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall >>>> handling in JIT"), the tailcall on x64 works better than before. >>>> >>>> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms >>>> for x64 JIT"), tailcall is able to run in BPF subprograms on x64. >>>> >>>> How about: >>>> >>>> 1. More than 1 subprograms are called in a bpf program. >>>> 2. The tailcalls in the subprograms call the bpf program. >>>> >>>> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy >>>> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case. >>>> >>>> As we know, in tail call context, the tail_call_cnt propagates by stack >>>> and rax register between BPF subprograms. So, propagating tail_call_cnt >>>> pointer by stack and rax register makes tail_call_cnt as like a global >>>> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall >>>> hierarchy cases. >>>> >>>> Before jumping to other bpf prog, load tail_call_cnt from the pointer >>>> and then compare with MAX_TAIL_CALL_CNT. Finally, increment >>>> tail_call_cnt by its pointer. >>>> >>>> But, where does tail_call_cnt store? >>>> >>>> It stores on the stack of bpf prog's caller, like >>>> >>>> | STACK | >>>> | | >>>> | rip | >>>> +->| tcc | >>>> | | rip | >>>> | | rbp | >>>> | +---------+ RBP >>>> | | | >>>> | | | >>>> | | | >>>> +--| tcc_ptr | >>>> | rbx | >>>> +---------+ RSP >>>> >>>> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf >>>> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx" >>>> instruction on BPF JIT epilogue"). >>>> >>>> Why not back-propagate tail_call_cnt? >>>> >>>> It's because it's vulnerable to back-propagate it. It's unable to work >>>> well with the following case. >>>> >>>> int prog1(); >>>> int prog2(); >>>> >>>> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate >>>> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the >>>> same time? The answer is NO. Otherwise, there will be a register to be >>>> polluted, which will make kernel crash. >>> >>> Sorry but I keep on reading this explanation and I'm lost what is being >>> fixed here> >>> You want to limit the total amount of tail calls that entry prog can do to >>> MAX_TAIL_CALL_CNT. Although I was working on that, my knowledge here is >>> rusty, therefore my view might be distorted :) to me MAX_TAIL_CALL_CNT is >>> to protect us from overflowing kernel stack and endless loops. As long a >>> single call chain doesn't go over 8kB program is fine. Verifier has a >>> limit of 256 subprogs from what I see. >> >> I try to resolve the following-like cases here. >> >> +--------- tailcall --+ >> | | >> V --> subprog1 -+ >> entry bpf prog < >> A --> subprog2 -+ >> | | >> +--------- tailcall --+ >> >> Without this fixing, the CPU will be stalled because of too many tailcalls. > > You still didn't explain why CPU might stall :/ if you stumble upon any > kernel splats it's good habit to include them. So, for future readers of > this thread, I reduced one of your examples down to: > Thanks for your help. Here is a GitHub CI history of this patchset: https://github.com/kernel-patches/bpf/pull/5807/checks In this CI results, the test_progs failed to run on aarch64 and s390x because of "rcu: INFO: rcu_sched self-detected stall on CPU". But why does CPU stall? It's because there are too many tailcalls to run on the bpf prog. How many tailcalls? Why MAX_TAIL_CALL_CNT limit does not work for those cases? Let's have a look at a simple case: #include <linux/bpf.h> #include <bpf/bpf_helpers.h> #include "bpf_legacy.h" struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 1); __uint(key_size, sizeof(__u32)); __uint(value_size, sizeof(__u32)); } jmp_table SEC(".maps"); int count = 0; static __noinline int subprog_tail(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } SEC("tc") int entry(struct __sk_buff *skb) { volatile int ret = 1; count++; subprog_tail(skb); /* subprog call1 */ subprog_tail(skb); /* subprog call2 */ return ret; } char __license[] SEC("license") = "GPL"; And the entry bpf prog is populated to the 0th slot of jmp_table. Then, what happens when entry bpf prog runs? At the very first time when subprog_tail() is called, subprog_tail() does tailcall the entry bpf prog. Then, subprog_taill() is called at second time at the position subprog call1, and it tailcalls the entry bpf prog again. Then, again and again. At the very first time when MAX_TAIL_CALL_CNT limit works, subprog_tail() has been called for 34 times at the position subprog call1. And at this time, the tail_call_cnt is 33 in subprog_tail(). Next, the 34th subprog_tail() returns to entry() because of MAX_TAIL_CALL_CNT limit. In entry(), the 34th entry(), at the time after the 34th subprog_tail() at the position subprog call1 finishes and before the 1st subprog_tail() at the position subprog call2 calls in entry(), what's the value of tail_call_cnt in entry()? It's 33. As we know, tail_all_cnt is pushed on the stack of entry(), and propagates to subprog_tail() by %rax from stack. Then, at the time when subprog_tail() at the position subprog call2 is called for its first time, tail_call_cnt 33 propagates to subprog_tail() by %rax. And the tailcall in subprog_tail() is aborted because of tail_call_cnt >= MAX_TAIL_CALL_CNT too. Then, subprog_tail() at the position subprog call2 ends, and the 34th entry() ends. And it returns to the 33rd subprog_tail() called from the position subprog call1. But wait, at this time, what's the value of tail_call_cnt under the stack of subprog_tail()? It's 33. Then, in the 33rd entry(), at the time after the 33th subprog_tail() at the position subprog call1 finishes and before the 2nd subprog_tail() at the position subprog call2 calls, what's the value of tail_call_cnt in current entry()? It's **32**. Why not 33? Before stepping into subprog_tail() at the position subprog call2 in 33rd entry(), like stopping the time, let's have a look at the stack memory: | STACK | +---------+ RBP <-- current rbp | ret | STACK of 33rd entry() | tcc | its value is 32 +---------+ RSP <-- current rsp | rip | STACK of 34rd entry() | rbp | reuse the STACK of 33rd subprog_tail() at the position | ret | subprog call1 | tcc | its value is 33 +---------+ rsp | rip | STACK of 1st subprog_tail() at the position subprog call2 | rbp | | tcc | its value is 33 +---------+ rsp Why not 33? It's because tail_call_cnt does not back-propagate from subprog_tail() to entry(). Then, while stepping into subprog_tail() at the position subprog call2 in 33rd entry(): | STACK | +---------+ | ret | STACK of 33rd entry() | tcc | its value is 32 | rip | | rbp | +---------+ RBP <-- current rbp | tcc | its value is 32; STACK of subprog_tail() at the position +---------+ RSP <-- current rsp subprog call2 Then, while pausing after tailcalling in 2nd subprog_tail() at the position subprog call2: | STACK | +---------+ | ret | STACK of 33rd entry() | tcc | its value is 32 | rip | | rbp | +---------+ RBP <-- current rbp | tcc | its value is 33; STACK of subprog_tail() at the position +---------+ RSP <-- current rsp subprog call2 Note: what happens to tail_call_cnt: /* * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) * goto out; */ It's to check >= MAX_TAIL_CALL_CNT first and then increment tail_call_cnt. So, current tailcall is allowed to run. Then, entry() is tailcalled. And the stack memory status is: | STACK | +---------+ | ret | STACK of 33rd entry() | tcc | its value is 32 | rip | | rbp | +---------+ RBP <-- current rbp | ret | STACK of 35th entry(); reuse STACK of subprog_tail() at the | tcc | its value is 33 the position subprog call2 +---------+ RSP <-- current rsp So, the tailcalls in the 35th entry() will be aborted. And, ..., again and again. :( And, I hope you have understood the reason why MAX_TAIL_CALL_CNT limit does not work for those cases. And, how many tailcalls are there if CPU does not stall? Let's count it. 1+1+1+...+1=? Let's build a math model to calculate the total tailcall count. Sorry, I'm poor at math. How about top-down view? Does it look like hierarchy layer and layer? I think it is a hierarchy layer model with 2+4+8+...+2**33 tailcalls. As a result, if CPU does not stall, there will be 2**34 - 2 = 17,179,869,182 tailcalls. That's the guy which makes CPU stalled. What about there are N subprog_tail() in entry()? If CPU does not stall because of too many tailcalls, there will be almost N**34 tailcalls. > #include <linux/bpf.h> > #include <bpf/bpf_helpers.h> > #include "bpf_legacy.h" > > struct { > __uint(type, BPF_MAP_TYPE_PROG_ARRAY); > __uint(max_entries, 1); > __uint(key_size, sizeof(__u32)); > __uint(value_size, sizeof(__u32)); > } jmp_table SEC(".maps"); > > static __noinline > int subprog_tail(struct __sk_buff *skb) > { > bpf_tail_call_static(skb, &jmp_table, 0); > return 0; > } > > SEC("tc") > int entry(struct __sk_buff *skb) > { > subprog_tail(skb); > return 0; > } > > char __license[] SEC("license") = "GPL"; > > and indeed CPU got stuck: > > [10623.892978] RIP: 0010:bpf_prog_6abcef0aca85f2fd_subprog_tail+0x49/0x59 > [10623.899606] Code: 39 56 24 76 30 8b 85 fc ff ff ff 83 f8 21 73 25 83 c0 01 89 85 fc ff ff ff 48 8b 8c d6 10 01 00 00 48 85 c9 74 0f 41 5d 5b 58 <48> 8b 49 30 48 83 c1 0b ff e1 cc 41 5d 5b c9 c3 cc cc cc cc cc cc > [10623.918646] RSP: 0018:ffffc900202b78e0 EFLAGS: 00000286 > [10623.923952] RAX: 0000002100000000 RBX: ffff8881088aa100 RCX: ffffc9000d319000 > [10623.931194] RDX: 0000000000000000 RSI: ffff88811e6d2e00 RDI: ffff8881088aa100 > [10623.938439] RBP: ffffc900202b78e0 R08: ffffc900202b7dd4 R09: 0000000000000000 > [10623.945686] R10: 736391e000000000 R11: 0000000000000000 R12: 0000000000000001 > [10623.952927] R13: ffffc9000d38d048 R14: ffffc900202b7dd4 R15: ffffffff81ae756f > [10623.960174] ? bpf_test_run+0xef/0x320 > [10623.963988] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10623.969826] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10623.975662] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10623.981500] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10623.987334] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10623.993174] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10623.999016] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.004854] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.010688] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.016529] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.022371] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.028209] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.034043] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.043211] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.052344] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.061453] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.070404] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.079200] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.087841] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.096333] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.104680] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.112886] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.121014] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.129048] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.136949] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.144716] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.152479] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.160209] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.167905] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.175565] bpf_prog_6d15fffd34d84bbf_classifier_0+0x6f/0x77 > [10624.183173] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.190738] bpf_prog_6d15fffd34d84bbf_classifier_0+0x56/0x77 > [10624.198262] bpf_test_run+0x186/0x320 > [10624.203677] ? kmem_cache_alloc+0x14d/0x2c0 > [10624.209615] bpf_prog_test_run_skb+0x35c/0x710 > [10624.215810] __sys_bpf+0xf3a/0x2d80 > [10624.221068] ? do_mmap+0x409/0x5e0 > [10624.226217] __x64_sys_bpf+0x1a/0x30 > [10624.231693] do_syscall_64+0x2e/0xa0 > [10624.236988] entry_SYSCALL_64_after_hwframe+0x6e/0x76 > [10624.243550] RIP: 0033:0x7fd76171e69d > [10624.248576] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 63 a7 0f 00 f7 d8 64 89 01 48 > [10624.270613] RSP: 002b:00007ffe694f8d68 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 > [10624.279894] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd76171e69d > [10624.288769] RDX: 0000000000000050 RSI: 00007ffe694f8db0 RDI: 000000000000000a > [10624.297634] RBP: 00007ffe694f8d80 R08: 00007ffe694f8d37 R09: 00007ffe694f8db0 > [10624.306455] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe694f9348 > [10624.315260] R13: 000055bb8423f6d9 R14: 000055bb8496ac90 R15: 00007fd761925040 > > Looking at assembly code: > > entry: > ffffffffc0012c60 <load3+0x12c60>: > ffffffffc0012c60: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > ffffffffc0012c65: 31 c0 xor %eax,%eax > ffffffffc0012c67: 55 push %rbp > ffffffffc0012c68: 48 89 e5 mov %rsp,%rbp > ffffffffc0012c6b: 50 push %rax > ffffffffc0012c6c: 48 8b 85 f8 ff ff ff mov -0x8(%rbp),%rax > ffffffffc0012c73: e8 30 00 00 00 call 0xffffffffc0012ca8 > ffffffffc0012c78: 31 c0 xor %eax,%eax > ffffffffc0012c7a: c9 leave > ffffffffc0012c7b: c3 ret > > subprog_tail: > ffffffffc0012ca8 <load3+0x12ca8>: > ffffffffc0012ca8: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > ffffffffc0012cad: 66 90 xchg %ax,%ax > ffffffffc0012caf: 55 push %rbp > ffffffffc0012cb0: 48 89 e5 mov %rsp,%rbp > ffffffffc0012cb3: 50 push %rax > ffffffffc0012cb4: 53 push %rbx > ffffffffc0012cb5: 41 55 push %r13 > ffffffffc0012cb7: 48 89 fb mov %rdi,%rbx > ffffffffc0012cba: 49 bd 00 7c 73 0c 81 movabs $0xffff88810c737c00,%r13 > ffffffffc0012cc1: 88 ff ff > ffffffffc0012cc4: 48 89 df mov %rbx,%rdi > ffffffffc0012cc7: 4c 89 ee mov %r13,%rsi > ffffffffc0012cca: 31 d2 xor %edx,%edx > ffffffffc0012ccc: 8b 85 fc ff ff ff mov -0x4(%rbp),%eax > ffffffffc0012cd2: 83 f8 21 cmp $0x21,%eax > ffffffffc0012cd5: 73 17 jae 0xffffffffc0012cee > ffffffffc0012cd7: 83 c0 01 add $0x1,%eax > ffffffffc0012cda: 89 85 fc ff ff ff mov %eax,-0x4(%rbp) > ffffffffc0012ce0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > ffffffffc0012ce5: 41 5d pop %r13 > ffffffffc0012ce7: 5b pop %rbx > ffffffffc0012ce8: 58 pop %rax > ffffffffc0012ce9: e9 7d ff ff ff jmp 0xffffffffc0012c6b > ffffffffc0012cee: 41 5d pop %r13 > ffffffffc0012cf0: 5b pop %rbx > ffffffffc0012cf1: c9 leave > ffffffffc0012cf2: c3 ret > > I am trying to understand what are writing to %rax when entry is target of > tailcall. rbp comes from subprog_tail and we popped all of the previously > pushed regs. So what is exactly causing this hang? > Oh, your example is wierd to make CPU stuck. I don't know why, either. > (I thought I'd rather respond just to let you know I'm on it) > Thanks for your reminding. >> >>> >>> Can you elaborate a bit more about the kernel crash you mention in the >>> last paragraph? >> >> We have progs, prog1, prog2, prog3 and prog4, and the scenario: >> >> case1: kprobe1 --> prog1 --> subprog1 --tailcall-> prog2 --> subprog2 --tailcall-> prog3 >> case2: kprobe2 --> prog2 --> subprog2 --tailcall-> prog3 >> case3: kprobe3 --> prog4 --> subprog3 --tailcall-> prog3 >> --> subprog4 --tailcall-> prog4 >> >> How does prog2 back-propagate tail_call_cnt to prog1? >> >> Possible way 1: >> When prog2 and prog3 are added to PROG_ARRAY, poke their epilogues to >> back-propagate tail_call_cnt by RCX register. It seems OK because kprobes do >> not handle the value in RCX register, like case2. >> >> Possible way 2: >> Can back-propagate tail_call_cnt with RCX register by checking tail_call_cnt != 0 >> at epilogue when current prog has tailcall? >> No. As for case1, prog2 handles the value in RCX register, which is not tail_call_cnt, >> because prog3 has no tailcall and won't populate RCX register with tail_call_cnt. > > I don't get it. Let's start with small example and do the baby steps. > I lost the details of back-propagating solutions which I tried hard to convince myself the solutions were not good enough as my expectation. Sorry about that. >> >> However, I don't like the back-propagating way. Then, I "burn" my brain to figure >> out pointer propagating ways. > > I know that code can damage programmer's brain, that's why we need to > strive for elaborative and easy to understand commit messages so that > later on we will be able to pick up what is going on here. > Yeah, it's not easy to understand the code of tailcall on arch, like x86. I'm trying hard to explain this patch. >> >> RFC PATCH v1 way: >> Propagate tail_call_cnt and its pointer together. Then, the pointer is used to >> check MAX_TAIL_CALL_CNT and increment tail_call_cnt. >> >> | STACK | >> +---------+ RBP >> | | >> | | >> | | >> +--| tcc_ptr | >> +->| tcc | >> | rbx | >> +---------+ RSP >> >> RFC PATCH v2 way (current patchset): >> Propagate tail_call_cnt pointer only. Then, the pointer is used to check >> MAX_TAIL_CALL_CNT and increment tail_call_cnt. >> >> | STACK | >> | | >> | rip | >> +->| tcc | >> | | rip | >> | | rbp | >> | +---------+ RBP >> | | | >> | | | >> | | | >> +--| tcc_ptr | >> | rbx | >> +---------+ RSP >> It's time to describe the details of this patch. And take a look at how it handles the above simple case. This patch is to keep tail_call_cnt on the stack of entry bpf prog's caller, and then propagate tail_call_cnt pointer like the way of the previous tail_call_cnt. So, at the epilogue of entry bpf prog, before pushing %rbp, it has to initialise tail_call_cnt and to push it to stack by using %rax. Then, make %rax as the pointer that points to tail_call_cnt, by "mov rax, rsp". Then, call the main part of the entry bpf prog by "call 2". **This is the exceptional point.** With this "call", %rip is pushed to stack. And at the end of the entry bpf prog runtime, the %rip is popped from stack; then, pop tail_call_cnt from stack too; and finally "ret" again. The popping tail_call_cnt and "ret" is the 2 in "call 2". So, when "call 2", %rax is the tail_call_cnt pointer. And then, tail_call_cnt pointer will be pushed to stack. This is the way to push %rax when entry bpf prog is tailcalled, too. So, when a subprog is called, tail_call_cnt pointer is popped from stack to %rax. And when a tailcall happens, load tail_call_cnt pointer from stack to %rax by "mov rax, qword ptr [rbp - tcc_ptr_off]", and compare tail_call_cnt with MAX_TAIL_CALL_CNT by "cmp dword ptr [rax], MAX_TAIL_CALL_CNT", and then increment tail_call_cnt by "add dword ptr [rax], 1". Finally, when pop %rax, it's to pop tail_call_cnt pointer from stack to %rax. When the epilogue of entry() runs, the stack of entry() should be like: | STACK | STACK of entry()'s caller | | | rip | +->| tcc | its value is 0 | | rip | | | rbp | | +---------+ RBP | | ret | STACK of entry() +--| tcc_ptr | | rbx | saved regs +---------+ RSP Then, when subprog_tail() is called for its very first time, its stack should be like: | STACK | STACK of entry()'s caller | | | rip | +->| tcc | its value is 0 | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry() +--| tcc_ptr | | | rbx | saved regs | | rip | | | rbp | | +---------+ RBP +--| tcc_ptr | STACK of subprog_tail() +---------+ RSP Then, when subprog_tail() tailcalls entry(): | STACK | STACK of entry()'s caller | | | rip | +->| tcc | its value is 1 | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry() +--| tcc_ptr | | | rbx | saved regs | | rip | | | rbp | | +---------+ RBP | | ret | STACK of entry(), reuse STACK of subprog_tail() +--| tcc_ptr | +---------+ RSP Then, when entry() calls subprog_tail(): | STACK | STACK of entry()'s caller | | | rip | +->| tcc | its value is 1 | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry() +--| tcc_ptr | | | rbx | saved regs | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry(), reuse STACK of subprog_tail() +--| tcc_ptr | | | rip | | | rbp | | +---------+ RBP +--| tcc_ptr | STACK of subprog_tail() +---------+ RSP Then, when subprog_tail() tailcalls entry(): | STACK | STACK of entry()'s caller | | | rip | +->| tcc | its value is 2 | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry() +--| tcc_ptr | | | rbx | saved regs | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry(), reuse STACK of subprog_tail() +--| tcc_ptr | | | rip | | | rbp | | +---------+ RBP | | ret | STACK of entry(), reuse STACK of subprog_tail() +--| tcc_ptr | +---------+ RSP Then, again and again. At the very first time when MAX_TAIL_CALL_CNT limit works, subprog_tail() has been called for 34 times at the position subprog call1. And at this time, the stack should be like: | STACK | STACK of entry()'s caller | | | rip | +->| tcc | its value is 33 | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry() +--| tcc_ptr | | | rbx | saved regs | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry(), reuse STACK of subprog_tail() +--| tcc_ptr | | | rip | | | rbp | | +---------+ rbp | | ret | STACK of entry(), reuse STACK of subprog_tail() +--| tcc_ptr | | | rip | | | rbp | | +---------+ rbp | | * | | | * | | | * | | +---------+ RBP +--| tcc_ptr | STACK of subprog_tail() +---------+ RSP After this time, the tailcalls in the future will be aborted because tail_call_cnt has been 33, which reaches its MAX_TAIL_CALL_CNT limit. That's the way how this patch works. It's really nice if you reach here. I hope you have a clear idea to understand this patch by reading above replys. >>> >>> I also realized that verifier assumes MAX_TAIL_CALL_CNT as 32 which has >>> changed in the meantime to 33 and we should adjust the max allowed stack >>> depth of subprogs? I believe this was brought up at LPC? >> >> There's following code snippet in verifier.c: >> >> /* protect against potential stack overflow that might happen when >> * bpf2bpf calls get combined with tailcalls. Limit the caller's stack >> * depth for such case down to 256 so that the worst case scenario >> * would result in 8k stack size (32 which is tailcall limit * 256 = >> * 8k). >> >> But, MAX_TAIL_CALL_CNT is 33. >> >> This was not brought up at LPC 2022&2023. I don't know whether this was >> brought up at previous LPCs. > > Was referring to this: > https://lpc.events/event/17/contributions/1595/attachments/1230/2506/LPC2023_slides.pdf > That's interesting. I tried to overflow the stack of one bpf prog, whose max stack size limits to 8k, with fexit. I failed because the stack size of thread is not so that small. But, exceptionally, it was able to compose a tailcall infinite loop with fexit, which broke the tail_call_cnt propagating chain because it did not pass through the fexit's trampoline. Thanks, Leon > but let us focus on issue you're bringing up here in this patch. I brought > this up thinking JIT code was fine, now it's clear that things go south > when entry prog is tailcall target, which we didn't test. > >> >> Thanks, >> Leon >>
On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote: > From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall > handling in JIT"), the tailcall on x64 works better than before. > > From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms > for x64 JIT"), tailcall is able to run in BPF subprograms on x64. > > How about: > > 1. More than 1 subprograms are called in a bpf program. > 2. The tailcalls in the subprograms call the bpf program. > > Because of missing tail_call_cnt back-propagation, a tailcall hierarchy > comes up. And MAX_TAIL_CALL_CNT limit does not work for this case. > > As we know, in tail call context, the tail_call_cnt propagates by stack > and rax register between BPF subprograms. So, propagating tail_call_cnt > pointer by stack and rax register makes tail_call_cnt as like a global > variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall > hierarchy cases. > > Before jumping to other bpf prog, load tail_call_cnt from the pointer > and then compare with MAX_TAIL_CALL_CNT. Finally, increment > tail_call_cnt by its pointer. > > But, where does tail_call_cnt store? > > It stores on the stack of bpf prog's caller, like > > | STACK | > | | > | rip | > +->| tcc | > | | rip | > | | rbp | > | +---------+ RBP > | | | > | | | > | | | > +--| tcc_ptr | > | rbx | > +---------+ RSP > > And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf > prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx" > instruction on BPF JIT epilogue"). > > Why not back-propagate tail_call_cnt? > > It's because it's vulnerable to back-propagate it. It's unable to work > well with the following case. > > int prog1(); > int prog2(); > > prog1 is tail caller, and prog2 is tail callee. If we do back-propagate > tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the > same time? The answer is NO. Otherwise, there will be a register to be > polluted, which will make kernel crash. > > Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") > Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") > Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> > --- > arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++--------------- > 1 file changed, 24 insertions(+), 16 deletions(-) > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > index c2a0465d37da4..36631129cc800 100644 > --- a/arch/x86/net/bpf_jit_comp.c > +++ b/arch/x86/net/bpf_jit_comp.c > @@ -256,7 +256,7 @@ struct jit_context { > /* Number of bytes emit_patch() needs to generate instructions */ > #define X86_PATCH_SIZE 5 > /* Number of bytes that will be skipped on tailcall */ > -#define X86_TAIL_CALL_OFFSET (11 + ENDBR_INSN_SIZE) > +#define X86_TAIL_CALL_OFFSET (22 + ENDBR_INSN_SIZE) > > static void push_r12(u8 **pprog) > { > @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > EMIT_ENDBR(); > emit_nops(&prog, X86_PATCH_SIZE); > if (!ebpf_from_cbpf) { > - if (tail_call_reachable && !is_subprog) > + if (tail_call_reachable && !is_subprog) { > /* When it's the entry of the whole tailcall context, > * zeroing rax means initialising tail_call_cnt. > */ > - EMIT2(0x31, 0xC0); /* xor eax, eax */ > - else > - /* Keep the same instruction layout. */ > - EMIT2(0x66, 0x90); /* nop2 */ > + EMIT2(0x31, 0xC0); /* xor eax, eax */ > + EMIT1(0x50); /* push rax */ > + /* Make rax as ptr that points to tail_call_cnt. */ > + EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */ > + EMIT1_off32(0xE8, 2); /* call main prog */ > + EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */ > + EMIT1(0xC3); /* ret */ > + } else { > + /* Keep the same instruction size. */ > + emit_nops(&prog, 13); > + } At first sight it seemed to me too invasive but after trying out few other approaches in the end it is elegant. I wanted to avoid a bit puzzling call insn in the prologue with a following prologue layout (this will be based on entry prog from tailcall_bpf2bpf3.c that was the first one to break): ffffffffc0012cb4: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0012cb9: 55 push %rbp ffffffffc0012cba: 48 89 e5 mov %rsp,%rbp ffffffffc0012cbd: 48 83 ec 10 sub $0x10,%rsp ffffffffc0012cc1: 48 89 65 f8 mov %rsp,-0x8(%rbp) ffffffffc0012cc5: 48 c7 04 24 00 00 00 movq $0x0,(%rsp) ffffffffc0012ccc: 00 ffffffffc0012ccd: 48 8b 45 f8 mov -0x8(%rbp),%rax ffffffffc0012cd1: 50 push %rax ffffffffc0012cd2: 48 81 ec 80 00 00 00 sub $0x80,%rsp So we would have hidden 16 bytes on stack at the *beginning* of entry stack frame. First thing right after rbp would be tcc pointer so referring to it wouldn't require us to take into account stack depth. But then if we follow with rest of insns: ffffffffc0012cd9: 31 f6 xor %esi,%esi ffffffffc0012cdb: 48 89 75 f8 mov %rsi,-0x8(%rbp) // BUG, overwrite of tcc ptr ffffffffc0012cdf: 48 89 75 f0 mov %rsi,-0x10(%rbp) ffffffffc0012ce3: 48 89 75 e8 mov %rsi,-0x18(%rbp) ffffffffc0012ce7: 48 89 75 e0 mov %rsi,-0x20(%rbp) ffffffffc0012ceb: 48 89 75 d8 mov %rsi,-0x28(%rbp) ffffffffc0012cef: 48 89 75 d0 mov %rsi,-0x30(%rbp) ffffffffc0012cf3: 48 89 75 c8 mov %rsi,-0x38(%rbp) ffffffffc0012cf7: 48 89 75 c0 mov %rsi,-0x40(%rbp) ffffffffc0012cfb: 48 89 75 b8 mov %rsi,-0x48(%rbp) ffffffffc0012cff: 48 89 75 b0 mov %rsi,-0x50(%rbp) ffffffffc0012d03: 48 89 75 a8 mov %rsi,-0x58(%rbp) ffffffffc0012d07: 48 89 75 a0 mov %rsi,-0x60(%rbp) ffffffffc0012d0b: 48 89 75 98 mov %rsi,-0x68(%rbp) ffffffffc0012d0f: 48 89 75 90 mov %rsi,-0x70(%rbp) ffffffffc0012d13: 48 89 75 88 mov %rsi,-0x78(%rbp) ffffffffc0012d17: 48 89 75 80 mov %rsi,-0x80(%rbp) ffffffffc0012d1b: 48 0f b6 75 ff movzbq -0x1(%rbp),%rsi ffffffffc0012d20: 40 88 75 ff mov %sil,-0x1(%rbp) ffffffffc0012d24: 48 8b 85 f8 ff ff ff mov -0x8(%rbp),%rax ffffffffc0012d2b: e8 30 00 00 00 call 0xffffffffc0012d60 ffffffffc0012d30: c9 leave ffffffffc0012d31: c3 ret So even though it would seem more obvious while looking at prologue what is the intent behind it, this would require us to patch the instructions that make us of R10/stack, which in the end would be way more invasive. After all, for x86 JIT code: Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> but it is a must to have a better commit message here. Thanks! > } > /* Exception callback receives FP as third parameter */ > if (is_exception_cb) { > @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > if (stack_depth) > EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8)); > if (tail_call_reachable) > + /* Here, rax is tail_call_cnt_ptr. */ > EMIT1(0x50); /* push rax */ > *pprog = prog; > } > @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, > u32 stack_depth, u8 *ip, > struct jit_context *ctx) > { > - int tcc_off = -4 - round_up(stack_depth, 8); > + int tcc_ptr_off = -8 - round_up(stack_depth, 8); > u8 *prog = *pprog, *start = *pprog; > int offset; > > @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, > * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) > * goto out; > */ > - EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ > - EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ > + EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */ > + EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT); /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */ > > offset = ctx->tail_call_indirect_label - (prog + 2 - start); > EMIT2(X86_JAE, offset); /* jae out */ > - EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ > - EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ > + EMIT3(0x83, 0x00, 0x01); /* add dword ptr [rax], 1 */ > > /* prog = array->ptrs[index]; */ > EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6, /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */ > @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, > pop_callee_regs(&prog, callee_regs_used); > } > > + /* pop tail_call_cnt_ptr */ > EMIT1(0x58); /* pop rax */ > if (stack_depth) > EMIT3_off32(0x48, 0x81, 0xC4, /* add rsp, sd */ > @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, > bool *callee_regs_used, u32 stack_depth, > struct jit_context *ctx) > { > - int tcc_off = -4 - round_up(stack_depth, 8); > + int tcc_ptr_off = -8 - round_up(stack_depth, 8); > u8 *prog = *pprog, *start = *pprog; > int offset; > > @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, > * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) > * goto out; > */ > - EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ > - EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ > + EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */ > + EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT); /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */ > > offset = ctx->tail_call_direct_label - (prog + 2 - start); > EMIT2(X86_JAE, offset); /* jae out */ > - EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ > - EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ > + EMIT3(0x83, 0x00, 0x01); /* add dword ptr [rax], 1 */ > > poke->tailcall_bypass = ip + (prog - start); > poke->adj_off = X86_TAIL_CALL_OFFSET; > @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, > pop_callee_regs(&prog, callee_regs_used); > } > > + /* pop tail_call_cnt_ptr */ > EMIT1(0x58); /* pop rax */ > if (stack_depth) > EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8)); > -- > 2.41.0 > >
On 2023/12/21 20:02, Maciej Fijalkowski wrote: > On Wed, Oct 11, 2023 at 11:27:23PM +0800, Leon Hwang wrote: >> From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall >> handling in JIT"), the tailcall on x64 works better than before. >> >> From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms >> for x64 JIT"), tailcall is able to run in BPF subprograms on x64. >> >> How about: >> >> 1. More than 1 subprograms are called in a bpf program. >> 2. The tailcalls in the subprograms call the bpf program. >> >> Because of missing tail_call_cnt back-propagation, a tailcall hierarchy >> comes up. And MAX_TAIL_CALL_CNT limit does not work for this case. >> >> As we know, in tail call context, the tail_call_cnt propagates by stack >> and rax register between BPF subprograms. So, propagating tail_call_cnt >> pointer by stack and rax register makes tail_call_cnt as like a global >> variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall >> hierarchy cases. >> >> Before jumping to other bpf prog, load tail_call_cnt from the pointer >> and then compare with MAX_TAIL_CALL_CNT. Finally, increment >> tail_call_cnt by its pointer. >> >> But, where does tail_call_cnt store? >> >> It stores on the stack of bpf prog's caller, like >> >> | STACK | >> | | >> | rip | >> +->| tcc | >> | | rip | >> | | rbp | >> | +---------+ RBP >> | | | >> | | | >> | | | >> +--| tcc_ptr | >> | rbx | >> +---------+ RSP >> >> And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf >> prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx" >> instruction on BPF JIT epilogue"). >> >> Why not back-propagate tail_call_cnt? >> >> It's because it's vulnerable to back-propagate it. It's unable to work >> well with the following case. >> >> int prog1(); >> int prog2(); >> >> prog1 is tail caller, and prog2 is tail callee. If we do back-propagate >> tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the >> same time? The answer is NO. Otherwise, there will be a register to be >> polluted, which will make kernel crash. >> >> Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") >> Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") >> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> >> --- >> arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++--------------- >> 1 file changed, 24 insertions(+), 16 deletions(-) >> >> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c >> index c2a0465d37da4..36631129cc800 100644 >> --- a/arch/x86/net/bpf_jit_comp.c >> +++ b/arch/x86/net/bpf_jit_comp.c >> @@ -256,7 +256,7 @@ struct jit_context { >> /* Number of bytes emit_patch() needs to generate instructions */ >> #define X86_PATCH_SIZE 5 >> /* Number of bytes that will be skipped on tailcall */ >> -#define X86_TAIL_CALL_OFFSET (11 + ENDBR_INSN_SIZE) >> +#define X86_TAIL_CALL_OFFSET (22 + ENDBR_INSN_SIZE) >> >> static void push_r12(u8 **pprog) >> { >> @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, >> EMIT_ENDBR(); >> emit_nops(&prog, X86_PATCH_SIZE); >> if (!ebpf_from_cbpf) { >> - if (tail_call_reachable && !is_subprog) >> + if (tail_call_reachable && !is_subprog) { >> /* When it's the entry of the whole tailcall context, >> * zeroing rax means initialising tail_call_cnt. >> */ >> - EMIT2(0x31, 0xC0); /* xor eax, eax */ >> - else >> - /* Keep the same instruction layout. */ >> - EMIT2(0x66, 0x90); /* nop2 */ >> + EMIT2(0x31, 0xC0); /* xor eax, eax */ >> + EMIT1(0x50); /* push rax */ >> + /* Make rax as ptr that points to tail_call_cnt. */ >> + EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */ >> + EMIT1_off32(0xE8, 2); /* call main prog */ >> + EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */ >> + EMIT1(0xC3); /* ret */ >> + } else { >> + /* Keep the same instruction size. */ >> + emit_nops(&prog, 13); >> + } > > At first sight it seemed to me too invasive but after trying out few other > approaches in the end it is elegant. > > I wanted to avoid a bit puzzling call insn in the prologue with a following > prologue layout (this will be based on entry prog from tailcall_bpf2bpf3.c that > was the first one to break): > > ffffffffc0012cb4: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > ffffffffc0012cb9: 55 push %rbp > ffffffffc0012cba: 48 89 e5 mov %rsp,%rbp > ffffffffc0012cbd: 48 83 ec 10 sub $0x10,%rsp > ffffffffc0012cc1: 48 89 65 f8 mov %rsp,-0x8(%rbp) > ffffffffc0012cc5: 48 c7 04 24 00 00 00 movq $0x0,(%rsp) > ffffffffc0012ccc: 00 > ffffffffc0012ccd: 48 8b 45 f8 mov -0x8(%rbp),%rax > ffffffffc0012cd1: 50 push %rax > ffffffffc0012cd2: 48 81 ec 80 00 00 00 sub $0x80,%rsp > > So we would have hidden 16 bytes on stack at the *beginning* of entry stack > frame. First thing right after rbp would be tcc pointer so referring to it > wouldn't require us to take into account stack depth. But then if we > follow with rest of insns: > > ffffffffc0012cd9: 31 f6 xor %esi,%esi > ffffffffc0012cdb: 48 89 75 f8 mov %rsi,-0x8(%rbp) // BUG, overwrite of tcc ptr > ffffffffc0012cdf: 48 89 75 f0 mov %rsi,-0x10(%rbp) > ffffffffc0012ce3: 48 89 75 e8 mov %rsi,-0x18(%rbp) > ffffffffc0012ce7: 48 89 75 e0 mov %rsi,-0x20(%rbp) > ffffffffc0012ceb: 48 89 75 d8 mov %rsi,-0x28(%rbp) > ffffffffc0012cef: 48 89 75 d0 mov %rsi,-0x30(%rbp) > ffffffffc0012cf3: 48 89 75 c8 mov %rsi,-0x38(%rbp) > ffffffffc0012cf7: 48 89 75 c0 mov %rsi,-0x40(%rbp) > ffffffffc0012cfb: 48 89 75 b8 mov %rsi,-0x48(%rbp) > ffffffffc0012cff: 48 89 75 b0 mov %rsi,-0x50(%rbp) > ffffffffc0012d03: 48 89 75 a8 mov %rsi,-0x58(%rbp) > ffffffffc0012d07: 48 89 75 a0 mov %rsi,-0x60(%rbp) > ffffffffc0012d0b: 48 89 75 98 mov %rsi,-0x68(%rbp) > ffffffffc0012d0f: 48 89 75 90 mov %rsi,-0x70(%rbp) > ffffffffc0012d13: 48 89 75 88 mov %rsi,-0x78(%rbp) > ffffffffc0012d17: 48 89 75 80 mov %rsi,-0x80(%rbp) > ffffffffc0012d1b: 48 0f b6 75 ff movzbq -0x1(%rbp),%rsi > ffffffffc0012d20: 40 88 75 ff mov %sil,-0x1(%rbp) > ffffffffc0012d24: 48 8b 85 f8 ff ff ff mov -0x8(%rbp),%rax > ffffffffc0012d2b: e8 30 00 00 00 call 0xffffffffc0012d60 > ffffffffc0012d30: c9 leave > ffffffffc0012d31: c3 ret > > So even though it would seem more obvious while looking at prologue what > is the intent behind it, this would require us to patch the instructions > that make us of R10/stack, which in the end would be way more invasive. > > After all, for x86 JIT code: > Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Thanks for your review. > > but it is a must to have a better commit message here. > I'll write a better commit message here. Thanks, Leon > Thanks! > >> } >> /* Exception callback receives FP as third parameter */ >> if (is_exception_cb) { >> @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, >> if (stack_depth) >> EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8)); >> if (tail_call_reachable) >> + /* Here, rax is tail_call_cnt_ptr. */ >> EMIT1(0x50); /* push rax */ >> *pprog = prog; >> } >> @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, >> u32 stack_depth, u8 *ip, >> struct jit_context *ctx) >> { >> - int tcc_off = -4 - round_up(stack_depth, 8); >> + int tcc_ptr_off = -8 - round_up(stack_depth, 8); >> u8 *prog = *pprog, *start = *pprog; >> int offset; >> >> @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, >> * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) >> * goto out; >> */ >> - EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ >> - EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ >> + EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */ >> + EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT); /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */ >> >> offset = ctx->tail_call_indirect_label - (prog + 2 - start); >> EMIT2(X86_JAE, offset); /* jae out */ >> - EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ >> - EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ >> + EMIT3(0x83, 0x00, 0x01); /* add dword ptr [rax], 1 */ >> >> /* prog = array->ptrs[index]; */ >> EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6, /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */ >> @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, >> pop_callee_regs(&prog, callee_regs_used); >> } >> >> + /* pop tail_call_cnt_ptr */ >> EMIT1(0x58); /* pop rax */ >> if (stack_depth) >> EMIT3_off32(0x48, 0x81, 0xC4, /* add rsp, sd */ >> @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, >> bool *callee_regs_used, u32 stack_depth, >> struct jit_context *ctx) >> { >> - int tcc_off = -4 - round_up(stack_depth, 8); >> + int tcc_ptr_off = -8 - round_up(stack_depth, 8); >> u8 *prog = *pprog, *start = *pprog; >> int offset; >> >> @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, >> * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) >> * goto out; >> */ >> - EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ >> - EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ >> + EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */ >> + EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT); /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */ >> >> offset = ctx->tail_call_direct_label - (prog + 2 - start); >> EMIT2(X86_JAE, offset); /* jae out */ >> - EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ >> - EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ >> + EMIT3(0x83, 0x00, 0x01); /* add dword ptr [rax], 1 */ >> >> poke->tailcall_bypass = ip + (prog - start); >> poke->adj_off = X86_TAIL_CALL_OFFSET; >> @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, >> pop_callee_regs(&prog, callee_regs_used); >> } >> >> + /* pop tail_call_cnt_ptr */ >> EMIT1(0x58); /* pop rax */ >> if (stack_depth) >> EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8)); >> -- >> 2.41.0 >> >>
On Thu, Dec 21, 2023 at 6:56 AM Leon Hwang <hffilwlqm@gmail.com> wrote: > > > > > > but it is a must to have a better commit message here. > > > > I'll write a better commit message here. Please respin with updated commit messages and drop RFC tag. Keep Reviewed-by that you received already.
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index c2a0465d37da4..36631129cc800 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -256,7 +256,7 @@ struct jit_context { /* Number of bytes emit_patch() needs to generate instructions */ #define X86_PATCH_SIZE 5 /* Number of bytes that will be skipped on tailcall */ -#define X86_TAIL_CALL_OFFSET (11 + ENDBR_INSN_SIZE) +#define X86_TAIL_CALL_OFFSET (22 + ENDBR_INSN_SIZE) static void push_r12(u8 **pprog) { @@ -340,14 +340,21 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, EMIT_ENDBR(); emit_nops(&prog, X86_PATCH_SIZE); if (!ebpf_from_cbpf) { - if (tail_call_reachable && !is_subprog) + if (tail_call_reachable && !is_subprog) { /* When it's the entry of the whole tailcall context, * zeroing rax means initialising tail_call_cnt. */ - EMIT2(0x31, 0xC0); /* xor eax, eax */ - else - /* Keep the same instruction layout. */ - EMIT2(0x66, 0x90); /* nop2 */ + EMIT2(0x31, 0xC0); /* xor eax, eax */ + EMIT1(0x50); /* push rax */ + /* Make rax as ptr that points to tail_call_cnt. */ + EMIT3(0x48, 0x89, 0xE0); /* mov rax, rsp */ + EMIT1_off32(0xE8, 2); /* call main prog */ + EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */ + EMIT1(0xC3); /* ret */ + } else { + /* Keep the same instruction size. */ + emit_nops(&prog, 13); + } } /* Exception callback receives FP as third parameter */ if (is_exception_cb) { @@ -373,6 +380,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, if (stack_depth) EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8)); if (tail_call_reachable) + /* Here, rax is tail_call_cnt_ptr. */ EMIT1(0x50); /* push rax */ *pprog = prog; } @@ -528,7 +536,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, u32 stack_depth, u8 *ip, struct jit_context *ctx) { - int tcc_off = -4 - round_up(stack_depth, 8); + int tcc_ptr_off = -8 - round_up(stack_depth, 8); u8 *prog = *pprog, *start = *pprog; int offset; @@ -553,13 +561,12 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) * goto out; */ - EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ - EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ + EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */ + EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT); /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */ offset = ctx->tail_call_indirect_label - (prog + 2 - start); EMIT2(X86_JAE, offset); /* jae out */ - EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ - EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ + EMIT3(0x83, 0x00, 0x01); /* add dword ptr [rax], 1 */ /* prog = array->ptrs[index]; */ EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6, /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */ @@ -581,6 +588,7 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog, pop_callee_regs(&prog, callee_regs_used); } + /* pop tail_call_cnt_ptr */ EMIT1(0x58); /* pop rax */ if (stack_depth) EMIT3_off32(0x48, 0x81, 0xC4, /* add rsp, sd */ @@ -609,7 +617,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, bool *callee_regs_used, u32 stack_depth, struct jit_context *ctx) { - int tcc_off = -4 - round_up(stack_depth, 8); + int tcc_ptr_off = -8 - round_up(stack_depth, 8); u8 *prog = *pprog, *start = *pprog; int offset; @@ -617,13 +625,12 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT) * goto out; */ - EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ - EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ + EMIT3_off32(0x48, 0x8B, 0x85, tcc_ptr_off); /* mov rax, qword ptr [rbp - tcc_ptr_off] */ + EMIT3(0x83, 0x38, MAX_TAIL_CALL_CNT); /* cmp dword ptr [rax], MAX_TAIL_CALL_CNT */ offset = ctx->tail_call_direct_label - (prog + 2 - start); EMIT2(X86_JAE, offset); /* jae out */ - EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ - EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ + EMIT3(0x83, 0x00, 0x01); /* add dword ptr [rax], 1 */ poke->tailcall_bypass = ip + (prog - start); poke->adj_off = X86_TAIL_CALL_OFFSET; @@ -640,6 +647,7 @@ static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog, pop_callee_regs(&prog, callee_regs_used); } + /* pop tail_call_cnt_ptr */ EMIT1(0x58); /* pop rax */ if (stack_depth) EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT"), the tailcall on x64 works better than before. From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms for x64 JIT"), tailcall is able to run in BPF subprograms on x64. How about: 1. More than 1 subprograms are called in a bpf program. 2. The tailcalls in the subprograms call the bpf program. Because of missing tail_call_cnt back-propagation, a tailcall hierarchy comes up. And MAX_TAIL_CALL_CNT limit does not work for this case. As we know, in tail call context, the tail_call_cnt propagates by stack and rax register between BPF subprograms. So, propagating tail_call_cnt pointer by stack and rax register makes tail_call_cnt as like a global variable, in order to make MAX_TAIL_CALL_CNT limit works for tailcall hierarchy cases. Before jumping to other bpf prog, load tail_call_cnt from the pointer and then compare with MAX_TAIL_CALL_CNT. Finally, increment tail_call_cnt by its pointer. But, where does tail_call_cnt store? It stores on the stack of bpf prog's caller, like | STACK | | | | rip | +->| tcc | | | rip | | | rbp | | +---------+ RBP | | | | | | | | | +--| tcc_ptr | | rbx | +---------+ RSP And tcc_ptr is unnecessary to be popped from stack at the epilogue of bpf prog, like the way of commit d207929d97ea028f ("bpf, x64: Drop "pop %rcx" instruction on BPF JIT epilogue"). Why not back-propagate tail_call_cnt? It's because it's vulnerable to back-propagate it. It's unable to work well with the following case. int prog1(); int prog2(); prog1 is tail caller, and prog2 is tail callee. If we do back-propagate tail_call_cnt at the epilogue of prog2, can prog2 run standalone at the same time? The answer is NO. Otherwise, there will be a register to be polluted, which will make kernel crash. Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> --- arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++++--------------- 1 file changed, 24 insertions(+), 16 deletions(-)