
bpf,x64: pad NOPs to make images converge more easily

Message ID: 20201211081903.17857-1-glin@suse.com (mailing list archive)
State: Changes Requested
Delegated to: BPF
Series bpf,x64: pad NOPs to make images converge more easily

Checks

Context                 Check     Description
netdev/tree_selection   success   Not a local patch

Commit Message

Gary Lin Dec. 11, 2020, 8:19 a.m. UTC
The x64 bpf jit expects bpf images to converge within the given number of
passes, but it can fail to do so in some corner cases. For example:

      l0:     ldh [4]
      l1:     jeq #0x537d, l2, l40
      l2:     ld [0]
      l3:     jeq #0xfa163e0d, l4, l40
      l4:     ldh [12]
      l5:     ldx #0xe
      l6:     jeq #0x86dd, l41, l7
      l8:     ld [x+16]
      l9:     ja 41

        [... repeated ja 41 ]

      l40:    ja 41
      l41:    ret #0
      l42:    ld #len
      l43:    ret a

This bpf program contains 32 "ja 41" instructions which are effectively
NOPs and are designed to be replaced with valid code dynamically. Ideally,
the bpf jit should optimize those "ja 41" instructions out when translating
the bpf instructions into x86_64 machine code. However, do_jit() can only
remove one "ja 41" with offset==0 per pass (removing one jump only turns the
next one into an offset-0 jump in the following pass, because jump offsets
are computed from the previous pass's addrs[]), so it takes at least 32
passes to eliminate those JMPs, which exceeds the current limit of 20
passes. In the end, the program gets rejected when BPF_JIT_ALWAYS_ON is set
even though it is legitimate as a classic socket filter.

To make the image more likely to converge within 20 passes, this commit
pads some instructions with NOPs in the last 5 passes:

1. conditional jumps
  A possible size variance comes from the adoption of imm8 JMP. If the
  offset fits in imm8, we calculate the size difference of this BPF instruction
  between the previous pass and the current pass and fill the gap with NOPs.
  To avoid recalculating the jump offset, those NOPs are inserted before the
  JMP code, so we have to subtract the 2 bytes of the imm8 JMP when
  calculating the number of NOPs (see the illustration after this list).

2. BPF_JA
  There are two cases for BPF_JA.
  a.) nop jumps
    If this instruction was not optimized out in the previous pass, we
    insert NOPs of the equivalent size instead of removing it.
  b.) label jumps
    Similar to conditional jumps, we prepend NOPs right before the JMP
    code.
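
As a rough illustration (hypothetical byte sequences, not taken from an
actual JIT dump): suppose the jcc of a conditional jump needed the rel32
form in the previous pass but its target is now close enough for rel8.
Instead of letting the instruction shrink and shift everything behind it,
the size difference is filled with NOPs in front of the jump:

      previous pass:  0f 84 xx xx xx xx    je   rel32      (6 bytes)
      current pass:   0f 1f 40 00          nopl 0x0(%rax)
                      74 xx                je   rel8       (4 + 2 bytes)

The same idea applies to a BPF_JA that was emitted as a 2-byte (eb rel8) or
5-byte (e9 rel32) jump in the previous pass and becomes a nop jump in the
current one: it is replaced by 2 or 5 bytes of NOP instead of being dropped.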

To keep the code concise, emit_nops() is modified to take a signed len and
to return the number of NOP bytes it inserted.

To support bpf-to-bpf calls, a new flag, padded, is introduced to 'struct bpf_prog'
so that bpf_int_jit_compile() can tell whether the program has already been padded.

Signed-off-by: Gary Lin <glin@suse.com>
---
 arch/x86/net/bpf_jit_comp.c | 68 ++++++++++++++++++++++++-------------
 include/linux/filter.h      |  1 +
 2 files changed, 45 insertions(+), 24 deletions(-)

Comments

Daniel Borkmann Dec. 11, 2020, 8:05 p.m. UTC | #1
On 12/11/20 9:19 AM, Gary Lin wrote:
> The x64 bpf jit expects bpf images converge within the given passes, but
> it could fail to do so with some corner cases. For example:
> 
>        l0:     ldh [4]
>        l1:     jeq #0x537d, l2, l40
>        l2:     ld [0]
>        l3:     jeq #0xfa163e0d, l4, l40
>        l4:     ldh [12]
>        l5:     ldx #0xe
>        l6:     jeq #0x86dd, l41, l7
>        l8:     ld [x+16]
>        l9:     ja 41
> 
>          [... repeated ja 41 ]
> 
>        l40:    ja 41
>        l41:    ret #0
>        l42:    ld #len
>        l43:    ret a
> 
> This bpf program contains 32 "ja 41" instructions which are effectively
> NOPs and designed to be replaced with valid code dynamically. Ideally,
> bpf jit should optimize those "ja 41" instructions out when translating
> the bpf instructions into x86_64 machine code. However, do_jit() can
> only remove one "ja 41" for offset==0 on each pass, so it requires at
> least 32 runs to eliminate those JMPs and exceeds the current limit of
> passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
> is set even though it's legit as a classic socket filter.
> 
> To make the image more likely converge within 20 passes, this commit
> pads some instructions with NOPs in the last 5 passes:
> 
> 1. conditional jumps
>    A possible size variance comes from the adoption of imm8 JMP. If the
>    offset is imm8, we calculate the size difference of this BPF instruction
>    between the previous pass and the current pass and fill the gap with NOPs.
>    To avoid the recalculation of jump offset, those NOPs are inserted before
>    the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
>    calculating the NOP number.
> 
> 2. BPF_JA
>    There are two conditions for BPF_JA.
>    a.) nop jumps
>      If this instruction is not optimized out in the previous pass,
>      instead of removing it, we insert the equivalent size of NOPs.
>    b.) label jumps
>      Similar to condition jumps, we prepend NOPs right before the JMP
>      code.
> 
> To make the code concise, emit_nops() is modified to use the signed len and
> return the number of inserted NOPs.
> 
> To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog'
> so that bpf_int_jit_compile() could know if the program is padded or not.

Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into
test_verifier (which is part of bpf kselftests) that would exercise this corner
case in x86 jit where we would start to nop pad so that there is proper coverage,
too.
Andrii Nakryiko Dec. 11, 2020, 8:58 p.m. UTC | #2
On Fri, Dec 11, 2020 at 8:51 AM Gary Lin <glin@suse.com> wrote:
>
> The x64 bpf jit expects bpf images converge within the given passes, but
> it could fail to do so with some corner cases. For example:
>
>       l0:     ldh [4]
>       l1:     jeq #0x537d, l2, l40
>       l2:     ld [0]
>       l3:     jeq #0xfa163e0d, l4, l40
>       l4:     ldh [12]
>       l5:     ldx #0xe
>       l6:     jeq #0x86dd, l41, l7
>       l8:     ld [x+16]
>       l9:     ja 41
>
>         [... repeated ja 41 ]
>
>       l40:    ja 41
>       l41:    ret #0
>       l42:    ld #len
>       l43:    ret a
>
> This bpf program contains 32 "ja 41" instructions which are effectively
> NOPs and designed to be replaced with valid code dynamically. Ideally,
> bpf jit should optimize those "ja 41" instructions out when translating
> the bpf instructions into x86_64 machine code. However, do_jit() can
> only remove one "ja 41" for offset==0 on each pass, so it requires at
> least 32 runs to eliminate those JMPs and exceeds the current limit of
> passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
> is set even though it's legit as a classic socket filter.
>
> To make the image more likely converge within 20 passes, this commit
> pads some instructions with NOPs in the last 5 passes:
>
> 1. conditional jumps
>   A possible size variance comes from the adoption of imm8 JMP. If the
>   offset is imm8, we calculate the size difference of this BPF instruction
>   between the previous pass and the current pass and fill the gap with NOPs.
>   To avoid the recalculation of jump offset, those NOPs are inserted before
>   the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
>   calculating the NOP number.
>
> 2. BPF_JA
>   There are two conditions for BPF_JA.
>   a.) nop jumps
>     If this instruction is not optimized out in the previous pass,
>     instead of removing it, we insert the equivalent size of NOPs.
>   b.) label jumps
>     Similar to condition jumps, we prepend NOPs right before the JMP
>     code.
>
> To make the code concise, emit_nops() is modified to use the signed len and
> return the number of inserted NOPs.
>
> To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog'
> so that bpf_int_jit_compile() could know if the program is padded or not.
>
> Signed-off-by: Gary Lin <glin@suse.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 68 ++++++++++++++++++++++++-------------
>  include/linux/filter.h      |  1 +
>  2 files changed, 45 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 796506dcfc42..30b81c8539b3 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -789,8 +789,31 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
>         }
>  }
>
> +static int emit_nops(u8 **pprog, int len)
> +{
> +       u8 *prog = *pprog;
> +       int i, noplen, cnt = 0;
> +
> +       while (len > 0) {
> +               noplen = len;
> +
> +               if (noplen > ASM_NOP_MAX)
> +                       noplen = ASM_NOP_MAX;
> +
> +               for (i = 0; i < noplen; i++)
> +                       EMIT1(ideal_nops[noplen][i]);
> +               len -= noplen;
> +       }
> +
> +       *pprog = prog;
> +
> +       return cnt;

Isn't cnt always zero? I guess it was supposed to be `cnt = len` at
the beginning?

But then it begs the question of how this patch was actually tested, given
that emit_nops() is returning wrong answers. Changes like this should
definitely come with tests.

> +}
> +
> +#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
> +
>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
> -                 int oldproglen, struct jit_context *ctx)
> +                 int oldproglen, struct jit_context *ctx, bool jmp_padding)
>  {
>         bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
>         struct bpf_insn *insn = bpf_prog->insnsi;
> @@ -1409,6 +1432,8 @@ xadd:                     if (is_imm8(insn->off))
>                         }
>                         jmp_offset = addrs[i + insn->off] - addrs[i];
>                         if (is_imm8(jmp_offset)) {
> +                               if (jmp_padding)
> +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
>                                 EMIT2(jmp_cond, jmp_offset);
>                         } else if (is_simm32(jmp_offset)) {
>                                 EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
> @@ -1431,11 +1456,19 @@ xadd:                   if (is_imm8(insn->off))
>                         else
>                                 jmp_offset = addrs[i + insn->off] - addrs[i];
>
> -                       if (!jmp_offset)
> -                               /* Optimize out nop jumps */
> +                       if (!jmp_offset) {
> +                               /*
> +                                * If jmp_padding is enabled, the extra nops will
> +                                * be inserted. Otherwise, optimize out nop jumps.
> +                                */
> +                               if (jmp_padding)
> +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF);
>                                 break;
> +                       }
>  emit_jmp:
>                         if (is_imm8(jmp_offset)) {
> +                               if (jmp_padding)
> +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
>                                 EMIT2(0xEB, jmp_offset);
>                         } else if (is_simm32(jmp_offset)) {
>                                 EMIT1_off32(0xE9, jmp_offset);
> @@ -1578,26 +1611,6 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
>         return 0;
>  }
>
> -static void emit_nops(u8 **pprog, unsigned int len)
> -{
> -       unsigned int i, noplen;
> -       u8 *prog = *pprog;
> -       int cnt = 0;
> -
> -       while (len > 0) {
> -               noplen = len;
> -
> -               if (noplen > ASM_NOP_MAX)
> -                       noplen = ASM_NOP_MAX;
> -
> -               for (i = 0; i < noplen; i++)
> -                       EMIT1(ideal_nops[noplen][i]);
> -               len -= noplen;
> -       }
> -
> -       *pprog = prog;
> -}
> -
>  static void emit_align(u8 **pprog, u32 align)
>  {
>         u8 *target, *prog = *pprog;
> @@ -1972,6 +1985,9 @@ struct x64_jit_data {
>         struct jit_context ctx;
>  };
>
> +#define MAX_PASSES 20
> +#define PADDING_PASSES (MAX_PASSES - 5)
> +
>  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  {
>         struct bpf_binary_header *header = NULL;
> @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>         struct jit_context ctx = {};
>         bool tmp_blinded = false;
>         bool extra_pass = false;
> +       bool padding = prog->padded;

can this ever be true on assignment? I.e., can the program be jitted twice?

>         u8 *image = NULL;
>         int *addrs;
>         int pass;
> @@ -2043,7 +2060,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>          * pass to emit the final image.
>          */
>         for (pass = 0; pass < 20 || image; pass++) {
> -               proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
> +               if (!padding && pass >= PADDING_PASSES)
> +                       padding = true;

Just, unconditionally:

padding = pass >= PADDING_PASSES;

> +               proglen = do_jit(prog, addrs, image, oldproglen, &ctx, padding);
>                 if (proglen <= 0) {
>  out_image:
>                         image = NULL;
> @@ -2101,6 +2120,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>                 prog->bpf_func = (void *)image;
>                 prog->jited = 1;
>                 prog->jited_len = proglen;
> +               prog->padded = padding;
>         } else {
>                 prog = orig_prog;
>         }
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1b62397bd124..cb7ce2b3737a 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -531,6 +531,7 @@ struct bpf_prog {
>                                 dst_needed:1,   /* Do we need dst entry? */
>                                 blinded:1,      /* Was blinded */
>                                 is_func:1,      /* program is a bpf function */
> +                               padded:1,       /* jitted image was padded */
>                                 kprobe_override:1, /* Do we override a kprobe? */
>                                 has_callchain_buf:1, /* callchain buffer allocated? */
>                                 enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
> --
> 2.29.2
>
Daniel Borkmann Dec. 11, 2020, 9:13 p.m. UTC | #3
On 12/11/20 9:58 PM, Andrii Nakryiko wrote:
> On Fri, Dec 11, 2020 at 8:51 AM Gary Lin <glin@suse.com> wrote:
>>
[...]
>> +static int emit_nops(u8 **pprog, int len)
>> +{
>> +       u8 *prog = *pprog;
>> +       int i, noplen, cnt = 0;
>> +
>> +       while (len > 0) {
>> +               noplen = len;
>> +
>> +               if (noplen > ASM_NOP_MAX)
>> +                       noplen = ASM_NOP_MAX;
>> +
>> +               for (i = 0; i < noplen; i++)
>> +                       EMIT1(ideal_nops[noplen][i]);
>> +               len -= noplen;
>> +       }
>> +
>> +       *pprog = prog;
>> +
>> +       return cnt;
> 
> Isn't cnt always zero? I guess it was supposed to be `cnt = len` at
> the beginning?

EMIT*() macros from the JIT will bump cnt internally.
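
For reference, the EMIT helpers in arch/x86/net/bpf_jit_comp.c look roughly
like this (shown here for illustration; see the file for the exact
definitions):

	#define EMIT(bytes, len) \
		do { prog = emit_code(prog, bytes, len); cnt += len; } while (0)

	#define EMIT1(b1)	EMIT(b1, 1)
	#define EMIT2(b1, b2)	EMIT((b1) + ((b2) << 8), 2)

so each EMIT1() inside emit_nops() bumps the local cnt by one, and the
function ends up returning the number of NOP bytes it emitted.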

> But then it begs the question how this patch was actually tested given
> emit_nops() is returning wrong answers? Changes like this should
> definitely come with tests.

Yeah, applying a change like this without tests for the corner cases it is trying
to fix would be a no-go, so they are definitely needed, ideally also with a disasm
dump of the relevant code and a detailed analysis to show why it's bullet-proof.

>> +#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
>> +
>>   static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>> -                 int oldproglen, struct jit_context *ctx)
>> +                 int oldproglen, struct jit_context *ctx, bool jmp_padding)
>>   {
>>          bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
>>          struct bpf_insn *insn = bpf_prog->insnsi;
>> @@ -1409,6 +1432,8 @@ xadd:                     if (is_imm8(insn->off))
>>                          }
>>                          jmp_offset = addrs[i + insn->off] - addrs[i];
>>                          if (is_imm8(jmp_offset)) {
>> +                               if (jmp_padding)
>> +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
>>                                  EMIT2(jmp_cond, jmp_offset);
>>                          } else if (is_simm32(jmp_offset)) {
>>                                  EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
>> @@ -1431,11 +1456,19 @@ xadd:                   if (is_imm8(insn->off))
>>                          else
>>                                  jmp_offset = addrs[i + insn->off] - addrs[i];
>>
>> -                       if (!jmp_offset)
>> -                               /* Optimize out nop jumps */
>> +                       if (!jmp_offset) {
>> +                               /*
>> +                                * If jmp_padding is enabled, the extra nops will
>> +                                * be inserted. Otherwise, optimize out nop jumps.
>> +                                */
>> +                               if (jmp_padding)
>> +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF);
>>                                  break;
>> +                       }
>>   emit_jmp:
>>                          if (is_imm8(jmp_offset)) {
>> +                               if (jmp_padding)
>> +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
>>                                  EMIT2(0xEB, jmp_offset);
>>                          } else if (is_simm32(jmp_offset)) {
>>                                  EMIT1_off32(0xE9, jmp_offset);
>> @@ -1578,26 +1611,6 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
>>          return 0;
>>   }
>>
>> -static void emit_nops(u8 **pprog, unsigned int len)
>> -{
>> -       unsigned int i, noplen;
>> -       u8 *prog = *pprog;
>> -       int cnt = 0;
>> -
>> -       while (len > 0) {
>> -               noplen = len;
>> -
>> -               if (noplen > ASM_NOP_MAX)
>> -                       noplen = ASM_NOP_MAX;
>> -
>> -               for (i = 0; i < noplen; i++)
>> -                       EMIT1(ideal_nops[noplen][i]);
>> -               len -= noplen;
>> -       }
>> -
>> -       *pprog = prog;
>> -}
>> -
>>   static void emit_align(u8 **pprog, u32 align)
>>   {
>>          u8 *target, *prog = *pprog;
>> @@ -1972,6 +1985,9 @@ struct x64_jit_data {
>>          struct jit_context ctx;
>>   };
>>
>> +#define MAX_PASSES 20
>> +#define PADDING_PASSES (MAX_PASSES - 5)
>> +
>>   struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>   {
>>          struct bpf_binary_header *header = NULL;
>> @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>          struct jit_context ctx = {};
>>          bool tmp_blinded = false;
>>          bool extra_pass = false;
>> +       bool padding = prog->padded;
> 
> can this ever be true on assignment? I.e., can the program be jitted twice?

Yes, progs can be passed into the JIT twice, see also jit_subprogs(). In one of
the earlier patches it would still potentially change the image size a second
time which would break subprogs aka bpf2bpf calls.

>>          u8 *image = NULL;
>>          int *addrs;
>>          int pass;
>> @@ -2043,7 +2060,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>           * pass to emit the final image.
>>           */
>>          for (pass = 0; pass < 20 || image; pass++) {
>> -               proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
>> +               if (!padding && pass >= PADDING_PASSES)
>> +                       padding = true;
> 
[...]
Alexei Starovoitov Dec. 12, 2020, 2:24 a.m. UTC | #4
On Fri, Dec 11, 2020 at 1:13 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> +                       }
> >>   emit_jmp:
> >>                          if (is_imm8(jmp_offset)) {
> >> +                               if (jmp_padding)
> >> +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);

Could you describe all possible numbers of padding bytes?
Is it 0, 2, 4?
It would be good to add a warn_on_once to make sure the number
of nops is as expected.

> >>   struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> >>   {
> >>          struct bpf_binary_header *header = NULL;
> >> @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> >>          struct jit_context ctx = {};
> >>          bool tmp_blinded = false;
> >>          bool extra_pass = false;
> >> +       bool padding = prog->padded;
> >
> > can this ever be true on assignment? I.e., can the program be jitted twice?
>
> Yes, progs can be passed into the JIT twice, see also jit_subprogs(). In one of
> the earlier patches it would still potentially change the image size a second
> time which would break subprogs aka bpf2bpf calls.

Right. I think the memorized padded flag shouldn't live in the sticky bits
of the prog structure.
It's only needed between the last pass and the extra pass for bpf2bpf calls.
I think it would be cleaner to keep it in struct x64_jit_data *jit_data.
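
A minimal sketch of that idea (illustrative only; the member name and the
exact layout of x64_jit_data are assumptions):

	struct x64_jit_data {
		struct bpf_binary_header *header;
		int *addrs;
		u8 *image;
		int proglen;
		bool jmp_padding;	/* hypothetical: carried from the last
					 * regular pass to the extra pass
					 */
		struct jit_context ctx;
	};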

As others have said, the selftests are a must-have,
especially for bpf2bpf calls where one subprog is padded.

Thanks!
Gary Lin Dec. 14, 2020, 3:51 a.m. UTC | #5
On Fri, Dec 11, 2020 at 12:58:17PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 11, 2020 at 8:51 AM Gary Lin <glin@suse.com> wrote:
> >
> > The x64 bpf jit expects bpf images converge within the given passes, but
> > it could fail to do so with some corner cases. For example:
> >
> >       l0:     ldh [4]
> >       l1:     jeq #0x537d, l2, l40
> >       l2:     ld [0]
> >       l3:     jeq #0xfa163e0d, l4, l40
> >       l4:     ldh [12]
> >       l5:     ldx #0xe
> >       l6:     jeq #0x86dd, l41, l7
> >       l8:     ld [x+16]
> >       l9:     ja 41
> >
> >         [... repeated ja 41 ]
> >
> >       l40:    ja 41
> >       l41:    ret #0
> >       l42:    ld #len
> >       l43:    ret a
> >
> > This bpf program contains 32 "ja 41" instructions which are effectively
> > NOPs and designed to be replaced with valid code dynamically. Ideally,
> > bpf jit should optimize those "ja 41" instructions out when translating
> > the bpf instructions into x86_64 machine code. However, do_jit() can
> > only remove one "ja 41" for offset==0 on each pass, so it requires at
> > least 32 runs to eliminate those JMPs and exceeds the current limit of
> > passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
> > is set even though it's legit as a classic socket filter.
> >
> > To make the image more likely converge within 20 passes, this commit
> > pads some instructions with NOPs in the last 5 passes:
> >
> > 1. conditional jumps
> >   A possible size variance comes from the adoption of imm8 JMP. If the
> >   offset is imm8, we calculate the size difference of this BPF instruction
> >   between the previous pass and the current pass and fill the gap with NOPs.
> >   To avoid the recalculation of jump offset, those NOPs are inserted before
> >   the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
> >   calculating the NOP number.
> >
> > 2. BPF_JA
> >   There are two conditions for BPF_JA.
> >   a.) nop jumps
> >     If this instruction is not optimized out in the previous pass,
> >     instead of removing it, we insert the equivalent size of NOPs.
> >   b.) label jumps
> >     Similar to condition jumps, we prepend NOPs right before the JMP
> >     code.
> >
> > To make the code concise, emit_nops() is modified to use the signed len and
> > return the number of inserted NOPs.
> >
> > To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog'
> > so that bpf_int_jit_compile() could know if the program is padded or not.
> >
> > Signed-off-by: Gary Lin <glin@suse.com>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 68 ++++++++++++++++++++++++-------------
> >  include/linux/filter.h      |  1 +
> >  2 files changed, 45 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index 796506dcfc42..30b81c8539b3 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -789,8 +789,31 @@ static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
> >         }
> >  }
> >
> > +static int emit_nops(u8 **pprog, int len)
> > +{
> > +       u8 *prog = *pprog;
> > +       int i, noplen, cnt = 0;
> > +
> > +       while (len > 0) {
> > +               noplen = len;
> > +
> > +               if (noplen > ASM_NOP_MAX)
> > +                       noplen = ASM_NOP_MAX;
> > +
> > +               for (i = 0; i < noplen; i++)
> > +                       EMIT1(ideal_nops[noplen][i]);
> > +               len -= noplen;
> > +       }
> > +
> > +       *pprog = prog;
> > +
> > +       return cnt;
> 
> Isn't cnt always zero? I guess it was supposed to be `cnt = len` at
> the beginning?
> 
[Skip this one since Daniel answered in another mail.]

> But then it begs the question how this patch was actually tested given
> emit_nops() is returning wrong answers? Changes like this should
> definitely come with tests.
> 
Before submitting this patch, I lowered PADDING_PASSES to 2 and ran
tools/testing/selftests/bpf/test_progs. We surely need some official
test cases for this patch. Will do it in v2.

> > +}
> > +
> > +#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
> > +
> >  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
> > -                 int oldproglen, struct jit_context *ctx)
> > +                 int oldproglen, struct jit_context *ctx, bool jmp_padding)
> >  {
> >         bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
> >         struct bpf_insn *insn = bpf_prog->insnsi;
> > @@ -1409,6 +1432,8 @@ xadd:                     if (is_imm8(insn->off))
> >                         }
> >                         jmp_offset = addrs[i + insn->off] - addrs[i];
> >                         if (is_imm8(jmp_offset)) {
> > +                               if (jmp_padding)
> > +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
> >                                 EMIT2(jmp_cond, jmp_offset);
> >                         } else if (is_simm32(jmp_offset)) {
> >                                 EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
> > @@ -1431,11 +1456,19 @@ xadd:                   if (is_imm8(insn->off))
> >                         else
> >                                 jmp_offset = addrs[i + insn->off] - addrs[i];
> >
> > -                       if (!jmp_offset)
> > -                               /* Optimize out nop jumps */
> > +                       if (!jmp_offset) {
> > +                               /*
> > +                                * If jmp_padding is enabled, the extra nops will
> > +                                * be inserted. Otherwise, optimize out nop jumps.
> > +                                */
> > +                               if (jmp_padding)
> > +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF);
> >                                 break;
> > +                       }
> >  emit_jmp:
> >                         if (is_imm8(jmp_offset)) {
> > +                               if (jmp_padding)
> > +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
> >                                 EMIT2(0xEB, jmp_offset);
> >                         } else if (is_simm32(jmp_offset)) {
> >                                 EMIT1_off32(0xE9, jmp_offset);
> > @@ -1578,26 +1611,6 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
> >         return 0;
> >  }
> >
> > -static void emit_nops(u8 **pprog, unsigned int len)
> > -{
> > -       unsigned int i, noplen;
> > -       u8 *prog = *pprog;
> > -       int cnt = 0;
> > -
> > -       while (len > 0) {
> > -               noplen = len;
> > -
> > -               if (noplen > ASM_NOP_MAX)
> > -                       noplen = ASM_NOP_MAX;
> > -
> > -               for (i = 0; i < noplen; i++)
> > -                       EMIT1(ideal_nops[noplen][i]);
> > -               len -= noplen;
> > -       }
> > -
> > -       *pprog = prog;
> > -}
> > -
> >  static void emit_align(u8 **pprog, u32 align)
> >  {
> >         u8 *target, *prog = *pprog;
> > @@ -1972,6 +1985,9 @@ struct x64_jit_data {
> >         struct jit_context ctx;
> >  };
> >
> > +#define MAX_PASSES 20
> > +#define PADDING_PASSES (MAX_PASSES - 5)
> > +
> >  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> >  {
> >         struct bpf_binary_header *header = NULL;
> > @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> >         struct jit_context ctx = {};
> >         bool tmp_blinded = false;
> >         bool extra_pass = false;
> > +       bool padding = prog->padded;
> 
> can this ever be true on assignment? I.e., can the program be jitted twice?
> 
Yes, bpf-to-bpf runs bpf_int_jit_compile() twice, and the second invocation
is expected to run only one pass.

> >         u8 *image = NULL;
> >         int *addrs;
> >         int pass;
> > @@ -2043,7 +2060,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> >          * pass to emit the final image.
> >          */
> >         for (pass = 0; pass < 20 || image; pass++) {
> > -               proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
> > +               if (!padding && pass >= PADDING_PASSES)
> > +                       padding = true;
> 
> Just, unconditionally:
> 
> padding = pass >= PADDING_PASSES;
> 
But this would turn 'padding' from 'true' to 'false' when invoking
bpf_int_jit_compile() the second time for bpf-to-bpf, so we still need
to do it conditionally.
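
A sketch of the failure mode with the unconditional assignment (assuming the
flag stays in struct bpf_prog as in this version of the patch):

	/* extra pass for a bpf2bpf subprog: prog->padded == true, pass == 0 */
	bool padding = prog->padded;		/* true                     */
	...
	padding = pass >= PADDING_PASSES;	/* wrongly resets to false  */

so the image emitted in the extra pass would no longer match the padded
image sizes recorded earlier.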

Gary Lin

> > +               proglen = do_jit(prog, addrs, image, oldproglen, &ctx, padding);
> >                 if (proglen <= 0) {
> >  out_image:
> >                         image = NULL;
> > @@ -2101,6 +2120,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> >                 prog->bpf_func = (void *)image;
> >                 prog->jited = 1;
> >                 prog->jited_len = proglen;
> > +               prog->padded = padding;
> >         } else {
> >                 prog = orig_prog;
> >         }
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 1b62397bd124..cb7ce2b3737a 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -531,6 +531,7 @@ struct bpf_prog {
> >                                 dst_needed:1,   /* Do we need dst entry? */
> >                                 blinded:1,      /* Was blinded */
> >                                 is_func:1,      /* program is a bpf function */
> > +                               padded:1,       /* jitted image was padded */
> >                                 kprobe_override:1, /* Do we override a kprobe? */
> >                                 has_callchain_buf:1, /* callchain buffer allocated? */
> >                                 enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
> > --
> > 2.29.2
> >
>
Gary Lin Dec. 14, 2020, 3:56 a.m. UTC | #6
On Fri, Dec 11, 2020 at 09:05:05PM +0100, Daniel Borkmann wrote:
> On 12/11/20 9:19 AM, Gary Lin wrote:
> > The x64 bpf jit expects bpf images converge within the given passes, but
> > it could fail to do so with some corner cases. For example:
> > 
> >        l0:     ldh [4]
> >        l1:     jeq #0x537d, l2, l40
> >        l2:     ld [0]
> >        l3:     jeq #0xfa163e0d, l4, l40
> >        l4:     ldh [12]
> >        l5:     ldx #0xe
> >        l6:     jeq #0x86dd, l41, l7
> >        l8:     ld [x+16]
> >        l9:     ja 41
> > 
> >          [... repeated ja 41 ]
> > 
> >        l40:    ja 41
> >        l41:    ret #0
> >        l42:    ld #len
> >        l43:    ret a
> > 
> > This bpf program contains 32 "ja 41" instructions which are effectively
> > NOPs and designed to be replaced with valid code dynamically. Ideally,
> > bpf jit should optimize those "ja 41" instructions out when translating
> > the bpf instructions into x86_64 machine code. However, do_jit() can
> > only remove one "ja 41" for offset==0 on each pass, so it requires at
> > least 32 runs to eliminate those JMPs and exceeds the current limit of
> > passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
> > is set even though it's legit as a classic socket filter.
> > 
> > To make the image more likely converge within 20 passes, this commit
> > pads some instructions with NOPs in the last 5 passes:
> > 
> > 1. conditional jumps
> >    A possible size variance comes from the adoption of imm8 JMP. If the
> >    offset is imm8, we calculate the size difference of this BPF instruction
> >    between the previous pass and the current pass and fill the gap with NOPs.
> >    To avoid the recalculation of jump offset, those NOPs are inserted before
> >    the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
> >    calculating the NOP number.
> > 
> > 2. BPF_JA
> >    There are two conditions for BPF_JA.
> >    a.) nop jumps
> >      If this instruction is not optimized out in the previous pass,
> >      instead of removing it, we insert the equivalent size of NOPs.
> >    b.) label jumps
> >      Similar to condition jumps, we prepend NOPs right before the JMP
> >      code.
> > 
> > To make the code concise, emit_nops() is modified to use the signed len and
> > return the number of inserted NOPs.
> > 
> > To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog'
> > so that bpf_int_jit_compile() could know if the program is padded or not.
> 
> Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into
> test_verifier (which is part of bpf kselftests) that would exercise this corner
> case in x86 jit where we would start to nop pad so that there is proper coverage,
> too.
> 
The corner case I described in the commit message would likely be rejected by
the verifier because most of those "ja 41" are unreachable instructions.
Is there any known test case that needs more than 15 passes in the x86 jit?

Gary Lin
Gary Lin Dec. 14, 2020, 5:12 a.m. UTC | #7
On Fri, Dec 11, 2020 at 06:24:47PM -0800, Alexei Starovoitov wrote:
> On Fri, Dec 11, 2020 at 1:13 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >> +                       }
> > >>   emit_jmp:
> > >>                          if (is_imm8(jmp_offset)) {
> > >> +                               if (jmp_padding)
> > >> +                                       cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
> 
> Could you describe all possible numbers of bytes in padding?
> Is it 0, 2, 4 ?
> Would be good to add warn_on_once to make sure the number
> of nops is expected.
> 
For conditional jumps, the padding can be 0 or 4 bytes. For nop jumps, it may
be 0, 2, or 5 bytes, and for the pure label jumps, 0 or 3 bytes. I will add
the warning in the next version.
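
A minimal sketch of such a check (hypothetical helper name and placement;
the exact form in v2 may differ):

	/* sanity-check the computed padding length before emitting it */
	static int emit_jmp_padding(u8 **pprog, int len)
	{
		WARN_ON_ONCE(len < 0 || len > 5);
		return emit_nops(pprog, len);
	}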

> > >>   struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> > >>   {
> > >>          struct bpf_binary_header *header = NULL;
> > >> @@ -1981,6 +1997,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
> > >>          struct jit_context ctx = {};
> > >>          bool tmp_blinded = false;
> > >>          bool extra_pass = false;
> > >> +       bool padding = prog->padded;
> > >
> > > can this ever be true on assignment? I.e., can the program be jitted twice?
> >
> > Yes, progs can be passed into the JIT twice, see also jit_subprogs(). In one of
> > the earlier patches it would still potentially change the image size a second
> > time which would break subprogs aka bpf2bpf calls.
> 
> Right. I think memorized padded flag shouldn't be in sticky bits
> of the prog structure.
> It's only needed between the last pass and extra pass for bpf2bpf calls.
> I think it would be cleaner to keep it in struct x64_jit_data *jit_data.
> 
Okay, jit_data is surely a better place for the flag.

> As others have said the selftests are must have.
> Especially for bpf2bpf calls where one subprog is padded.
> 
Will try to craft some test cases for this patch in v2.

Gary Lin
Gary Lin Dec. 14, 2020, 8:15 a.m. UTC | #8
On Mon, Dec 14, 2020 at 11:56:22AM +0800, Gary Lin wrote:
> On Fri, Dec 11, 2020 at 09:05:05PM +0100, Daniel Borkmann wrote:
> > On 12/11/20 9:19 AM, Gary Lin wrote:
> > > The x64 bpf jit expects bpf images converge within the given passes, but
> > > it could fail to do so with some corner cases. For example:
> > > 
> > >        l0:     ldh [4]
> > >        l1:     jeq #0x537d, l2, l40
> > >        l2:     ld [0]
> > >        l3:     jeq #0xfa163e0d, l4, l40
> > >        l4:     ldh [12]
> > >        l5:     ldx #0xe
> > >        l6:     jeq #0x86dd, l41, l7
> > >        l8:     ld [x+16]
> > >        l9:     ja 41
> > > 
> > >          [... repeated ja 41 ]
> > > 
> > >        l40:    ja 41
> > >        l41:    ret #0
> > >        l42:    ld #len
> > >        l43:    ret a
> > > 
> > > This bpf program contains 32 "ja 41" instructions which are effectively
> > > NOPs and designed to be replaced with valid code dynamically. Ideally,
> > > bpf jit should optimize those "ja 41" instructions out when translating
> > > the bpf instructions into x86_64 machine code. However, do_jit() can
> > > only remove one "ja 41" for offset==0 on each pass, so it requires at
> > > least 32 runs to eliminate those JMPs and exceeds the current limit of
> > > passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
> > > is set even though it's legit as a classic socket filter.
> > > 
> > > To make the image more likely converge within 20 passes, this commit
> > > pads some instructions with NOPs in the last 5 passes:
> > > 
> > > 1. conditional jumps
> > >    A possible size variance comes from the adoption of imm8 JMP. If the
> > >    offset is imm8, we calculate the size difference of this BPF instruction
> > >    between the previous pass and the current pass and fill the gap with NOPs.
> > >    To avoid the recalculation of jump offset, those NOPs are inserted before
> > >    the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
> > >    calculating the NOP number.
> > > 
> > > 2. BPF_JA
> > >    There are two conditions for BPF_JA.
> > >    a.) nop jumps
> > >      If this instruction is not optimized out in the previous pass,
> > >      instead of removing it, we insert the equivalent size of NOPs.
> > >    b.) label jumps
> > >      Similar to condition jumps, we prepend NOPs right before the JMP
> > >      code.
> > > 
> > > To make the code concise, emit_nops() is modified to use the signed len and
> > > return the number of inserted NOPs.
> > > 
> > > To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog'
> > > so that bpf_int_jit_compile() could know if the program is padded or not.
> > 
> > Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into
> > test_verifier (which is part of bpf kselftests) that would exercise this corner
> > case in x86 jit where we would start to nop pad so that there is proper coverage,
> > too.
> > 
> The corner case I had in the commit description is likely being rejected by
> the verifier because most of those "ja 41" are unreachable instructions.
> Is there any known test case that needs more than 15 passes in x86 jit?
> 
Just an idea. Besides the mentioned corner case, how about making
PADDING_PASSES dynamically configurable (sysfs?) and reusing the existing
test cases? Then we could have a script that sets PADDING_PASSES from 1 to 20
and runs the bpf selftests for each value. This would guarantee that the
padding strategy is exercised under at least some PADDING_PASSES settings.

Gary Lin
Daniel Borkmann Dec. 14, 2020, 3:31 p.m. UTC | #9
On 12/14/20 9:15 AM, Gary Lin wrote:
> On Mon, Dec 14, 2020 at 11:56:22AM +0800, Gary Lin wrote:
>> On Fri, Dec 11, 2020 at 09:05:05PM +0100, Daniel Borkmann wrote:
>>> On 12/11/20 9:19 AM, Gary Lin wrote:
>>>> The x64 bpf jit expects bpf images converge within the given passes, but
>>>> it could fail to do so with some corner cases. For example:
>>>>
>>>>         l0:     ldh [4]
>>>>         l1:     jeq #0x537d, l2, l40
>>>>         l2:     ld [0]
>>>>         l3:     jeq #0xfa163e0d, l4, l40
>>>>         l4:     ldh [12]
>>>>         l5:     ldx #0xe
>>>>         l6:     jeq #0x86dd, l41, l7
>>>>         l8:     ld [x+16]
>>>>         l9:     ja 41
>>>>
>>>>           [... repeated ja 41 ]
>>>>
>>>>         l40:    ja 41
>>>>         l41:    ret #0
>>>>         l42:    ld #len
>>>>         l43:    ret a
>>>>
>>>> This bpf program contains 32 "ja 41" instructions which are effectively
>>>> NOPs and designed to be replaced with valid code dynamically. Ideally,
>>>> bpf jit should optimize those "ja 41" instructions out when translating
>>>> the bpf instructions into x86_64 machine code. However, do_jit() can
>>>> only remove one "ja 41" for offset==0 on each pass, so it requires at
>>>> least 32 runs to eliminate those JMPs and exceeds the current limit of
>>>> passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
>>>> is set even though it's legit as a classic socket filter.
>>>>
>>>> To make the image more likely converge within 20 passes, this commit
>>>> pads some instructions with NOPs in the last 5 passes:
>>>>
>>>> 1. conditional jumps
>>>>     A possible size variance comes from the adoption of imm8 JMP. If the
>>>>     offset is imm8, we calculate the size difference of this BPF instruction
>>>>     between the previous pass and the current pass and fill the gap with NOPs.
>>>>     To avoid the recalculation of jump offset, those NOPs are inserted before
>>>>     the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
>>>>     calculating the NOP number.
>>>>
>>>> 2. BPF_JA
>>>>     There are two conditions for BPF_JA.
>>>>     a.) nop jumps
>>>>       If this instruction is not optimized out in the previous pass,
>>>>       instead of removing it, we insert the equivalent size of NOPs.
>>>>     b.) label jumps
>>>>       Similar to condition jumps, we prepend NOPs right before the JMP
>>>>       code.
>>>>
>>>> To make the code concise, emit_nops() is modified to use the signed len and
>>>> return the number of inserted NOPs.
>>>>
>>>> To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog'
>>>> so that bpf_int_jit_compile() could know if the program is padded or not.
>>>
>>> Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into
>>> test_verifier (which is part of bpf kselftests) that would exercise this corner
>>> case in x86 jit where we would start to nop pad so that there is proper coverage,
>>> too.
>>>
>> The corner case I had in the commit description is likely being rejected by
>> the verifier because most of those "ja 41" are unreachable instructions.
>> Is there any known test case that needs more than 15 passes in x86 jit?
>>
> Just an idea. Besides the mentioned corner case, how about making
> PADDING_PASSES dynamically configurable (sysfs?) and reusing the existing
> test cases? So that we can have a script to set PADDING_PASSES from 1 to 20
> and run the bpf selftests separately. This guarantees that the padding
> strategy will be applied at least in a certain PADDING_PASSES settings.

I think exposing such an implementation detail to users is not that great as they
normally should not need to worry about these things (plus it's also rarely hit
in practice when developing against llvm). On top of all that, such a knob would
have no meaning in the case of other JITs since most other non-x86 ones have a fixed
number of passes. I think it's probably useful for local testing of the fix, but
less suitable for exposing as a sysctl 'uapi' upstream. Re crafting a test case for
bpf-2-bpf calls, you could look at bpf_fill_maxinsns10() in lib/test_bpf.c, which
also triggers a high number of passes; port it over to test_verifier from the
selftests and experiment from there to integrate calls.

Thanks,
Daniel
Gary Lin Dec. 15, 2020, 1:50 a.m. UTC | #10
On Mon, Dec 14, 2020 at 04:31:44PM +0100, Daniel Borkmann wrote:
> On 12/14/20 9:15 AM, Gary Lin wrote:
> > On Mon, Dec 14, 2020 at 11:56:22AM +0800, Gary Lin wrote:
> > > On Fri, Dec 11, 2020 at 09:05:05PM +0100, Daniel Borkmann wrote:
> > > > On 12/11/20 9:19 AM, Gary Lin wrote:
> > > > > The x64 bpf jit expects bpf images converge within the given passes, but
> > > > > it could fail to do so with some corner cases. For example:
> > > > > 
> > > > >         l0:     ldh [4]
> > > > >         l1:     jeq #0x537d, l2, l40
> > > > >         l2:     ld [0]
> > > > >         l3:     jeq #0xfa163e0d, l4, l40
> > > > >         l4:     ldh [12]
> > > > >         l5:     ldx #0xe
> > > > >         l6:     jeq #0x86dd, l41, l7
> > > > >         l8:     ld [x+16]
> > > > >         l9:     ja 41
> > > > > 
> > > > >           [... repeated ja 41 ]
> > > > > 
> > > > >         l40:    ja 41
> > > > >         l41:    ret #0
> > > > >         l42:    ld #len
> > > > >         l43:    ret a
> > > > > 
> > > > > This bpf program contains 32 "ja 41" instructions which are effectively
> > > > > NOPs and designed to be replaced with valid code dynamically. Ideally,
> > > > > bpf jit should optimize those "ja 41" instructions out when translating
> > > > > the bpf instructions into x86_64 machine code. However, do_jit() can
> > > > > only remove one "ja 41" for offset==0 on each pass, so it requires at
> > > > > least 32 runs to eliminate those JMPs and exceeds the current limit of
> > > > > passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
> > > > > is set even though it's legit as a classic socket filter.
> > > > > 
> > > > > To make the image more likely converge within 20 passes, this commit
> > > > > pads some instructions with NOPs in the last 5 passes:
> > > > > 
> > > > > 1. conditional jumps
> > > > >     A possible size variance comes from the adoption of imm8 JMP. If the
> > > > >     offset is imm8, we calculate the size difference of this BPF instruction
> > > > >     between the previous pass and the current pass and fill the gap with NOPs.
> > > > >     To avoid the recalculation of jump offset, those NOPs are inserted before
> > > > >     the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
> > > > >     calculating the NOP number.
> > > > > 
> > > > > 2. BPF_JA
> > > > >     There are two conditions for BPF_JA.
> > > > >     a.) nop jumps
> > > > >       If this instruction is not optimized out in the previous pass,
> > > > >       instead of removing it, we insert the equivalent size of NOPs.
> > > > >     b.) label jumps
> > > > >       Similar to condition jumps, we prepend NOPs right before the JMP
> > > > >       code.
> > > > > 
> > > > > To make the code concise, emit_nops() is modified to use the signed len and
> > > > > return the number of inserted NOPs.
> > > > > 
> > > > > To support bpf-to-bpf, a new flag, padded, is introduced to 'struct bpf_prog'
> > > > > so that bpf_int_jit_compile() could know if the program is padded or not.
> > > > 
> > > > Please also add multiple hand-crafted test cases e.g. for bpf-to-bpf calls into
> > > > test_verifier (which is part of bpf kselftests) that would exercise this corner
> > > > case in x86 jit where we would start to nop pad so that there is proper coverage,
> > > > too.
> > > > 
> > > The corner case I had in the commit description is likely being rejected by
> > > the verifier because most of those "ja 41" are unreachable instructions.
> > > Is there any known test case that needs more than 15 passes in x86 jit?
> > > 
> > Just an idea. Besides the mentioned corner case, how about making
> > PADDING_PASSES dynamically configurable (sysfs?) and reusing the existing
> > test cases? So that we can have a script to set PADDING_PASSES from 1 to 20
> > and run the bpf selftests separately. This guarantees that the padding
> > strategy will be applied at least in a certain PADDING_PASSES settings.
> 
> I think exposing such implementation detail to users is not that great as they
> normally should not need to worry about these things (plus it's also rarely hit
> in practice when developing against llvm). On top of all that, such knob would
> have no meaning in case of other JITs since most other non-x86 ones have a fixed
> number of passes. I think it's probably useful for local testing of the fix, but
> less suitable for exposing as sysctl 'uapi' upstream. Re crafting a test case for
> bpf-2-bpf calls, you could orientate on bpf_fill_maxinsns10() in lib/test_bpf.c
> which is also triggering a high number of passes, port it over to test_verifier
> from selftests and experiment from there to integrate calls.
> 
Thanks for the hint. Will try bpf_fill_maxinsns10().

Gary Lin

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 796506dcfc42..30b81c8539b3 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -789,8 +789,31 @@  static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
 	}
 }
 
+static int emit_nops(u8 **pprog, int len)
+{
+	u8 *prog = *pprog;
+	int i, noplen, cnt = 0;
+
+	while (len > 0) {
+		noplen = len;
+
+		if (noplen > ASM_NOP_MAX)
+			noplen = ASM_NOP_MAX;
+
+		for (i = 0; i < noplen; i++)
+			EMIT1(ideal_nops[noplen][i]);
+		len -= noplen;
+	}
+
+	*pprog = prog;
+
+	return cnt;
+}
+
+#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
+
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
-		  int oldproglen, struct jit_context *ctx)
+		  int oldproglen, struct jit_context *ctx, bool jmp_padding)
 {
 	bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
 	struct bpf_insn *insn = bpf_prog->insnsi;
@@ -1409,6 +1432,8 @@  xadd:			if (is_imm8(insn->off))
 			}
 			jmp_offset = addrs[i + insn->off] - addrs[i];
 			if (is_imm8(jmp_offset)) {
+				if (jmp_padding)
+					cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
 				EMIT2(jmp_cond, jmp_offset);
 			} else if (is_simm32(jmp_offset)) {
 				EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
@@ -1431,11 +1456,19 @@  xadd:			if (is_imm8(insn->off))
 			else
 				jmp_offset = addrs[i + insn->off] - addrs[i];
 
-			if (!jmp_offset)
-				/* Optimize out nop jumps */
+			if (!jmp_offset) {
+				/*
+				 * If jmp_padding is enabled, the extra nops will
+				 * be inserted. Otherwise, optimize out nop jumps.
+				 */
+				if (jmp_padding)
+					cnt += emit_nops(&prog, INSN_SZ_DIFF);
 				break;
+			}
 emit_jmp:
 			if (is_imm8(jmp_offset)) {
+				if (jmp_padding)
+					cnt += emit_nops(&prog, INSN_SZ_DIFF - 2);
 				EMIT2(0xEB, jmp_offset);
 			} else if (is_simm32(jmp_offset)) {
 				EMIT1_off32(0xE9, jmp_offset);
@@ -1578,26 +1611,6 @@  static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
 	return 0;
 }
 
-static void emit_nops(u8 **pprog, unsigned int len)
-{
-	unsigned int i, noplen;
-	u8 *prog = *pprog;
-	int cnt = 0;
-
-	while (len > 0) {
-		noplen = len;
-
-		if (noplen > ASM_NOP_MAX)
-			noplen = ASM_NOP_MAX;
-
-		for (i = 0; i < noplen; i++)
-			EMIT1(ideal_nops[noplen][i]);
-		len -= noplen;
-	}
-
-	*pprog = prog;
-}
-
 static void emit_align(u8 **pprog, u32 align)
 {
 	u8 *target, *prog = *pprog;
@@ -1972,6 +1985,9 @@  struct x64_jit_data {
 	struct jit_context ctx;
 };
 
+#define MAX_PASSES 20
+#define PADDING_PASSES (MAX_PASSES - 5)
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	struct bpf_binary_header *header = NULL;
@@ -1981,6 +1997,7 @@  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	struct jit_context ctx = {};
 	bool tmp_blinded = false;
 	bool extra_pass = false;
+	bool padding = prog->padded;
 	u8 *image = NULL;
 	int *addrs;
 	int pass;
@@ -2043,7 +2060,9 @@  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	 * pass to emit the final image.
 	 */
 	for (pass = 0; pass < 20 || image; pass++) {
-		proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
+		if (!padding && pass >= PADDING_PASSES)
+			padding = true;
+		proglen = do_jit(prog, addrs, image, oldproglen, &ctx, padding);
 		if (proglen <= 0) {
 out_image:
 			image = NULL;
@@ -2101,6 +2120,7 @@  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		prog->bpf_func = (void *)image;
 		prog->jited = 1;
 		prog->jited_len = proglen;
+		prog->padded = padding;
 	} else {
 		prog = orig_prog;
 	}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1b62397bd124..cb7ce2b3737a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -531,6 +531,7 @@  struct bpf_prog {
 				dst_needed:1,	/* Do we need dst entry? */
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
+				padded:1,	/* jitted image was padded */
 				kprobe_override:1, /* Do we override a kprobe? */
 				has_callchain_buf:1, /* callchain buffer allocated? */
 				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */