Message ID | 20220927185958.14995-11-dthaler1968@googlemail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | BPF |
Headers | show |
Series | [01/15] ebpf-docs: Move legacy packet instructions to a separate file | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
bpf/vmtest-bpf-next-VM_Test-4 | success | Logs for llvm-toolchain |
bpf/vmtest-bpf-next-VM_Test-5 | success | Logs for set-matrix |
bpf/vmtest-bpf-next-VM_Test-2 | success | Logs for build for x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-3 | success | Logs for build for x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-1 | success | Logs for build for s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-16 | success | Logs for test_verifier on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-13 | success | Logs for test_progs_no_alu32 on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-15 | success | Logs for test_verifier on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-17 | success | Logs for test_verifier on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-7 | success | Logs for test_maps on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-9 | success | Logs for test_progs on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-10 | success | Logs for test_progs on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-11 | fail | Logs for test_progs on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-12 | success | Logs for test_progs_no_alu32 on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-14 | success | Logs for test_progs_no_alu32 on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-8 | success | Logs for test_maps on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-PR | fail | PR summary |
bpf/vmtest-bpf-next-VM_Test-6 | success | Logs for test_maps on s390x with gcc |
On Tue, Sep 27, 2022 at 06:59:54PM +0000, dthaler1968@googlemail.com wrote: > From: Dave Thaler <dthaler@microsoft.com> > > Signed-off-by: Dave Thaler <dthaler@microsoft.com> > --- > Documentation/bpf/instruction-set.rst | 137 ++++++++++++++++---------- > 1 file changed, 87 insertions(+), 50 deletions(-) > > diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst > index b6f098104..328207ff6 100644 > --- a/Documentation/bpf/instruction-set.rst > +++ b/Documentation/bpf/instruction-set.rst > @@ -7,6 +7,9 @@ eBPF Instruction Set Specification, v1.0 > > This document specifies version 1.0 of the eBPF instruction set. > > +The eBPF instruction set consists of eleven 64 bit registers, a program counter, > +and 512 bytes of stack space. I would not add 512 to a doc. That's what we have now, but we might relax that in the future. > + > Documentation conventions > ========================= > > @@ -25,12 +28,24 @@ The eBPF calling convention is defined as: > * R6 - R9: callee saved registers that function calls will preserve > * R10: read-only frame pointer to access stack > > -R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if > -necessary across calls. > +Registers R0 - R5 are scratch registers, meaning the BPF program needs to either > +spill them to the BPF stack or move them to callee saved registers if these > +arguments are to be reused across multiple function calls. Spilling means > +that the value in the register is moved to the BPF stack. The reverse operation > +of moving the variable from the BPF stack to the register is called filling. > +The reason for spilling/filling is due to the limited number of registers. More canonical way would be to say that r0-r5 are caller saved. I like above clarification though. > + > +Upon entering execution of an eBPF program, registers R1 - R5 initially can contain > +the input arguments for the program (similar to the argc/argv pair for a typical C program). > +The actual number of registers used, and their meaning, is defined by the program type; > +for example, a networking program might have an argument that includes network packet data > +and/or metadata. > > Instruction encoding > ==================== > > +An eBPF program is a sequence of instructions. > + > eBPF has two instruction encodings: > > * the basic instruction encoding, which uses 64 bits to encode an instruction > @@ -63,7 +78,7 @@ opcode > operation to perform > > Note that most instructions do not use all of the fields. > -Unused fields shall be cleared to zero. > +Unused fields must be set to zero. > > As discussed below in `64-bit immediate instructions`_, some > instructions use a 64-bit immediate value that is constructed as follows. > @@ -90,7 +105,9 @@ and destination registers, respectively, rather than the register number. > Instruction classes > ------------------- > > -The three LSB bits of the 'opcode' field store the instruction class: > +The encoding of the 'opcode' field varies and can be determined from > +the three least significant bits (LSB) of the 'opcode' field which holds > +the "instruction class", as follows: > > ========= ===== =============================== =================================== > class value description reference > @@ -136,9 +153,11 @@ instruction class > Arithmetic instructions > ----------------------- > > -``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for > +Instruction class ``BPF_ALU`` uses 32-bit wide operands (zeroing the upper 32 bits > +of the destination register) while ``BPF_ALU64`` uses 64-bit wide operands for > otherwise identical operations. > -The 'code' field encodes the operation as below: > + > +The 4-bit 'code' field encodes the operation as follows: > > ======== ===== ========================================================== > code value description > @@ -168,21 +187,23 @@ the destination register is instead set to zero. > If execution would result in modulo by zero, > the destination register is instead left unchanged. > > -``BPF_ADD | BPF_X | BPF_ALU`` means:: > +Examples: > + > +``BPF_ADD | BPF_X | BPF_ALU`` (0x0c) means:: > > dst = (uint32_t) (dst + src) > > where '(uint32_t)' indicates truncation to 32 bits. > > -``BPF_ADD | BPF_X | BPF_ALU64`` means:: > +``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means:: I don't think adding hex values help here. > > dst = dst + src > > -``BPF_XOR | BPF_K | BPF_ALU`` means:: > +``BPF_XOR | BPF_K | BPF_ALU`` (0xa4) means:: > > src = (uint32_t) src ^ (uint32_t) imm > > -``BPF_XOR | BPF_K | BPF_ALU64`` means:: > +``BPF_XOR | BPF_K | BPF_ALU64`` (0xa7) means:: > > src = src ^ imm > > @@ -204,8 +225,9 @@ The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit > The byte swap instructions operate on the destination register > only and do not use a separate source register or immediate value. > > -The 1-bit source operand field in the opcode is used to to select what byte > -order the operation convert from or to: > +Byte swap instructions use non-default semantics of the 1-bit 'source' field in I would drop 'non-default'. Many fields have different meanings depending on opcode. BPF_SRC() macro just reads that bit. > +the 'opcode' field. Instead of indicating the source operator, it is instead > +used to select what byte order the operation converts from or to: > > ========= ===== ================================================= > source value description > @@ -215,24 +237,33 @@ BPF_TO_BE 0x08 convert between host byte order and big endian > ========= ===== ================================================= > > The 'imm' field encodes the width of the swap operations. The following widths > -are supported: 16, 32 and 64. > - > -Examples: > - > -``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means:: > - > - dst = htole16(dst) > - > -``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means:: > - > - dst = htobe64(dst) > +are supported: 16, 32 and 64. The following table summarizes the resulting > +possibilities: > + > +============================= ========= === ======== ================== > +opcode construction opcode imm mnemonic pseudocode > +============================= ========= === ======== ================== > +BPF_END | BPF_TO_LE | BPF_ALU 0xd4 16 le16 dst dst = htole16(dst) > +BPF_END | BPF_TO_LE | BPF_ALU 0xd4 32 le32 dst dst = htole32(dst) > +BPF_END | BPF_TO_LE | BPF_ALU 0xd4 64 le64 dst dst = htole64(dst) > +BPF_END | BPF_TO_BE | BPF_ALU 0xdc 16 be16 dst dst = htobe16(dst) > +BPF_END | BPF_TO_BE | BPF_ALU 0xdc 32 be32 dst dst = htobe32(dst) > +BPF_END | BPF_TO_BE | BPF_ALU 0xdc 64 be64 dst dst = htobe64(dst) > +============================= ========= === ======== ================== > + > +where > + > +* mnenomic indicates a short form that might be displayed by some tools such as disassemblers > +* 'htoleNN()' indicates converting a NN-bit value from host byte order to little-endian byte order > +* 'htobeNN()' indicates converting a NN-bit value from host byte order to big-endian byte order I think we need to add normal bswap insn. These to_le and to_be are very awkward to use. As soon as we have new insn the compiler will be using it. There is no equivalent of to_be and to_le in C. It wasn't good ISA design. > Jump instructions > ----------------- > > -``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for > +Instruction class ``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for > otherwise identical operations. > -The 'code' field encodes the operation as below: > + > +The 4-bit 'code' field encodes the operation as below, where PC is the program counter: > > ======== ===== ========================= ============ > code value description notes > @@ -253,9 +284,6 @@ BPF_JSLT 0xc0 PC += off if dst < src signed > BPF_JSLE 0xd0 PC += off if dst <= src signed > ======== ===== ========================= ============ > > -The eBPF program needs to store the return value into register R0 before doing a > -BPF_EXIT. > - > Helper functions > ~~~~~~~~~~~~~~~~ > Helper functions are a concept whereby BPF programs can call into a > @@ -285,7 +313,8 @@ For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_ > mode size instruction class > ============ ====== ================= > > -The mode modifier is one of: > +mode > + one of: > > ============= ===== ==================================== ============= > mode modifier value description reference > @@ -297,7 +326,8 @@ The mode modifier is one of: > BPF_ATOMIC 0xc0 atomic operations `Atomic operations`_ > ============= ===== ==================================== ============= > > -The size modifier is one of: > +size > + one of: > > ============= ===== ===================== > size modifier value description > @@ -308,25 +338,31 @@ The size modifier is one of: > BPF_DW 0x18 double word (8 bytes) > ============= ===== ===================== > > +instruction class > + the instruction class (see `Instruction classes`_) > + > Regular load and store operations > --------------------------------- > > The ``BPF_MEM`` mode modifier is used to encode regular load and store > instructions that transfer data between a register and memory. > > -``BPF_MEM | <size> | BPF_STX`` means:: > - > - *(size *) (dst + offset) = src_reg > - > -``BPF_MEM | <size> | BPF_ST`` means:: > - > - *(size *) (dst + offset) = imm32 > - > -``BPF_MEM | <size> | BPF_LDX`` means:: > - > - dst = *(size *) (src + offset) > - > -Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``. > +============================= ========= ==================================== > +opcode construction opcode pseudocode > +============================= ========= ==================================== > +BPF_MEM | BPF_B | BPF_LDX 0x71 dst = \*(uint8_t \*) (src + offset) > +BPF_MEM | BPF_H | BPF_LDX 0x69 dst = \*(uint16_t \*) (src + offset) > +BPF_MEM | BPF_W | BPF_LDX 0x61 dst = \*(uint32_t \*) (src + offset) > +BPF_MEM | BPF_DW | BPF_LDX 0x79 dst = \*(uint64_t \*) (src + offset) > +BPF_MEM | BPF_B | BPF_ST 0x72 \*(uint8_t \*) (dst + offset) = imm > +BPF_MEM | BPF_H | BPF_ST 0x6a \*(uint16_t \*) (dst + offset) = imm > +BPF_MEM | BPF_W | BPF_ST 0x62 \*(uint32_t \*) (dst + offset) = imm > +BPF_MEM | BPF_DW | BPF_ST 0x7a \*(uint64_t \*) (dst + offset) = imm > +BPF_MEM | BPF_B | BPF_STX 0x73 \*(uint8_t \*) (dst + offset) = src > +BPF_MEM | BPF_H | BPF_STX 0x6b \*(uint16_t \*) (dst + offset) = src > +BPF_MEM | BPF_W | BPF_STX 0x63 \*(uint32_t \*) (dst + offset) = src > +BPF_MEM | BPF_DW | BPF_STX 0x7b \*(uint64_t \*) (dst + offset) = src > +============================= ========= ==================================== I think the table is more verbose and less readable than the original text. > > Atomic operations > ----------------- > @@ -338,9 +374,11 @@ by other eBPF programs or means outside of this specification. > All atomic operations supported by eBPF are encoded as store operations > that use the ``BPF_ATOMIC`` mode modifier as follows: > > -* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations > -* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations > -* 8-bit and 16-bit wide atomic operations are not supported. > +* ``BPF_ATOMIC | BPF_W | BPF_STX`` (0xc3) for 32-bit operations > +* ``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) for 64-bit operations > + > +Note that 8-bit (``BPF_B``) and 16-bit (``BPF_H``) wide atomic operations are not supported, > +nor is ``BPF_ATOMIC | <size> | BPF_ST``. > > The 'imm' field is used to encode the actual atomic operation. > Simple atomic operation use a subset of the values defined to encode > @@ -355,16 +393,15 @@ BPF_AND 0x50 atomic and > BPF_XOR 0xa0 atomic xor > ======== ===== =========== > > - > -``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means:: > +``BPF_ATOMIC | BPF_W | BPF_STX`` (0xc3) with 'imm' = BPF_ADD means:: > > *(uint32_t *)(dst + offset) += src > > -``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means:: > +``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) with 'imm' = BPF ADD means:: > > *(uint64_t *)(dst + offset) += src > > -In addition to the simple atomic operations, there also is a modifier and > +In addition to the simple atomic operations above, there also is a modifier and > two complex atomic operations: > > =========== ================ =========================== > -- > 2.33.4 >
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > +The eBPF instruction set consists of eleven 64 bit registers, a > > +program counter, and 512 bytes of stack space. > > I would not add 512 to a doc. > That's what we have now, but we might relax that in the future. I think it's important to at least give a minimum. Will change to "at least 512". [...] > > +Registers R0 - R5 are scratch registers, meaning the BPF program > > +needs to either spill them to the BPF stack or move them to callee > > +saved registers if these arguments are to be reused across multiple > > +function calls. Spilling means that the value in the register is > > +moved to the BPF stack. The reverse operation of moving the variable > from the BPF stack to the register is called filling. > > +The reason for spilling/filling is due to the limited number of registers. > > More canonical way would be to say that r0-r5 are caller saved. Will change "scratch" to "caller-saved" [...] > > -``BPF_ADD | BPF_X | BPF_ALU64`` means:: > > +``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means:: > > I don't think adding hex values help here. I found it very helpful in verifying that the Appendix table was correct, and providing a correlation to the text here that shows the construction of the value. So I'd like to keep them. [...] > > -The 1-bit source operand field in the opcode is used to to select > > what byte -order the operation convert from or to: > > +Byte swap instructions use non-default semantics of the 1-bit > > +'source' field in > > I would drop 'non-default'. Many fields have different meanings depending > on opcode. > BPF_SRC() macro just reads that bit. Will reword to say: Byte swap instructions use the 1-bit 'source' field in the 'opcode' field as follows. Instead of indicating the source operator, it is instead used to select what byte order the operation converts from or to: [...] > > +* mnenomic indicates a short form that might be displayed by some > > +tools such as disassemblers > > +* 'htoleNN()' indicates converting a NN-bit value from host byte > > +order to little-endian byte order > > +* 'htobeNN()' indicates converting a NN-bit value from host byte > > +order to big-endian byte order > > I think we need to add normal bswap insn. > These to_le and to_be are very awkward to use. > As soon as we have new insn the compiler will be using it. > There is no equivalent of to_be and to_le in C. It wasn't good ISA design. I will interpret this as a request for someone to do code work, rather than any request for immediately changes to the doc :) [...] > > Regular load and store operations > > --------------------------------- > > > > The ``BPF_MEM`` mode modifier is used to encode regular load and > > store instructions that transfer data between a register and memory. > > > > -``BPF_MEM | <size> | BPF_STX`` means:: > > - > > - *(size *) (dst + offset) = src_reg > > - > > -``BPF_MEM | <size> | BPF_ST`` means:: > > - > > - *(size *) (dst + offset) = imm32 > > - > > -``BPF_MEM | <size> | BPF_LDX`` means:: > > - > > - dst = *(size *) (src + offset) > > - > > -Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``. > > +============================= ========= > ==================================== > > +opcode construction opcode pseudocode > > +============================= ========= > ==================================== > > +BPF_MEM | BPF_B | BPF_LDX 0x71 dst = \*(uint8_t \*) (src + offset) > > +BPF_MEM | BPF_H | BPF_LDX 0x69 dst = \*(uint16_t \*) (src + > offset) > > +BPF_MEM | BPF_W | BPF_LDX 0x61 dst = \*(uint32_t \*) (src + > offset) > > +BPF_MEM | BPF_DW | BPF_LDX 0x79 dst = \*(uint64_t \*) (src + > offset) > > +BPF_MEM | BPF_B | BPF_ST 0x72 \*(uint8_t \*) (dst + offset) = imm > > +BPF_MEM | BPF_H | BPF_ST 0x6a \*(uint16_t \*) (dst + offset) = > imm > > +BPF_MEM | BPF_W | BPF_ST 0x62 \*(uint32_t \*) (dst + offset) = > imm > > +BPF_MEM | BPF_DW | BPF_ST 0x7a \*(uint64_t \*) (dst + offset) = > imm > > +BPF_MEM | BPF_B | BPF_STX 0x73 \*(uint8_t \*) (dst + offset) = src > > +BPF_MEM | BPF_H | BPF_STX 0x6b \*(uint16_t \*) (dst + offset) = > src > > +BPF_MEM | BPF_W | BPF_STX 0x63 \*(uint32_t \*) (dst + offset) = > src > > +BPF_MEM | BPF_DW | BPF_STX 0x7b \*(uint64_t \*) (dst + offset) = > src > > +============================= ========= > > +==================================== > > I think the table is more verbose and less readable than the original text. Will change back to original text. Dave
On Tue, Oct 4, 2022 at 7:32 AM Dave Thaler <dthaler@microsoft.com> wrote: > > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > +The eBPF instruction set consists of eleven 64 bit registers, a > > > +program counter, and 512 bytes of stack space. > > > > I would not add 512 to a doc. > > That's what we have now, but we might relax that in the future. > > I think it's important to at least give a minimum. Will change to > "at least 512". I'm not sure 512 stands as a minimum either. When we have the gatekeeper prog it will be able to enforce any stack size. > [...] > > > +Registers R0 - R5 are scratch registers, meaning the BPF program > > > +needs to either spill them to the BPF stack or move them to callee > > > +saved registers if these arguments are to be reused across multiple > > > +function calls. Spilling means that the value in the register is > > > +moved to the BPF stack. The reverse operation of moving the variable > > from the BPF stack to the register is called filling. > > > +The reason for spilling/filling is due to the limited number of registers. > > > > More canonical way would be to say that r0-r5 are caller saved. > > Will change "scratch" to "caller-saved" > > [...] > > > -``BPF_ADD | BPF_X | BPF_ALU64`` means:: > > > +``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means:: > > > > I don't think adding hex values help here. > > I found it very helpful in verifying that the Appendix table > was correct, and providing a correlation to the text here > that shows the construction of the value. So I'd like to keep them. I think that means that the appendix table shouldn't be there either. I'd like to avoid both. > [...] > > > -The 1-bit source operand field in the opcode is used to to select > > > what byte -order the operation convert from or to: > > > +Byte swap instructions use non-default semantics of the 1-bit > > > +'source' field in > > > > I would drop 'non-default'. Many fields have different meanings depending > > on opcode. > > BPF_SRC() macro just reads that bit. > > Will reword to say: > Byte swap instructions use the 1-bit 'source' field in the 'opcode' field > as follows. Instead of indicating the source operator, it is instead > used to select what byte order the operation converts from or to: makes sense. > [...] > > > +* mnenomic indicates a short form that might be displayed by some > > > +tools such as disassemblers > > > +* 'htoleNN()' indicates converting a NN-bit value from host byte > > > +order to little-endian byte order > > > +* 'htobeNN()' indicates converting a NN-bit value from host byte > > > +order to big-endian byte order > > > > I think we need to add normal bswap insn. > > These to_le and to_be are very awkward to use. > > As soon as we have new insn the compiler will be using it. > > There is no equivalent of to_be and to_le in C. It wasn't good ISA design. > > I will interpret this as a request for someone to do code work, rather > than any request for immediately changes to the doc :) > > [...] > > > Regular load and store operations > > > --------------------------------- > > > > > > The ``BPF_MEM`` mode modifier is used to encode regular load and > > > store instructions that transfer data between a register and memory. > > > > > > -``BPF_MEM | <size> | BPF_STX`` means:: > > > - > > > - *(size *) (dst + offset) = src_reg > > > - > > > -``BPF_MEM | <size> | BPF_ST`` means:: > > > - > > > - *(size *) (dst + offset) = imm32 > > > - > > > -``BPF_MEM | <size> | BPF_LDX`` means:: > > > - > > > - dst = *(size *) (src + offset) > > > - > > > -Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``. > > > +============================= ========= > > ==================================== > > > +opcode construction opcode pseudocode > > > +============================= ========= > > ==================================== > > > +BPF_MEM | BPF_B | BPF_LDX 0x71 dst = \*(uint8_t \*) (src + offset) > > > +BPF_MEM | BPF_H | BPF_LDX 0x69 dst = \*(uint16_t \*) (src + > > offset) > > > +BPF_MEM | BPF_W | BPF_LDX 0x61 dst = \*(uint32_t \*) (src + > > offset) > > > +BPF_MEM | BPF_DW | BPF_LDX 0x79 dst = \*(uint64_t \*) (src + > > offset) > > > +BPF_MEM | BPF_B | BPF_ST 0x72 \*(uint8_t \*) (dst + offset) = imm > > > +BPF_MEM | BPF_H | BPF_ST 0x6a \*(uint16_t \*) (dst + offset) = > > imm > > > +BPF_MEM | BPF_W | BPF_ST 0x62 \*(uint32_t \*) (dst + offset) = > > imm > > > +BPF_MEM | BPF_DW | BPF_ST 0x7a \*(uint64_t \*) (dst + offset) = > > imm > > > +BPF_MEM | BPF_B | BPF_STX 0x73 \*(uint8_t \*) (dst + offset) = src > > > +BPF_MEM | BPF_H | BPF_STX 0x6b \*(uint16_t \*) (dst + offset) = > > src > > > +BPF_MEM | BPF_W | BPF_STX 0x63 \*(uint32_t \*) (dst + offset) = > > src > > > +BPF_MEM | BPF_DW | BPF_STX 0x7b \*(uint64_t \*) (dst + offset) = > > src > > > +============================= ========= > > > +==================================== > > > > I think the table is more verbose and less readable than the original text. > > Will change back to original text. Please see git. I've removed that table. Please don't add it back. I see no value in such tables other than more things to get wrong.
> > I found it very helpful in verifying that the Appendix table was > > correct, and providing a correlation to the text here that shows the > > construction of the value. So I'd like to keep them. > > I think that means that the appendix table shouldn't be there either. > I'd like to avoid both. I've heard from multiple people with different affiliations that the appendix is the most useful part of the document, and what they wanted to see added. So I added by popular request. Dave
Also worth noting that Quentin has a script that I believe parses the appendix and uses it to generate a validator for ebpf programs. (Which also helps validate the appendix). Dave > -----Original Message----- > From: Dave Thaler > Sent: Tuesday, October 4, 2022 8:55 AM > To: 'Alexei Starovoitov' <alexei.starovoitov@gmail.com> > Cc: bpf@vger.kernel.org > Subject: RE: [PATCH 11/15] ebpf-docs: Improve English readability > > > > I found it very helpful in verifying that the Appendix table was > > > correct, and providing a correlation to the text here that shows the > > > construction of the value. So I'd like to keep them. > > > > I think that means that the appendix table shouldn't be there either. > > I'd like to avoid both. > > I've heard from multiple people with different affiliations that the appendix > is the most useful part of the document, and what they wanted to see > added. So I added by popular request. > > Dave
On Tue, Oct 4, 2022 at 8:56 AM Dave Thaler <dthaler@microsoft.com> wrote: > > Also worth noting that Quentin has a script that I believe parses the appendix > and uses it to generate a validator for ebpf programs. (Which also > helps validate the appendix). The last thing I want to see is a document becoming a description of the code. We've always been doing it the other way around. The documentation can live next to the code and docs automatically generated from .h or .c files. Doing the other way around sooner or later will be a disaster. Imagine a typo in instruction-set.rst. What should we do next? Fix a typo and say, look, the code behaves differently, so we're fixing the doc. If so, there is close to zero reason to add hex to the doc, since it's not an authoritative answer. On the other hand if instruction-set.rst is the source of the truth then the code would have to change, which we obviously cannot do. So let's not get us into the corner with such tables. > Dave > > > -----Original Message----- > > From: Dave Thaler > > Sent: Tuesday, October 4, 2022 8:55 AM > > To: 'Alexei Starovoitov' <alexei.starovoitov@gmail.com> > > Cc: bpf@vger.kernel.org > > Subject: RE: [PATCH 11/15] ebpf-docs: Improve English readability > > > > > > I found it very helpful in verifying that the Appendix table was > > > > correct, and providing a correlation to the text here that shows the > > > > construction of the value. So I'd like to keep them. > > > > > > I think that means that the appendix table shouldn't be there either. > > > I'd like to avoid both. > > > > I've heard from multiple people with different affiliations that the appendix > > is the most useful part of the document, and what they wanted to see > > added. So I added by popular request. These people should speak up then.
> -----Original Message----- > From: Alexei Starovoitov <alexei.starovoitov@gmail.com> > Sent: Tuesday, October 4, 2022 9:20 AM > To: Dave Thaler <dthaler@microsoft.com> > Cc: bpf@vger.kernel.org > Subject: Re: [PATCH 11/15] ebpf-docs: Improve English readability > > On Tue, Oct 4, 2022 at 8:56 AM Dave Thaler <dthaler@microsoft.com> > wrote: > > > > Also worth noting that Quentin has a script that I believe parses the > > appendix and uses it to generate a validator for ebpf programs. > > (Which also helps validate the appendix). > > The last thing I want to see is a document becoming a description of the > code. > We've always been doing it the other way around. > The documentation can live next to the code and docs automatically > generated from .h or .c files. > Doing the other way around sooner or later will be a disaster. > Imagine a typo in instruction-set.rst. > What should we do next? Fix a typo and say, look, the code behaves > differently, so we're fixing the doc. > If so, there is close to zero reason to add hex to the doc, since it's not an > authoritative answer. > On the other hand if instruction-set.rst is the source of the truth then the > code would have to change, which we obviously cannot do. So let's not get us > into the corner with such tables. The point of a standard is to be a source of truth. If another implementation (ubpf, hbpf, etc.) doesn't match the spec, then the code would have to change. Such tables are seen as invaluable for determining correctness of other implementations. So the feedback is that it's important to have such if we want everyone else to do the right thing. > These people should speak up then. I agree. Dave
Regarding the tables: > Such tables are seen as invaluable for determining correctness of other > implementations. So the feedback is that it's important to have such if we > want everyone else to do the right thing. > > > These people should speak up then. > > I agree. Here's two public examples... Christoph Hellwig, said on May 17 at https://lore.kernel.org/bpf/20220517091011.GA18723@lst.de/: > One useful thing for this would be an opcode table with all the > optional field usage in machine readable format. > > Jim who is on CC has already built a nice table off all opcodes based > on existing material that might be a good starting point. Jim Harris responded on that thread with a strawman which was used as the basis for the table in the appendix. Jim then commented in the github version on August 30: > In my opinion, this table is the biggest thing that has been missing, > and will be most essential for a more "formal" specification. I will encourage them and others to comment on this thread. Dave
On Tue, Oct 4, 2022 at 10:09 AM Dave Thaler <dthaler@microsoft.com> wrote: > > Regarding the tables: > > Such tables are seen as invaluable for determining correctness of other > > implementations. So the feedback is that it's important to have such if we > > want everyone else to do the right thing. > > > > > These people should speak up then. > > > > I agree. > > Here's two public examples... > > Christoph Hellwig, said on May 17 at https://lore.kernel.org/bpf/20220517091011.GA18723@lst.de/: > > One useful thing for this would be an opcode table with all the > > optional field usage in machine readable format. > > > > Jim who is on CC has already built a nice table off all opcodes based > > on existing material that might be a good starting point. > > Jim Harris responded on that thread with a strawman which was > used as the basis for the table in the appendix. > > Jim then commented in the github version on August 30: > > In my opinion, this table is the biggest thing that has been missing, > > and will be most essential for a more "formal" specification. I think an opcode table is a huge help for developing alternate eBPF implementations - anything that makes it more explicit which opcodes are valid (and which ones are not). For example, the section for BPF_ALU and BPF_ALU64 classes lists the operation codes. BPF_END is in that list. Description says "see separate section below". The "Byte swap instructions" section then says that byte swap instructions always use BPF_ALU, even for 64-bit widths. So an implementer can synthesize all of this to determine that opcodes 0xD7 and 0xDF (which have BPF_ALU64 | BPF_END) are not valid. It would be more clear if somewhere there was a list that explicitly showed that 0xD4 and 0xDC (BPF_ALU | BPF_END) were valid, but 0xD7 and 0xDF were not. If there's concern that an appendix gets out of sync with the code and the primary sections of the instruction set doc, maybe these opcodes can be added to the existing per-class-code tables in the instruction set doc. A full opcode table could probably be generated from those tables instead of the hand-written one that Dave and I worked on in his patch. -Jim > I will encourage them and others to comment on this thread.
diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst index b6f098104..328207ff6 100644 --- a/Documentation/bpf/instruction-set.rst +++ b/Documentation/bpf/instruction-set.rst @@ -7,6 +7,9 @@ eBPF Instruction Set Specification, v1.0 This document specifies version 1.0 of the eBPF instruction set. +The eBPF instruction set consists of eleven 64 bit registers, a program counter, +and 512 bytes of stack space. + Documentation conventions ========================= @@ -25,12 +28,24 @@ The eBPF calling convention is defined as: * R6 - R9: callee saved registers that function calls will preserve * R10: read-only frame pointer to access stack -R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if -necessary across calls. +Registers R0 - R5 are scratch registers, meaning the BPF program needs to either +spill them to the BPF stack or move them to callee saved registers if these +arguments are to be reused across multiple function calls. Spilling means +that the value in the register is moved to the BPF stack. The reverse operation +of moving the variable from the BPF stack to the register is called filling. +The reason for spilling/filling is due to the limited number of registers. + +Upon entering execution of an eBPF program, registers R1 - R5 initially can contain +the input arguments for the program (similar to the argc/argv pair for a typical C program). +The actual number of registers used, and their meaning, is defined by the program type; +for example, a networking program might have an argument that includes network packet data +and/or metadata. Instruction encoding ==================== +An eBPF program is a sequence of instructions. + eBPF has two instruction encodings: * the basic instruction encoding, which uses 64 bits to encode an instruction @@ -63,7 +78,7 @@ opcode operation to perform Note that most instructions do not use all of the fields. -Unused fields shall be cleared to zero. +Unused fields must be set to zero. As discussed below in `64-bit immediate instructions`_, some instructions use a 64-bit immediate value that is constructed as follows. @@ -90,7 +105,9 @@ and destination registers, respectively, rather than the register number. Instruction classes ------------------- -The three LSB bits of the 'opcode' field store the instruction class: +The encoding of the 'opcode' field varies and can be determined from +the three least significant bits (LSB) of the 'opcode' field which holds +the "instruction class", as follows: ========= ===== =============================== =================================== class value description reference @@ -136,9 +153,11 @@ instruction class Arithmetic instructions ----------------------- -``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for +Instruction class ``BPF_ALU`` uses 32-bit wide operands (zeroing the upper 32 bits +of the destination register) while ``BPF_ALU64`` uses 64-bit wide operands for otherwise identical operations. -The 'code' field encodes the operation as below: + +The 4-bit 'code' field encodes the operation as follows: ======== ===== ========================================================== code value description @@ -168,21 +187,23 @@ the destination register is instead set to zero. If execution would result in modulo by zero, the destination register is instead left unchanged. -``BPF_ADD | BPF_X | BPF_ALU`` means:: +Examples: + +``BPF_ADD | BPF_X | BPF_ALU`` (0x0c) means:: dst = (uint32_t) (dst + src) where '(uint32_t)' indicates truncation to 32 bits. -``BPF_ADD | BPF_X | BPF_ALU64`` means:: +``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means:: dst = dst + src -``BPF_XOR | BPF_K | BPF_ALU`` means:: +``BPF_XOR | BPF_K | BPF_ALU`` (0xa4) means:: src = (uint32_t) src ^ (uint32_t) imm -``BPF_XOR | BPF_K | BPF_ALU64`` means:: +``BPF_XOR | BPF_K | BPF_ALU64`` (0xa7) means:: src = src ^ imm @@ -204,8 +225,9 @@ The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit The byte swap instructions operate on the destination register only and do not use a separate source register or immediate value. -The 1-bit source operand field in the opcode is used to to select what byte -order the operation convert from or to: +Byte swap instructions use non-default semantics of the 1-bit 'source' field in +the 'opcode' field. Instead of indicating the source operator, it is instead +used to select what byte order the operation converts from or to: ========= ===== ================================================= source value description @@ -215,24 +237,33 @@ BPF_TO_BE 0x08 convert between host byte order and big endian ========= ===== ================================================= The 'imm' field encodes the width of the swap operations. The following widths -are supported: 16, 32 and 64. - -Examples: - -``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means:: - - dst = htole16(dst) - -``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means:: - - dst = htobe64(dst) +are supported: 16, 32 and 64. The following table summarizes the resulting +possibilities: + +============================= ========= === ======== ================== +opcode construction opcode imm mnemonic pseudocode +============================= ========= === ======== ================== +BPF_END | BPF_TO_LE | BPF_ALU 0xd4 16 le16 dst dst = htole16(dst) +BPF_END | BPF_TO_LE | BPF_ALU 0xd4 32 le32 dst dst = htole32(dst) +BPF_END | BPF_TO_LE | BPF_ALU 0xd4 64 le64 dst dst = htole64(dst) +BPF_END | BPF_TO_BE | BPF_ALU 0xdc 16 be16 dst dst = htobe16(dst) +BPF_END | BPF_TO_BE | BPF_ALU 0xdc 32 be32 dst dst = htobe32(dst) +BPF_END | BPF_TO_BE | BPF_ALU 0xdc 64 be64 dst dst = htobe64(dst) +============================= ========= === ======== ================== + +where + +* mnenomic indicates a short form that might be displayed by some tools such as disassemblers +* 'htoleNN()' indicates converting a NN-bit value from host byte order to little-endian byte order +* 'htobeNN()' indicates converting a NN-bit value from host byte order to big-endian byte order Jump instructions ----------------- -``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for +Instruction class ``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for otherwise identical operations. -The 'code' field encodes the operation as below: + +The 4-bit 'code' field encodes the operation as below, where PC is the program counter: ======== ===== ========================= ============ code value description notes @@ -253,9 +284,6 @@ BPF_JSLT 0xc0 PC += off if dst < src signed BPF_JSLE 0xd0 PC += off if dst <= src signed ======== ===== ========================= ============ -The eBPF program needs to store the return value into register R0 before doing a -BPF_EXIT. - Helper functions ~~~~~~~~~~~~~~~~ Helper functions are a concept whereby BPF programs can call into a @@ -285,7 +313,8 @@ For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_ mode size instruction class ============ ====== ================= -The mode modifier is one of: +mode + one of: ============= ===== ==================================== ============= mode modifier value description reference @@ -297,7 +326,8 @@ The mode modifier is one of: BPF_ATOMIC 0xc0 atomic operations `Atomic operations`_ ============= ===== ==================================== ============= -The size modifier is one of: +size + one of: ============= ===== ===================== size modifier value description @@ -308,25 +338,31 @@ The size modifier is one of: BPF_DW 0x18 double word (8 bytes) ============= ===== ===================== +instruction class + the instruction class (see `Instruction classes`_) + Regular load and store operations --------------------------------- The ``BPF_MEM`` mode modifier is used to encode regular load and store instructions that transfer data between a register and memory. -``BPF_MEM | <size> | BPF_STX`` means:: - - *(size *) (dst + offset) = src_reg - -``BPF_MEM | <size> | BPF_ST`` means:: - - *(size *) (dst + offset) = imm32 - -``BPF_MEM | <size> | BPF_LDX`` means:: - - dst = *(size *) (src + offset) - -Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``. +============================= ========= ==================================== +opcode construction opcode pseudocode +============================= ========= ==================================== +BPF_MEM | BPF_B | BPF_LDX 0x71 dst = \*(uint8_t \*) (src + offset) +BPF_MEM | BPF_H | BPF_LDX 0x69 dst = \*(uint16_t \*) (src + offset) +BPF_MEM | BPF_W | BPF_LDX 0x61 dst = \*(uint32_t \*) (src + offset) +BPF_MEM | BPF_DW | BPF_LDX 0x79 dst = \*(uint64_t \*) (src + offset) +BPF_MEM | BPF_B | BPF_ST 0x72 \*(uint8_t \*) (dst + offset) = imm +BPF_MEM | BPF_H | BPF_ST 0x6a \*(uint16_t \*) (dst + offset) = imm +BPF_MEM | BPF_W | BPF_ST 0x62 \*(uint32_t \*) (dst + offset) = imm +BPF_MEM | BPF_DW | BPF_ST 0x7a \*(uint64_t \*) (dst + offset) = imm +BPF_MEM | BPF_B | BPF_STX 0x73 \*(uint8_t \*) (dst + offset) = src +BPF_MEM | BPF_H | BPF_STX 0x6b \*(uint16_t \*) (dst + offset) = src +BPF_MEM | BPF_W | BPF_STX 0x63 \*(uint32_t \*) (dst + offset) = src +BPF_MEM | BPF_DW | BPF_STX 0x7b \*(uint64_t \*) (dst + offset) = src +============================= ========= ==================================== Atomic operations ----------------- @@ -338,9 +374,11 @@ by other eBPF programs or means outside of this specification. All atomic operations supported by eBPF are encoded as store operations that use the ``BPF_ATOMIC`` mode modifier as follows: -* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations -* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations -* 8-bit and 16-bit wide atomic operations are not supported. +* ``BPF_ATOMIC | BPF_W | BPF_STX`` (0xc3) for 32-bit operations +* ``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) for 64-bit operations + +Note that 8-bit (``BPF_B``) and 16-bit (``BPF_H``) wide atomic operations are not supported, +nor is ``BPF_ATOMIC | <size> | BPF_ST``. The 'imm' field is used to encode the actual atomic operation. Simple atomic operation use a subset of the values defined to encode @@ -355,16 +393,15 @@ BPF_AND 0x50 atomic and BPF_XOR 0xa0 atomic xor ======== ===== =========== - -``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means:: +``BPF_ATOMIC | BPF_W | BPF_STX`` (0xc3) with 'imm' = BPF_ADD means:: *(uint32_t *)(dst + offset) += src -``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means:: +``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) with 'imm' = BPF ADD means:: *(uint64_t *)(dst + offset) += src -In addition to the simple atomic operations, there also is a modifier and +In addition to the simple atomic operations above, there also is a modifier and two complex atomic operations: =========== ================ ===========================