diff mbox series

[11/15] ebpf-docs: Improve English readability

Message ID 20220927185958.14995-11-dthaler1968@googlemail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series [01/15] ebpf-docs: Move legacy packet instructions to a separate file | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch
bpf/vmtest-bpf-next-VM_Test-4 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-1 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 fail Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-6 success Logs for test_maps on s390x with gcc

Commit Message

Dave Thaler Sept. 27, 2022, 6:59 p.m. UTC
From: Dave Thaler <dthaler@microsoft.com>

Signed-off-by: Dave Thaler <dthaler@microsoft.com>
---
 Documentation/bpf/instruction-set.rst | 137 ++++++++++++++++----------
 1 file changed, 87 insertions(+), 50 deletions(-)

Comments

Alexei Starovoitov Sept. 30, 2022, 10:16 p.m. UTC | #1
On Tue, Sep 27, 2022 at 06:59:54PM +0000, dthaler1968@googlemail.com wrote:
> From: Dave Thaler <dthaler@microsoft.com>
> 
> Signed-off-by: Dave Thaler <dthaler@microsoft.com>
> ---
>  Documentation/bpf/instruction-set.rst | 137 ++++++++++++++++----------
>  1 file changed, 87 insertions(+), 50 deletions(-)
> 
> diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst
> index b6f098104..328207ff6 100644
> --- a/Documentation/bpf/instruction-set.rst
> +++ b/Documentation/bpf/instruction-set.rst
> @@ -7,6 +7,9 @@ eBPF Instruction Set Specification, v1.0
>  
>  This document specifies version 1.0 of the eBPF instruction set.
>  
> +The eBPF instruction set consists of eleven 64 bit registers, a program counter,
> +and 512 bytes of stack space.

I would not add 512 to a doc.
That's what we have now, but we might relax that in the future.

> +
>  Documentation conventions
>  =========================
>  
> @@ -25,12 +28,24 @@ The eBPF calling convention is defined as:
>  * R6 - R9: callee saved registers that function calls will preserve
>  * R10: read-only frame pointer to access stack
>  
> -R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if
> -necessary across calls.
> +Registers R0 - R5 are scratch registers, meaning the BPF program needs to either
> +spill them to the BPF stack or move them to callee saved registers if these
> +arguments are to be reused across multiple function calls. Spilling means
> +that the value in the register is moved to the BPF stack. The reverse operation
> +of moving the variable from the BPF stack to the register is called filling.
> +The reason for spilling/filling is due to the limited number of registers.

More canonical way would be to say that r0-r5 are caller saved.
I like above clarification though.

> +
> +Upon entering execution of an eBPF program, registers R1 - R5 initially can contain
> +the input arguments for the program (similar to the argc/argv pair for a typical C program).
> +The actual number of registers used, and their meaning, is defined by the program type;
> +for example, a networking program might have an argument that includes network packet data
> +and/or metadata.
>  
>  Instruction encoding
>  ====================
>  
> +An eBPF program is a sequence of instructions.
> +
>  eBPF has two instruction encodings:
>  
>  * the basic instruction encoding, which uses 64 bits to encode an instruction
> @@ -63,7 +78,7 @@ opcode
>    operation to perform
>  
>  Note that most instructions do not use all of the fields.
> -Unused fields shall be cleared to zero.
> +Unused fields must be set to zero.
>  
>  As discussed below in `64-bit immediate instructions`_, some
>  instructions use a 64-bit immediate value that is constructed as follows.
> @@ -90,7 +105,9 @@ and destination registers, respectively, rather than the register number.
>  Instruction classes
>  -------------------
>  
> -The three LSB bits of the 'opcode' field store the instruction class:
> +The encoding of the 'opcode' field varies and can be determined from
> +the three least significant bits (LSB) of the 'opcode' field which holds
> +the "instruction class", as follows:
>  
>  =========  =====  ===============================  ===================================
>  class      value  description                      reference
> @@ -136,9 +153,11 @@ instruction class
>  Arithmetic instructions
>  -----------------------
>  
> -``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
> +Instruction class ``BPF_ALU`` uses 32-bit wide operands (zeroing the upper 32 bits
> +of the destination register) while ``BPF_ALU64`` uses 64-bit wide operands for
>  otherwise identical operations.
> -The 'code' field encodes the operation as below:
> +
> +The 4-bit 'code' field encodes the operation as follows:
>  
>  ========  =====  ==========================================================
>  code      value  description
> @@ -168,21 +187,23 @@ the destination register is instead set to zero.
>  If execution would result in modulo by zero,
>  the destination register is instead left unchanged.
>  
> -``BPF_ADD | BPF_X | BPF_ALU`` means::
> +Examples:
> +
> +``BPF_ADD | BPF_X | BPF_ALU`` (0x0c) means::
>  
>    dst = (uint32_t) (dst + src)
>  
>  where '(uint32_t)' indicates truncation to 32 bits.
>  
> -``BPF_ADD | BPF_X | BPF_ALU64`` means::
> +``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means::

I don't think adding hex values help here.

>  
>    dst = dst + src
>  
> -``BPF_XOR | BPF_K | BPF_ALU`` means::
> +``BPF_XOR | BPF_K | BPF_ALU`` (0xa4) means::
>  
>    src = (uint32_t) src ^ (uint32_t) imm
>  
> -``BPF_XOR | BPF_K | BPF_ALU64`` means::
> +``BPF_XOR | BPF_K | BPF_ALU64`` (0xa7) means::
>  
>    src = src ^ imm
>  
> @@ -204,8 +225,9 @@ The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit
>  The byte swap instructions operate on the destination register
>  only and do not use a separate source register or immediate value.
>  
> -The 1-bit source operand field in the opcode is used to to select what byte
> -order the operation convert from or to:
> +Byte swap instructions use non-default semantics of the 1-bit 'source' field in

I would drop 'non-default'. Many fields have different meanings depending on opcode.
BPF_SRC() macro just reads that bit.

> +the 'opcode' field.  Instead of indicating the source operator, it is instead
> +used to select what byte order the operation converts from or to:
>  
>  =========  =====  =================================================
>  source     value  description
> @@ -215,24 +237,33 @@ BPF_TO_BE  0x08   convert between host byte order and big endian
>  =========  =====  =================================================
>  
>  The 'imm' field encodes the width of the swap operations.  The following widths
> -are supported: 16, 32 and 64.
> -
> -Examples:
> -
> -``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
> -
> -  dst = htole16(dst)
> -
> -``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
> -
> -  dst = htobe64(dst)
> +are supported: 16, 32 and 64. The following table summarizes the resulting
> +possibilities:
> +
> +=============================  =========  ===  ========  ==================
> +opcode construction            opcode     imm  mnemonic  pseudocode
> +=============================  =========  ===  ========  ==================
> +BPF_END | BPF_TO_LE | BPF_ALU  0xd4       16   le16 dst  dst = htole16(dst)
> +BPF_END | BPF_TO_LE | BPF_ALU  0xd4       32   le32 dst  dst = htole32(dst)
> +BPF_END | BPF_TO_LE | BPF_ALU  0xd4       64   le64 dst  dst = htole64(dst)
> +BPF_END | BPF_TO_BE | BPF_ALU  0xdc       16   be16 dst  dst = htobe16(dst)
> +BPF_END | BPF_TO_BE | BPF_ALU  0xdc       32   be32 dst  dst = htobe32(dst)
> +BPF_END | BPF_TO_BE | BPF_ALU  0xdc       64   be64 dst  dst = htobe64(dst)
> +=============================  =========  ===  ========  ==================
> +
> +where
> +
> +* mnenomic indicates a short form that might be displayed by some tools such as disassemblers
> +* 'htoleNN()' indicates converting a NN-bit value from host byte order to little-endian byte order
> +* 'htobeNN()' indicates converting a NN-bit value from host byte order to big-endian byte order

I think we need to add normal bswap insn.
These to_le and to_be are very awkward to use.
As soon as we have new insn the compiler will be using it.
There is no equivalent of to_be and to_le in C. It wasn't good ISA design.

>  Jump instructions
>  -----------------
>  
> -``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
> +Instruction class ``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
>  otherwise identical operations.
> -The 'code' field encodes the operation as below:
> +
> +The 4-bit 'code' field encodes the operation as below, where PC is the program counter:
>  
>  ========  =====  =========================  ============
>  code      value  description                notes
> @@ -253,9 +284,6 @@ BPF_JSLT  0xc0   PC += off if dst < src     signed
>  BPF_JSLE  0xd0   PC += off if dst <= src    signed
>  ========  =====  =========================  ============
>  
> -The eBPF program needs to store the return value into register R0 before doing a
> -BPF_EXIT.
> -
>  Helper functions
>  ~~~~~~~~~~~~~~~~
>  Helper functions are a concept whereby BPF programs can call into a
> @@ -285,7 +313,8 @@ For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_
>  mode          size    instruction class
>  ============  ======  =================
>  
> -The mode modifier is one of:
> +mode
> +  one of:
>  
>    =============  =====  ====================================  =============
>    mode modifier  value  description                           reference
> @@ -297,7 +326,8 @@ The mode modifier is one of:
>    BPF_ATOMIC     0xc0   atomic operations                     `Atomic operations`_
>    =============  =====  ====================================  =============
>  
> -The size modifier is one of:
> +size
> +  one of:
>  
>    =============  =====  =====================
>    size modifier  value  description
> @@ -308,25 +338,31 @@ The size modifier is one of:
>    BPF_DW         0x18   double word (8 bytes)
>    =============  =====  =====================
>  
> +instruction class
> +  the instruction class (see `Instruction classes`_)
> +
>  Regular load and store operations
>  ---------------------------------
>  
>  The ``BPF_MEM`` mode modifier is used to encode regular load and store
>  instructions that transfer data between a register and memory.
>  
> -``BPF_MEM | <size> | BPF_STX`` means::
> -
> -  *(size *) (dst + offset) = src_reg
> -
> -``BPF_MEM | <size> | BPF_ST`` means::
> -
> -  *(size *) (dst + offset) = imm32
> -
> -``BPF_MEM | <size> | BPF_LDX`` means::
> -
> -  dst = *(size *) (src + offset)
> -
> -Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
> +=============================  =========  ====================================
> +opcode construction            opcode     pseudocode
> +=============================  =========  ====================================
> +BPF_MEM | BPF_B | BPF_LDX      0x71       dst = \*(uint8_t \*) (src + offset)
> +BPF_MEM | BPF_H | BPF_LDX      0x69       dst = \*(uint16_t \*) (src + offset)
> +BPF_MEM | BPF_W | BPF_LDX      0x61       dst = \*(uint32_t \*) (src + offset)
> +BPF_MEM | BPF_DW | BPF_LDX     0x79       dst = \*(uint64_t \*) (src + offset)
> +BPF_MEM | BPF_B | BPF_ST       0x72       \*(uint8_t \*) (dst + offset) = imm
> +BPF_MEM | BPF_H | BPF_ST       0x6a       \*(uint16_t \*) (dst + offset) = imm
> +BPF_MEM | BPF_W | BPF_ST       0x62       \*(uint32_t \*) (dst + offset) = imm
> +BPF_MEM | BPF_DW | BPF_ST      0x7a       \*(uint64_t \*) (dst + offset) = imm
> +BPF_MEM | BPF_B | BPF_STX      0x73       \*(uint8_t \*) (dst + offset) = src
> +BPF_MEM | BPF_H | BPF_STX      0x6b       \*(uint16_t \*) (dst + offset) = src
> +BPF_MEM | BPF_W | BPF_STX      0x63       \*(uint32_t \*) (dst + offset) = src
> +BPF_MEM | BPF_DW | BPF_STX     0x7b       \*(uint64_t \*) (dst + offset) = src
> +=============================  =========  ====================================

I think the table is more verbose and less readable than the original text.

>  
>  Atomic operations
>  -----------------
> @@ -338,9 +374,11 @@ by other eBPF programs or means outside of this specification.
>  All atomic operations supported by eBPF are encoded as store operations
>  that use the ``BPF_ATOMIC`` mode modifier as follows:
>  
> -* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
> -* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
> -* 8-bit and 16-bit wide atomic operations are not supported.
> +* ``BPF_ATOMIC | BPF_W | BPF_STX`` (0xc3) for 32-bit operations
> +* ``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) for 64-bit operations
> +
> +Note that 8-bit (``BPF_B``) and 16-bit (``BPF_H``) wide atomic operations are not supported,
> +nor is ``BPF_ATOMIC | <size> | BPF_ST``.
>  
>  The 'imm' field is used to encode the actual atomic operation.
>  Simple atomic operation use a subset of the values defined to encode
> @@ -355,16 +393,15 @@ BPF_AND   0x50   atomic and
>  BPF_XOR   0xa0   atomic xor
>  ========  =====  ===========
>  
> -
> -``BPF_ATOMIC | BPF_W  | BPF_STX`` with 'imm' = BPF_ADD means::
> +``BPF_ATOMIC | BPF_W  | BPF_STX`` (0xc3) with 'imm' = BPF_ADD means::
>  
>    *(uint32_t *)(dst + offset) += src
>  
> -``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means::
> +``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) with 'imm' = BPF ADD means::
>  
>    *(uint64_t *)(dst + offset) += src
>  
> -In addition to the simple atomic operations, there also is a modifier and
> +In addition to the simple atomic operations above, there also is a modifier and
>  two complex atomic operations:
>  
>  ===========  ================  ===========================
> -- 
> 2.33.4
>
Dave Thaler Oct. 4, 2022, 2:32 p.m. UTC | #2
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > +The eBPF instruction set consists of eleven 64 bit registers, a
> > +program counter, and 512 bytes of stack space.
> 
> I would not add 512 to a doc.
> That's what we have now, but we might relax that in the future.

I think it's important to at least give a minimum.  Will change to
"at least 512".

[...]
> > +Registers R0 - R5 are scratch registers, meaning the BPF program
> > +needs to either spill them to the BPF stack or move them to callee
> > +saved registers if these arguments are to be reused across multiple
> > +function calls. Spilling means that the value in the register is
> > +moved to the BPF stack. The reverse operation of moving the variable
> from the BPF stack to the register is called filling.
> > +The reason for spilling/filling is due to the limited number of registers.
> 
> More canonical way would be to say that r0-r5 are caller saved.

Will change "scratch" to "caller-saved"

[...]
> > -``BPF_ADD | BPF_X | BPF_ALU64`` means::
> > +``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means::
> 
> I don't think adding hex values help here.

I found it very helpful in verifying that the Appendix table
was correct, and providing a correlation to the text here
that shows the construction of the value.  So I'd like to keep them.

[...]
> > -The 1-bit source operand field in the opcode is used to to select
> > what byte -order the operation convert from or to:
> > +Byte swap instructions use non-default semantics of the 1-bit
> > +'source' field in
> 
> I would drop 'non-default'. Many fields have different meanings depending
> on opcode.
> BPF_SRC() macro just reads that bit.

Will reword to say:
Byte swap instructions use the 1-bit 'source' field in the 'opcode' field
as follows.  Instead of indicating the source operator, it is instead
used to select what byte order the operation converts from or to:

[...]
> > +* mnenomic indicates a short form that might be displayed by some
> > +tools such as disassemblers
> > +* 'htoleNN()' indicates converting a NN-bit value from host byte
> > +order to little-endian byte order
> > +* 'htobeNN()' indicates converting a NN-bit value from host byte
> > +order to big-endian byte order
> 
> I think we need to add normal bswap insn.
> These to_le and to_be are very awkward to use.
> As soon as we have new insn the compiler will be using it.
> There is no equivalent of to_be and to_le in C. It wasn't good ISA design.

I will interpret this as a request for someone to do code work, rather
than any request for immediately changes to the doc :)

[...]
> >  Regular load and store operations
> >  ---------------------------------
> >
> >  The ``BPF_MEM`` mode modifier is used to encode regular load and
> > store  instructions that transfer data between a register and memory.
> >
> > -``BPF_MEM | <size> | BPF_STX`` means::
> > -
> > -  *(size *) (dst + offset) = src_reg
> > -
> > -``BPF_MEM | <size> | BPF_ST`` means::
> > -
> > -  *(size *) (dst + offset) = imm32
> > -
> > -``BPF_MEM | <size> | BPF_LDX`` means::
> > -
> > -  dst = *(size *) (src + offset)
> > -
> > -Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
> > +=============================  =========
> ====================================
> > +opcode construction            opcode     pseudocode
> > +=============================  =========
> ====================================
> > +BPF_MEM | BPF_B | BPF_LDX      0x71       dst = \*(uint8_t \*) (src + offset)
> > +BPF_MEM | BPF_H | BPF_LDX      0x69       dst = \*(uint16_t \*) (src +
> offset)
> > +BPF_MEM | BPF_W | BPF_LDX      0x61       dst = \*(uint32_t \*) (src +
> offset)
> > +BPF_MEM | BPF_DW | BPF_LDX     0x79       dst = \*(uint64_t \*) (src +
> offset)
> > +BPF_MEM | BPF_B | BPF_ST       0x72       \*(uint8_t \*) (dst + offset) = imm
> > +BPF_MEM | BPF_H | BPF_ST       0x6a       \*(uint16_t \*) (dst + offset) =
> imm
> > +BPF_MEM | BPF_W | BPF_ST       0x62       \*(uint32_t \*) (dst + offset) =
> imm
> > +BPF_MEM | BPF_DW | BPF_ST      0x7a       \*(uint64_t \*) (dst + offset) =
> imm
> > +BPF_MEM | BPF_B | BPF_STX      0x73       \*(uint8_t \*) (dst + offset) = src
> > +BPF_MEM | BPF_H | BPF_STX      0x6b       \*(uint16_t \*) (dst + offset) =
> src
> > +BPF_MEM | BPF_W | BPF_STX      0x63       \*(uint32_t \*) (dst + offset) =
> src
> > +BPF_MEM | BPF_DW | BPF_STX     0x7b       \*(uint64_t \*) (dst + offset) =
> src
> > +=============================  =========
> > +====================================
> 
> I think the table is more verbose and less readable than the original text.

Will change back to original text.

Dave
Alexei Starovoitov Oct. 4, 2022, 3:38 p.m. UTC | #3
On Tue, Oct 4, 2022 at 7:32 AM Dave Thaler <dthaler@microsoft.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > +The eBPF instruction set consists of eleven 64 bit registers, a
> > > +program counter, and 512 bytes of stack space.
> >
> > I would not add 512 to a doc.
> > That's what we have now, but we might relax that in the future.
>
> I think it's important to at least give a minimum.  Will change to
> "at least 512".

I'm not sure 512 stands as a minimum either.
When we have the gatekeeper prog it will be able to enforce
any stack size.

> [...]
> > > +Registers R0 - R5 are scratch registers, meaning the BPF program
> > > +needs to either spill them to the BPF stack or move them to callee
> > > +saved registers if these arguments are to be reused across multiple
> > > +function calls. Spilling means that the value in the register is
> > > +moved to the BPF stack. The reverse operation of moving the variable
> > from the BPF stack to the register is called filling.
> > > +The reason for spilling/filling is due to the limited number of registers.
> >
> > More canonical way would be to say that r0-r5 are caller saved.
>
> Will change "scratch" to "caller-saved"
>
> [...]
> > > -``BPF_ADD | BPF_X | BPF_ALU64`` means::
> > > +``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means::
> >
> > I don't think adding hex values help here.
>
> I found it very helpful in verifying that the Appendix table
> was correct, and providing a correlation to the text here
> that shows the construction of the value.  So I'd like to keep them.

I think that means that the appendix table shouldn't be there either.
I'd like to avoid both.

> [...]
> > > -The 1-bit source operand field in the opcode is used to to select
> > > what byte -order the operation convert from or to:
> > > +Byte swap instructions use non-default semantics of the 1-bit
> > > +'source' field in
> >
> > I would drop 'non-default'. Many fields have different meanings depending
> > on opcode.
> > BPF_SRC() macro just reads that bit.
>
> Will reword to say:
> Byte swap instructions use the 1-bit 'source' field in the 'opcode' field
> as follows.  Instead of indicating the source operator, it is instead
> used to select what byte order the operation converts from or to:

makes sense.

> [...]
> > > +* mnenomic indicates a short form that might be displayed by some
> > > +tools such as disassemblers
> > > +* 'htoleNN()' indicates converting a NN-bit value from host byte
> > > +order to little-endian byte order
> > > +* 'htobeNN()' indicates converting a NN-bit value from host byte
> > > +order to big-endian byte order
> >
> > I think we need to add normal bswap insn.
> > These to_le and to_be are very awkward to use.
> > As soon as we have new insn the compiler will be using it.
> > There is no equivalent of to_be and to_le in C. It wasn't good ISA design.
>
> I will interpret this as a request for someone to do code work, rather
> than any request for immediately changes to the doc :)
>
> [...]
> > >  Regular load and store operations
> > >  ---------------------------------
> > >
> > >  The ``BPF_MEM`` mode modifier is used to encode regular load and
> > > store  instructions that transfer data between a register and memory.
> > >
> > > -``BPF_MEM | <size> | BPF_STX`` means::
> > > -
> > > -  *(size *) (dst + offset) = src_reg
> > > -
> > > -``BPF_MEM | <size> | BPF_ST`` means::
> > > -
> > > -  *(size *) (dst + offset) = imm32
> > > -
> > > -``BPF_MEM | <size> | BPF_LDX`` means::
> > > -
> > > -  dst = *(size *) (src + offset)
> > > -
> > > -Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
> > > +=============================  =========
> > ====================================
> > > +opcode construction            opcode     pseudocode
> > > +=============================  =========
> > ====================================
> > > +BPF_MEM | BPF_B | BPF_LDX      0x71       dst = \*(uint8_t \*) (src + offset)
> > > +BPF_MEM | BPF_H | BPF_LDX      0x69       dst = \*(uint16_t \*) (src +
> > offset)
> > > +BPF_MEM | BPF_W | BPF_LDX      0x61       dst = \*(uint32_t \*) (src +
> > offset)
> > > +BPF_MEM | BPF_DW | BPF_LDX     0x79       dst = \*(uint64_t \*) (src +
> > offset)
> > > +BPF_MEM | BPF_B | BPF_ST       0x72       \*(uint8_t \*) (dst + offset) = imm
> > > +BPF_MEM | BPF_H | BPF_ST       0x6a       \*(uint16_t \*) (dst + offset) =
> > imm
> > > +BPF_MEM | BPF_W | BPF_ST       0x62       \*(uint32_t \*) (dst + offset) =
> > imm
> > > +BPF_MEM | BPF_DW | BPF_ST      0x7a       \*(uint64_t \*) (dst + offset) =
> > imm
> > > +BPF_MEM | BPF_B | BPF_STX      0x73       \*(uint8_t \*) (dst + offset) = src
> > > +BPF_MEM | BPF_H | BPF_STX      0x6b       \*(uint16_t \*) (dst + offset) =
> > src
> > > +BPF_MEM | BPF_W | BPF_STX      0x63       \*(uint32_t \*) (dst + offset) =
> > src
> > > +BPF_MEM | BPF_DW | BPF_STX     0x7b       \*(uint64_t \*) (dst + offset) =
> > src
> > > +=============================  =========
> > > +====================================
> >
> > I think the table is more verbose and less readable than the original text.
>
> Will change back to original text.

Please see git. I've removed that table. Please don't add it back.
I see no value in such tables other than more things to get wrong.
Dave Thaler Oct. 4, 2022, 3:55 p.m. UTC | #4
> > I found it very helpful in verifying that the Appendix table was
> > correct, and providing a correlation to the text here that shows the
> > construction of the value.  So I'd like to keep them.
> 
> I think that means that the appendix table shouldn't be there either.
> I'd like to avoid both.

I've heard from multiple people with different affiliations that the appendix
is the most useful part of the document, and what they wanted to see
added.  So I added by popular request.

Dave
Dave Thaler Oct. 4, 2022, 3:56 p.m. UTC | #5
Also worth noting that Quentin has a script that I believe parses the appendix
and uses it to generate a validator for ebpf programs.  (Which also
helps validate the appendix).

Dave

> -----Original Message-----
> From: Dave Thaler
> Sent: Tuesday, October 4, 2022 8:55 AM
> To: 'Alexei Starovoitov' <alexei.starovoitov@gmail.com>
> Cc: bpf@vger.kernel.org
> Subject: RE: [PATCH 11/15] ebpf-docs: Improve English readability
> 
> > > I found it very helpful in verifying that the Appendix table was
> > > correct, and providing a correlation to the text here that shows the
> > > construction of the value.  So I'd like to keep them.
> >
> > I think that means that the appendix table shouldn't be there either.
> > I'd like to avoid both.
> 
> I've heard from multiple people with different affiliations that the appendix
> is the most useful part of the document, and what they wanted to see
> added.  So I added by popular request.
> 
> Dave
Alexei Starovoitov Oct. 4, 2022, 4:19 p.m. UTC | #6
On Tue, Oct 4, 2022 at 8:56 AM Dave Thaler <dthaler@microsoft.com> wrote:
>
> Also worth noting that Quentin has a script that I believe parses the appendix
> and uses it to generate a validator for ebpf programs.  (Which also
> helps validate the appendix).

The last thing I want to see is a document becoming a description
of the code.
We've always been doing it the other way around.
The documentation can live next to the code and docs automatically
generated from .h or .c files.
Doing the other way around sooner or later will be a disaster.
Imagine a typo in instruction-set.rst.
What should we do next? Fix a typo and say, look, the code
behaves differently, so we're fixing the doc.
If so, there is close to zero reason to add hex to the doc,
since it's not an authoritative answer.
On the other hand if instruction-set.rst is the source of
the truth then the code would have to change, which we obviously
cannot do. So let's not get us into the corner with
such tables.

> Dave
>
> > -----Original Message-----
> > From: Dave Thaler
> > Sent: Tuesday, October 4, 2022 8:55 AM
> > To: 'Alexei Starovoitov' <alexei.starovoitov@gmail.com>
> > Cc: bpf@vger.kernel.org
> > Subject: RE: [PATCH 11/15] ebpf-docs: Improve English readability
> >
> > > > I found it very helpful in verifying that the Appendix table was
> > > > correct, and providing a correlation to the text here that shows the
> > > > construction of the value.  So I'd like to keep them.
> > >
> > > I think that means that the appendix table shouldn't be there either.
> > > I'd like to avoid both.
> >
> > I've heard from multiple people with different affiliations that the appendix
> > is the most useful part of the document, and what they wanted to see
> > added.  So I added by popular request.

These people should speak up then.
Dave Thaler Oct. 4, 2022, 4:41 p.m. UTC | #7
> -----Original Message-----
> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Sent: Tuesday, October 4, 2022 9:20 AM
> To: Dave Thaler <dthaler@microsoft.com>
> Cc: bpf@vger.kernel.org
> Subject: Re: [PATCH 11/15] ebpf-docs: Improve English readability
> 
> On Tue, Oct 4, 2022 at 8:56 AM Dave Thaler <dthaler@microsoft.com>
> wrote:
> >
> > Also worth noting that Quentin has a script that I believe parses the
> > appendix and uses it to generate a validator for ebpf programs.
> > (Which also helps validate the appendix).
> 
> The last thing I want to see is a document becoming a description of the
> code.
> We've always been doing it the other way around.
> The documentation can live next to the code and docs automatically
> generated from .h or .c files.
> Doing the other way around sooner or later will be a disaster.
> Imagine a typo in instruction-set.rst.
> What should we do next? Fix a typo and say, look, the code behaves
> differently, so we're fixing the doc.
> If so, there is close to zero reason to add hex to the doc, since it's not an
> authoritative answer.
> On the other hand if instruction-set.rst is the source of the truth then the
> code would have to change, which we obviously cannot do. So let's not get us
> into the corner with such tables.

The point of a standard is to be a source of truth.  If another implementation
(ubpf, hbpf, etc.) doesn't match the spec, then the code would have to change.
Such tables are seen as invaluable for determining correctness of other
implementations.   So the feedback is that it's important to have such if we
want everyone else to do the right thing.

> These people should speak up then.

I agree.

Dave
Dave Thaler Oct. 4, 2022, 4:54 p.m. UTC | #8
Regarding the tables:
> Such tables are seen as invaluable for determining correctness of other
> implementations.   So the feedback is that it's important to have such if we
> want everyone else to do the right thing.
> 
> > These people should speak up then.
> 
> I agree.

Here's two public examples...

Christoph Hellwig, said on May 17 at https://lore.kernel.org/bpf/20220517091011.GA18723@lst.de/:
> One useful thing for this would be an opcode table with all the 
> optional field usage in machine readable format.
>
> Jim who is on CC has already built a nice table off all opcodes based 
> on existing material that might be a good starting point.

Jim Harris responded on that thread with a strawman which was
used as the basis for the table in the appendix.

Jim then commented in the github version on August 30:
> In my opinion, this table is the biggest thing that has been missing, 
> and will be most essential for a more "formal" specification.

I will encourage them and others to comment on this thread.

Dave
Jim Harris Oct. 6, 2022, 8:44 p.m. UTC | #9
On Tue, Oct 4, 2022 at 10:09 AM Dave Thaler <dthaler@microsoft.com> wrote:
>
> Regarding the tables:
> > Such tables are seen as invaluable for determining correctness of other
> > implementations.   So the feedback is that it's important to have such if we
> > want everyone else to do the right thing.
> >
> > > These people should speak up then.
> >
> > I agree.
>
> Here's two public examples...
>
> Christoph Hellwig, said on May 17 at https://lore.kernel.org/bpf/20220517091011.GA18723@lst.de/:
> > One useful thing for this would be an opcode table with all the
> > optional field usage in machine readable format.
> >
> > Jim who is on CC has already built a nice table off all opcodes based
> > on existing material that might be a good starting point.
>
> Jim Harris responded on that thread with a strawman which was
> used as the basis for the table in the appendix.
>
> Jim then commented in the github version on August 30:
> > In my opinion, this table is the biggest thing that has been missing,
> > and will be most essential for a more "formal" specification.

I think an opcode table is a huge help for developing alternate eBPF
implementations - anything that makes it more explicit which opcodes
are valid (and which ones are not).

For example, the section for BPF_ALU and BPF_ALU64 classes lists the
operation codes.  BPF_END is in that list.  Description says "see
separate section below".  The "Byte swap instructions" section then
says that byte swap instructions always use BPF_ALU, even for 64-bit
widths.  So an implementer can synthesize all of this to determine
that opcodes 0xD7 and 0xDF (which have BPF_ALU64 | BPF_END) are not
valid.  It would be more clear if somewhere there was a list that
explicitly showed that 0xD4 and 0xDC (BPF_ALU | BPF_END) were valid,
but 0xD7 and 0xDF were not.

If there's concern that an appendix gets out of sync with the code and
the primary sections of the instruction set doc, maybe these opcodes
can be added to the existing per-class-code tables in the instruction
set doc.  A full opcode table could probably be generated from those
tables instead of the hand-written one that Dave and I worked on in
his patch.

-Jim


> I will encourage them and others to comment on this thread.
diff mbox series

Patch

diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst
index b6f098104..328207ff6 100644
--- a/Documentation/bpf/instruction-set.rst
+++ b/Documentation/bpf/instruction-set.rst
@@ -7,6 +7,9 @@  eBPF Instruction Set Specification, v1.0
 
 This document specifies version 1.0 of the eBPF instruction set.
 
+The eBPF instruction set consists of eleven 64 bit registers, a program counter,
+and 512 bytes of stack space.
+
 Documentation conventions
 =========================
 
@@ -25,12 +28,24 @@  The eBPF calling convention is defined as:
 * R6 - R9: callee saved registers that function calls will preserve
 * R10: read-only frame pointer to access stack
 
-R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if
-necessary across calls.
+Registers R0 - R5 are scratch registers, meaning the BPF program needs to either
+spill them to the BPF stack or move them to callee saved registers if these
+arguments are to be reused across multiple function calls. Spilling means
+that the value in the register is moved to the BPF stack. The reverse operation
+of moving the variable from the BPF stack to the register is called filling.
+The reason for spilling/filling is due to the limited number of registers.
+
+Upon entering execution of an eBPF program, registers R1 - R5 initially can contain
+the input arguments for the program (similar to the argc/argv pair for a typical C program).
+The actual number of registers used, and their meaning, is defined by the program type;
+for example, a networking program might have an argument that includes network packet data
+and/or metadata.
 
 Instruction encoding
 ====================
 
+An eBPF program is a sequence of instructions.
+
 eBPF has two instruction encodings:
 
 * the basic instruction encoding, which uses 64 bits to encode an instruction
@@ -63,7 +78,7 @@  opcode
   operation to perform
 
 Note that most instructions do not use all of the fields.
-Unused fields shall be cleared to zero.
+Unused fields must be set to zero.
 
 As discussed below in `64-bit immediate instructions`_, some
 instructions use a 64-bit immediate value that is constructed as follows.
@@ -90,7 +105,9 @@  and destination registers, respectively, rather than the register number.
 Instruction classes
 -------------------
 
-The three LSB bits of the 'opcode' field store the instruction class:
+The encoding of the 'opcode' field varies and can be determined from
+the three least significant bits (LSB) of the 'opcode' field which holds
+the "instruction class", as follows:
 
 =========  =====  ===============================  ===================================
 class      value  description                      reference
@@ -136,9 +153,11 @@  instruction class
 Arithmetic instructions
 -----------------------
 
-``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
+Instruction class ``BPF_ALU`` uses 32-bit wide operands (zeroing the upper 32 bits
+of the destination register) while ``BPF_ALU64`` uses 64-bit wide operands for
 otherwise identical operations.
-The 'code' field encodes the operation as below:
+
+The 4-bit 'code' field encodes the operation as follows:
 
 ========  =====  ==========================================================
 code      value  description
@@ -168,21 +187,23 @@  the destination register is instead set to zero.
 If execution would result in modulo by zero,
 the destination register is instead left unchanged.
 
-``BPF_ADD | BPF_X | BPF_ALU`` means::
+Examples:
+
+``BPF_ADD | BPF_X | BPF_ALU`` (0x0c) means::
 
   dst = (uint32_t) (dst + src)
 
 where '(uint32_t)' indicates truncation to 32 bits.
 
-``BPF_ADD | BPF_X | BPF_ALU64`` means::
+``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means::
 
   dst = dst + src
 
-``BPF_XOR | BPF_K | BPF_ALU`` means::
+``BPF_XOR | BPF_K | BPF_ALU`` (0xa4) means::
 
   src = (uint32_t) src ^ (uint32_t) imm
 
-``BPF_XOR | BPF_K | BPF_ALU64`` means::
+``BPF_XOR | BPF_K | BPF_ALU64`` (0xa7) means::
 
   src = src ^ imm
 
@@ -204,8 +225,9 @@  The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit
 The byte swap instructions operate on the destination register
 only and do not use a separate source register or immediate value.
 
-The 1-bit source operand field in the opcode is used to to select what byte
-order the operation convert from or to:
+Byte swap instructions use non-default semantics of the 1-bit 'source' field in
+the 'opcode' field.  Instead of indicating the source operator, it is instead
+used to select what byte order the operation converts from or to:
 
 =========  =====  =================================================
 source     value  description
@@ -215,24 +237,33 @@  BPF_TO_BE  0x08   convert between host byte order and big endian
 =========  =====  =================================================
 
 The 'imm' field encodes the width of the swap operations.  The following widths
-are supported: 16, 32 and 64.
-
-Examples:
-
-``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
-
-  dst = htole16(dst)
-
-``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
-
-  dst = htobe64(dst)
+are supported: 16, 32 and 64. The following table summarizes the resulting
+possibilities:
+
+=============================  =========  ===  ========  ==================
+opcode construction            opcode     imm  mnemonic  pseudocode
+=============================  =========  ===  ========  ==================
+BPF_END | BPF_TO_LE | BPF_ALU  0xd4       16   le16 dst  dst = htole16(dst)
+BPF_END | BPF_TO_LE | BPF_ALU  0xd4       32   le32 dst  dst = htole32(dst)
+BPF_END | BPF_TO_LE | BPF_ALU  0xd4       64   le64 dst  dst = htole64(dst)
+BPF_END | BPF_TO_BE | BPF_ALU  0xdc       16   be16 dst  dst = htobe16(dst)
+BPF_END | BPF_TO_BE | BPF_ALU  0xdc       32   be32 dst  dst = htobe32(dst)
+BPF_END | BPF_TO_BE | BPF_ALU  0xdc       64   be64 dst  dst = htobe64(dst)
+=============================  =========  ===  ========  ==================
+
+where
+
+* mnenomic indicates a short form that might be displayed by some tools such as disassemblers
+* 'htoleNN()' indicates converting a NN-bit value from host byte order to little-endian byte order
+* 'htobeNN()' indicates converting a NN-bit value from host byte order to big-endian byte order
 
 Jump instructions
 -----------------
 
-``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
+Instruction class ``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
 otherwise identical operations.
-The 'code' field encodes the operation as below:
+
+The 4-bit 'code' field encodes the operation as below, where PC is the program counter:
 
 ========  =====  =========================  ============
 code      value  description                notes
@@ -253,9 +284,6 @@  BPF_JSLT  0xc0   PC += off if dst < src     signed
 BPF_JSLE  0xd0   PC += off if dst <= src    signed
 ========  =====  =========================  ============
 
-The eBPF program needs to store the return value into register R0 before doing a
-BPF_EXIT.
-
 Helper functions
 ~~~~~~~~~~~~~~~~
 Helper functions are a concept whereby BPF programs can call into a
@@ -285,7 +313,8 @@  For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_
 mode          size    instruction class
 ============  ======  =================
 
-The mode modifier is one of:
+mode
+  one of:
 
   =============  =====  ====================================  =============
   mode modifier  value  description                           reference
@@ -297,7 +326,8 @@  The mode modifier is one of:
   BPF_ATOMIC     0xc0   atomic operations                     `Atomic operations`_
   =============  =====  ====================================  =============
 
-The size modifier is one of:
+size
+  one of:
 
   =============  =====  =====================
   size modifier  value  description
@@ -308,25 +338,31 @@  The size modifier is one of:
   BPF_DW         0x18   double word (8 bytes)
   =============  =====  =====================
 
+instruction class
+  the instruction class (see `Instruction classes`_)
+
 Regular load and store operations
 ---------------------------------
 
 The ``BPF_MEM`` mode modifier is used to encode regular load and store
 instructions that transfer data between a register and memory.
 
-``BPF_MEM | <size> | BPF_STX`` means::
-
-  *(size *) (dst + offset) = src_reg
-
-``BPF_MEM | <size> | BPF_ST`` means::
-
-  *(size *) (dst + offset) = imm32
-
-``BPF_MEM | <size> | BPF_LDX`` means::
-
-  dst = *(size *) (src + offset)
-
-Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
+=============================  =========  ====================================
+opcode construction            opcode     pseudocode
+=============================  =========  ====================================
+BPF_MEM | BPF_B | BPF_LDX      0x71       dst = \*(uint8_t \*) (src + offset)
+BPF_MEM | BPF_H | BPF_LDX      0x69       dst = \*(uint16_t \*) (src + offset)
+BPF_MEM | BPF_W | BPF_LDX      0x61       dst = \*(uint32_t \*) (src + offset)
+BPF_MEM | BPF_DW | BPF_LDX     0x79       dst = \*(uint64_t \*) (src + offset)
+BPF_MEM | BPF_B | BPF_ST       0x72       \*(uint8_t \*) (dst + offset) = imm
+BPF_MEM | BPF_H | BPF_ST       0x6a       \*(uint16_t \*) (dst + offset) = imm
+BPF_MEM | BPF_W | BPF_ST       0x62       \*(uint32_t \*) (dst + offset) = imm
+BPF_MEM | BPF_DW | BPF_ST      0x7a       \*(uint64_t \*) (dst + offset) = imm
+BPF_MEM | BPF_B | BPF_STX      0x73       \*(uint8_t \*) (dst + offset) = src
+BPF_MEM | BPF_H | BPF_STX      0x6b       \*(uint16_t \*) (dst + offset) = src
+BPF_MEM | BPF_W | BPF_STX      0x63       \*(uint32_t \*) (dst + offset) = src
+BPF_MEM | BPF_DW | BPF_STX     0x7b       \*(uint64_t \*) (dst + offset) = src
+=============================  =========  ====================================
 
 Atomic operations
 -----------------
@@ -338,9 +374,11 @@  by other eBPF programs or means outside of this specification.
 All atomic operations supported by eBPF are encoded as store operations
 that use the ``BPF_ATOMIC`` mode modifier as follows:
 
-* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
-* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
-* 8-bit and 16-bit wide atomic operations are not supported.
+* ``BPF_ATOMIC | BPF_W | BPF_STX`` (0xc3) for 32-bit operations
+* ``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) for 64-bit operations
+
+Note that 8-bit (``BPF_B``) and 16-bit (``BPF_H``) wide atomic operations are not supported,
+nor is ``BPF_ATOMIC | <size> | BPF_ST``.
 
 The 'imm' field is used to encode the actual atomic operation.
 Simple atomic operation use a subset of the values defined to encode
@@ -355,16 +393,15 @@  BPF_AND   0x50   atomic and
 BPF_XOR   0xa0   atomic xor
 ========  =====  ===========
 
-
-``BPF_ATOMIC | BPF_W  | BPF_STX`` with 'imm' = BPF_ADD means::
+``BPF_ATOMIC | BPF_W  | BPF_STX`` (0xc3) with 'imm' = BPF_ADD means::
 
   *(uint32_t *)(dst + offset) += src
 
-``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means::
+``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) with 'imm' = BPF ADD means::
 
   *(uint64_t *)(dst + offset) += src
 
-In addition to the simple atomic operations, there also is a modifier and
+In addition to the simple atomic operations above, there also is a modifier and
 two complex atomic operations:
 
 ===========  ================  ===========================