diff mbox series

[1/2] pack-format.txt: define "varint" format

Message ID 42c6206b102cd97290fd9ad207bb39b20660064c.1608537234.git.martin.agren@gmail.com (mailing list archive)
State New, archived
Headers show
Series pack-format.txt: document lengths at start of delta data | expand

Commit Message

Martin Ågren Dec. 21, 2020, 7:54 a.m. UTC
We define our varint format pretty much on the fly as we describe a pack
file entry. In preparation for referring to it in more places in this
document, define "varint" and refer to it.

Signed-off-by: Martin Ågren <martin.agren@gmail.com>
---
 Documentation/technical/pack-format.txt | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

Comments

Junio C Hamano Dec. 21, 2020, 9:40 p.m. UTC | #1
Martin Ågren <martin.agren@gmail.com> writes:

> We define our varint format pretty much on the fly as we describe a pack
> file entry. In preparation for referring to it in more places in this
> document, define "varint" and refer to it.
>
> Signed-off-by: Martin Ågren <martin.agren@gmail.com>
> ---
>  Documentation/technical/pack-format.txt | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index f96b2e605f..42198de74c 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -55,6 +55,15 @@ Valid object types are:
>  
>  Type 5 is reserved for future expansion. Type 0 is invalid.
>  
> +=== Variable-length integer encoding
> +
> +This document uses "varint" encoding of non-negative integers: From
> +each byte, the seven least significant bits are used to form the
> +resulting integer. As long as the most significant bit is 1, this
> +process continues; the byte with MSB 0 provides the last seven bits.
> +The seven-bit chunks are concatenated. Later values are more
> +significant.
> +

Unfortunately we have two kinds of varint, and the above describes
the older and less efficient variant that is used to encode

 (1) the size of the base object and the object that would result
     by applying a delta (stored at the beginning of each datum that
     represents a deltified object, and read by get_delta_hdr_size())

 (2) the size of the object that is represented in the base
     representation and read by unpack_object_header())

(and nothing else as far as I know).

There is another varint encoding that is slightly more efficient and
it also is used in the packfile format to encode the OFS_DELTA
offset, i.e. the number of bytes to go back in the same packfile
starting at the beginning of OFS_DELTA dats to find the object the
delta applies to.  This newer variant is what is known as "varint"
and used throughout the other parts of the system (see hits from
"git grep -e ncode_varint").

We need to be careful when using a generic "varint" to mean the
older variant as it would confuse readers of OFS_DELTA section.

	... goes and looks ...

The phrase "offset encoding" is used in the document to talk about
OFS_DELTA offset.  It is actually what the rest of the code thinks
is the canonical varint defined in varint.[ch]).

A way to avoid confusion would be to refrain from using "varint" as
the primary way to describe this size field; instead explain it as
the "size encoding", to match "offset encoding" used for OFS_DELTA.

It may also help if we added to the description of "offset encoding"
that it is what other parts of the system consider the canonical
"varint" encoding.

Thanks.

>  === Deltified representation
>  Conceptually there are only four object types: commit, tree, tag and
> @@ -196,10 +205,10 @@ Pack file entry: <+
>  	1-byte size extension bit (MSB)
>  	       type (next 3 bit)
>  	       size0 (lower 4-bit)
> -        n-byte sizeN (as long as MSB is set, each 7-bit)
> -		size0..sizeN form 4+7+7+..+7 bit integer, size0
> -		is the least significant part, and sizeN is the
> -		most significant part.
> +        n-byte size1 (varint encoding; present if MSB is set)
> +        If the MSB is set, the size is size0 + 16*size1, otherwise
> +        it is size0. (Equivalently, the entire packed object header
> +        is a varint encoding of (size/16)*128 + type*16 + size%16.)
>       packed object data:
>          If it is not DELTA, then deflated bytes (the size above
>  		is the size before compression).
Martin Ågren Dec. 29, 2020, 10:41 p.m. UTC | #2
On Mon, 21 Dec 2020 at 22:40, Junio C Hamano <gitster@pobox.com> wrote:
>
> Martin Ågren <martin.agren@gmail.com> writes:
>
> > We define our varint format pretty much on the fly as we describe a pack
> > file entry. In preparation for referring to it in more places in this
> > document, define "varint" and refer to it.

> We need to be careful when using a generic "varint" to mean the
> older variant as it would confuse readers of OFS_DELTA section.
>
>         ... goes and looks ...
>
> The phrase "offset encoding" is used in the document to talk about
> OFS_DELTA offset.  It is actually what the rest of the code thinks
> is the canonical varint defined in varint.[ch]).
>
> A way to avoid confusion would be to refrain from using "varint" as
> the primary way to describe this size field; instead explain it as
> the "size encoding", to match "offset encoding" used for OFS_DELTA.

Thank you very much for these comments. I will post a v2 soon, which
will do exactly this: avoid "varint" in favor of "size encoding".

> It may also help if we added to the description of "offset encoding"
> that it is what other parts of the system consider the canonical
> "varint" encoding.

I will leave this as #leftoverbits, though. I'll "only" fix the omission
reported by Ross.

Martin
diff mbox series

Patch

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f..42198de74c 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -55,6 +55,15 @@  Valid object types are:
 
 Type 5 is reserved for future expansion. Type 0 is invalid.
 
+=== Variable-length integer encoding
+
+This document uses "varint" encoding of non-negative integers: From
+each byte, the seven least significant bits are used to form the
+resulting integer. As long as the most significant bit is 1, this
+process continues; the byte with MSB 0 provides the last seven bits.
+The seven-bit chunks are concatenated. Later values are more
+significant.
+
 === Deltified representation
 
 Conceptually there are only four object types: commit, tree, tag and
@@ -196,10 +205,10 @@  Pack file entry: <+
 	1-byte size extension bit (MSB)
 	       type (next 3 bit)
 	       size0 (lower 4-bit)
-        n-byte sizeN (as long as MSB is set, each 7-bit)
-		size0..sizeN form 4+7+7+..+7 bit integer, size0
-		is the least significant part, and sizeN is the
-		most significant part.
+        n-byte size1 (varint encoding; present if MSB is set)
+        If the MSB is set, the size is size0 + 16*size1, otherwise
+        it is size0. (Equivalently, the entire packed object header
+        is a varint encoding of (size/16)*128 + type*16 + size%16.)
      packed object data:
         If it is not DELTA, then deflated bytes (the size above
 		is the size before compression).