[v2,17/17] chunk-format: add technical docs

Message ID 8f3985ab5df3e4abc6de6db7f71f1adcbc16b4a8.1611759716.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series Refactor chunk-format into an API

Commit Message

Derrick Stolee Jan. 27, 2021, 3:01 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/chunk-format.txt      | 54 +++++++++++++++++++
 .../technical/commit-graph-format.txt         |  3 ++
 Documentation/technical/pack-format.txt       |  3 ++
 3 files changed, 60 insertions(+)
 create mode 100644 Documentation/technical/chunk-format.txt

Comments

Junio C Hamano Feb. 5, 2021, 12:15 a.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +Chunk-based file formats
> +========================
> +
> +Some file formats in Git use a common concept of "chunks" to describe
> +sections of the file. This allows structured access to a large file by
> +scanning a small "table of contents" for the remaining data. This common
> +format is used by the `commit-graph` and `multi-pack-index` files. See
> +link:technical/pack-format.html[the `multi-pack-index` format] and
> +link:technical/commit-graph-format.html[the `commit-graph` format] for
> +how they use the chunks to describe structured data.
> +
> +A chunk-based file format begins with some header information custom to
> +that format. That header should include enough information to identify
> +the file type, format version, and number of chunks in the file. From this
> +information, a reader can determine the start of the chunk-based region.
> +
> +The chunk-based region starts with a table of contents describing where
> +each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
> +where C is the number of chunks. Consider the following table:
> +
> +  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
> +  |--------------------|------------------------|
> +  | ID[0]              | OFFSET[0]              |
> +  | ...                | ...                    |
> +  | ID[C-1]            | OFFSET[C-1]            |
> +  | 0x00000000         | OFFSET[C]              |
> +
> +Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
> +Each integer is stored in network byte order.
> +
> +The chunk identifier `ID[i]` is a label for the data stored within this
> +file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
> +size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
> +and `OFFSET[i]`. This requires that the chunk data appears contiguously
> +in the same order as the table of contents.
> +
> +The final entry in the table of contents must have an ID of four zero
> +bytes. This confirms that the table of contents is ending and provides
> +the offset for the end of the chunk-based data.
> +
> +Note: The chunk-based format expects that the file contains _at least_ a
> +trailing hash after `OFFSET[C]`.

I think the above describes what I saw in the writing side of the
code quite clearly and very well.  I somehow misread in my review of
[2/17] that the final offset in the table was pointing elsewhere, but
the code is clear that it points at the end of the last chunk, and
the above documents it well.
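
For illustration only, here is a minimal standalone sketch of how a
reader could walk the table of contents described above. It is not the
chunk-format.h API from this series; every name in it
(`toc_entry_sketch`, `be32_sketch`, `be64_sketch`, `parse_toc_sketch`,
`data_end`) is hypothetical, and rejecting a zero ID before the last row
is an assumption of the sketch rather than something the document
states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte chunk ID followed by 8-byte offset */

/* Hypothetical result type; not part of chunk-format.h. */
struct toc_entry_sketch {
	uint32_t id;
	uint64_t start; /* offset of this row, inclusive */
	uint64_t size;  /* next row's offset minus this row's offset */
};

/* Decode a 4-byte integer stored in network byte order. */
static uint32_t be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode an 8-byte integer stored in network byte order. */
static uint64_t be64_sketch(const unsigned char *p)
{
	return ((uint64_t)be32_sketch(p) << 32) | be32_sketch(p + 4);
}

/*
 * Walk a table of contents holding `nr` chunk rows plus one terminating
 * row; the caller guarantees `toc` spans (nr + 1) * TOC_ROW_SIZE bytes.
 * `data_end` is the largest offset a chunk may reach (for example, the
 * file size minus the trailing hash). Returns 0 on success, -1 if the
 * table is malformed.
 */
int parse_toc_sketch(const unsigned char *toc, size_t nr,
		     uint64_t data_end, struct toc_entry_sketch *out)
{
	size_t i;

	assert(toc && out);

	for (i = 0; i < nr; i++) {
		const unsigned char *row = toc + i * TOC_ROW_SIZE;
		uint32_t id = be32_sketch(row);
		uint64_t start = be64_sketch(row + 4);
		/* The next row's offset marks where this chunk ends. */
		uint64_t end = be64_sketch(row + TOC_ROW_SIZE + 4);

		if (!id)
			return -1; /* terminating ID showed up too early */
		if (end < start || end > data_end)
			return -1; /* out of order or out of bounds */

		out[i].id = id;
		out[i].start = start;
		out[i].size = end - start;
	}

	/* The last row must carry an all-zero ID. */
	return be32_sketch(toc + nr * TOC_ROW_SIZE) ? -1 : 0;
}
```

Given a table with two chunks at offsets 44 and 1068 and a terminating
offset of 1108, the sketch reports chunk sizes of 1024 and 40 bytes,
and it fails if `data_end` is smaller than 1108.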

My comments on the need to document the reading side API, on what
the read_chunk callback should be able to assume (namely, the whole
thing stays in memory until the caller that decided to use chunkfile
API decides to discard it), still stand, I would think.

Thanks.

Patch

diff --git a/Documentation/technical/chunk-format.txt b/Documentation/technical/chunk-format.txt
new file mode 100644
index 00000000000..3db3792dea2
--- /dev/null
+++ b/Documentation/technical/chunk-format.txt
@@ -0,0 +1,54 @@ 
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, a reader can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C-1]            | OFFSET[C-1]            |
+  | 0x00000000         | OFFSET[C]              |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must have an ID of four zero
+bytes. This confirms that the table of contents is ending and provides
+the offset for the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these functions provides extra checks that assist
+developers when creating new file formats, including:
+
+ 1. Writing and reading the table of contents.
+
+ 2. Verifying that the data written in a chunk matches the expected size
+    that was recorded in the table of contents.
+
+ 3. Checking that a table of contents describes offsets properly within
+    the file boundaries.
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b6658eff188..87971c27dd7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -61,6 +61,9 @@  CHUNK LOOKUP:
       the length using the next chunk position if necessary.) Each chunk
       ID appears at most once.
 
+  The CHUNK LOOKUP matches the table of contents from
+  link:technical/chunk-format.html[the chunk-based file format].
+
   The remaining data in the body is described one chunk at a time, and
   these chunks may be given in any order. Chunks are required unless
   otherwise specified.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f3..2fb1e60d29e 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -301,6 +301,9 @@  CHUNK LOOKUP:
 	    (Chunks are provided in file-order, so you can infer the length
 	    using the next chunk position if necessary.)
 
+	The CHUNK LOOKUP matches the table of contents from
+	link:technical/chunk-format.html[the chunk-based file format].
+
 	The remaining data in the body is described one chunk at a time, and
 	these chunks may be given in any order. Chunks are required unless
 	otherwise specified.