Message ID | 8f3985ab5df3e4abc6de6db7f71f1adcbc16b4a8.1611759716.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Series | Refactor chunk-format into an API |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > +Chunk-based file formats > +======================== > + > +Some file formats in Git use a common concept of "chunks" to describe > +sections of the file. This allows structured access to a large file by > +scanning a small "table of contents" for the remaining data. This common > +format is used by the `commit-graph` and `multi-pack-index` files. See > +link:technical/pack-format.html[the `multi-pack-index` format] and > +link:technical/commit-graph-format.html[the `commit-graph` format] for > +how they use the chunks to describe structured data. > + > +A chunk-based file format begins with some header information custom to > +that format. That header should include enough information to identify > +the file type, format version, and number of chunks in the file. From this > +information, that file can determine the start of the chunk-based region. > + > +The chunk-based region starts with a table of contents describing where > +each chunk starts and ends. This consists of (C+1) rows of 12 bytes each, > +where C is the number of chunks. Consider the following table: > + > + | Chunk ID (4 bytes) | Chunk Offset (8 bytes) | > + |--------------------|------------------------| > + | ID[0] | OFFSET[0] | > + | ... | ... | > + | ID[C] | OFFSET[C] | > + | 0x0000 | OFFSET[C+1] | > + > +Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset. > +Each integer is stored in network-byte order. > + > +The chunk identifier `ID[i]` is a label for the data stored within this > +fill from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the > +size of the `i`th chunk is equal to the difference between `OFFSET[i+1]` > +and `OFFSET[i]`. This requires that the chunk data appears contiguously > +in the same order as the table of contents. > + > +The final entry in the table of contents must be four zero bytes. 
This > +confirms that the table of contents is ending and provides the offset for > +the end of the chunk-based data. > + > +Note: The chunk-based format expects that the file contains _at least_ a > +trailing hash after `OFFSET[C+1]`. I think the above describes what I saw in the writing side of the code quite clearly and very well. I misread that the OFFSET[C+1] was pointing elsewhere in my review of [2/17] somehow, but the code is clear that it points at the end of the last chunk from the code, and the above documents it well. My comments on the need to document the reading side API, on what the read_chunk callback should be able to assume (namely, the whole thing stays in memory until the caller that decided to use chunkfile API decides to discard it), still stands, I would think. Thanks.
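
For readers following the reading-side discussion, here is a minimal sketch of how a table of contents laid out as described could be walked. It is not the chunkfile API from `chunk-format.h`: `parse_toc`, `ROW`, the 8-byte header, and the chunk IDs are hypothetical names and values chosen for illustration. It assumes the "(C+1) rows" reading of the text, i.e. C chunk rows followed by one terminating row with a zero ID, and that the whole file is already in memory.

```python
import struct

# One table-of-contents row: 4-byte chunk ID + 8-byte offset, network byte order.
ROW = struct.Struct(">IQ")

def parse_toc(buf, toc_start, num_chunks):
    """Return [(chunk_id, start, end), ...] from an in-memory file image.

    Hypothetical helper: reads num_chunks rows plus one terminating row.
    """
    rows = [ROW.unpack_from(buf, toc_start + i * ROW.size)
            for i in range(num_chunks + 1)]
    if rows[-1][0] != 0:
        raise ValueError("table of contents lacks a terminating zero ID")
    chunks = []
    # Each chunk ends where the next row's offset begins.
    for (chunk_id, start), (_, end) in zip(rows, rows[1:]):
        if end < start or end > len(buf):
            raise ValueError("chunk offsets decrease or exceed the file")
        chunks.append((chunk_id, start, end))
    return chunks

# Made-up file image: 8-byte header, 3-row table of contents, two chunks,
# then 20 bytes standing in for the trailing hash.
header = b"HDR\x01\x00\x00\x00\x02"
toc = ROW.pack(0x41414141, 44) + ROW.pack(0x42424242, 49) + ROW.pack(0, 52)
data = header + toc + b"hello" + b"abc" + b"\x00" * 20
chunks = parse_toc(data, len(header), 2)
```

Because the returned offsets index into `buf`, a caller using them must keep the buffer alive for as long as it uses the chunk data, which is the assumption the read-side documentation would need to spell out.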
diff --git a/Documentation/technical/chunk-format.txt b/Documentation/technical/chunk-format.txt
new file mode 100644
index 00000000000..3db3792dea2
--- /dev/null
+++ b/Documentation/technical/chunk-format.txt
@@ -0,0 +1,54 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, that file can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C]              | OFFSET[C]              |
+  | 0x0000             | OFFSET[C+1]            |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network-byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+fill from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must be four zero bytes. This
+confirms that the table of contents is ending and provides the offset for
+the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C+1]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these methods provide extra checks that assist
+developers when creating new file formats, including:
+
+ 1. Writing and reading the table of contents.
+
+ 2. Verifying that the data written in a chunk matches the expected size
+    that was recorded in the table of contents.
+
+ 3. Checking that a table of contents describes offsets properly within
+    the file boundaries.
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b6658eff188..87971c27dd7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -61,6 +61,9 @@ CHUNK LOOKUP:
       the length using the next chunk position if necessary.) Each chunk
       ID appears at most once.
 
+  The CHUNK LOOKUP matches the table of contents from
+  link:technical/chunk-format.html[the chunk-based file format].
+
   The remaining data in the body is described one chunk at a time, and
   these chunks may be given in any order. Chunks are required unless
   otherwise specified.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f3..2fb1e60d29e 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -301,6 +301,9 @@ CHUNK LOOKUP:
 	    (Chunks are provided in file-order, so you can infer the length
 	    using the next chunk position if necessary.)
 
+	The CHUNK LOOKUP matches the table of contents from
+	link:technical/chunk-format.html[the chunk-based file format].
+
 	The remaining data in the body is described one chunk at a time, and
 	these chunks may be given in any order. Chunks are required unless
 	otherwise specified.
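
The first two checks listed in the new document (writing the table of contents, and verifying that each chunk's written size matches the size recorded there) can be illustrated with a rough sketch. This is an assumption-laden illustration, not the functions declared in `chunk-format.h`: `write_chunks`, the `(chunk_id, expected_size, write_fn)` tuples, and the 8-byte header are hypothetical, and the layout assumes C chunk rows followed by one terminating row with a zero ID.

```python
import io
import struct

def write_chunks(out, chunks):
    """Write a table of contents, then the chunk data, to a binary stream.

    Hypothetical helper. chunks is a list of
    (chunk_id, expected_size, write_fn) tuples; out must support tell().
    """
    # Chunk data begins right after the (C+1)-row table of contents.
    offset = out.tell() + (len(chunks) + 1) * 12
    for chunk_id, size, _ in chunks:
        out.write(struct.pack(">IQ", chunk_id, offset))
        offset += size
    # Terminating row: zero ID plus the offset where the chunk data ends.
    out.write(struct.pack(">IQ", 0, offset))

    for chunk_id, size, write_fn in chunks:
        start = out.tell()
        write_fn(out)
        written = out.tell() - start
        if written != size:
            raise ValueError(
                f"chunk {chunk_id:#010x}: expected {size} bytes, wrote {written}")

# Made-up example: an 8-byte header followed by two small chunks.
out = io.BytesIO()
out.write(b"HDR\x01\x00\x00\x00\x02")
write_chunks(out, [
    (0x41414141, 5, lambda f: f.write(b"hello")),
    (0x42424242, 3, lambda f: f.write(b"abc")),
])
blob = out.getvalue()
```

Writing every offset up front is only possible because each chunk's size is declared before any data is written, which is why a mismatch between the declared and written sizes has to be treated as an error.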