Message ID | 64a260e0c6a116b7c6fa6fea2b9fd96bf416cb18.1624314293.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | multi-pack reachability bitmaps | expand |
On Mon, Jun 21, 2021 at 06:25:10PM -0400, Taylor Blau wrote: > +An object is uniquely described by its bit position within a bitmap: > + > + - If the bitmap belongs to a packfile, the __n__th bit corresponds to > + the __n__th object in pack order. For a function `offset` which maps > + objects to their byte offset within a pack, pack order is defined as > + follows: > + > + o1 <= o2 <==> offset(o1) <= offset(o2) > + > + - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the > + __n__th object in MIDX order. With an additional function `pack` which > + maps objects to the pack they were selected from by the MIDX, MIDX order > + is defined as follows: > + > + o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) > + > + The ordering between packs is done lexicographically by the pack name, > + with the exception of the preferred pack, which sorts ahead of all other > + packs. This doesn't render well as asciidoc (the final paragraph is taken as more of the code block). But that is a problem through the whole file. I think we should ignore it for now, and worry about asciidoc-ifying the whole thing later, if we choose to. > + The ordering between packs is done lexicographically by the pack name, > + with the exception of the preferred pack, which sorts ahead of all other > + packs. Hmm, I'm not sure if this "lexicographically" part is true. Really we're building on the midx .rev format here. And that says "defined by the MIDX's pack list" (though I can't offhand remember if that is lexicographic, or if it is in the reverse-mtime order). At any rate, should we just be referencing the rev documentation? > [...] The rest of the changes to the document seemed quite sensible to me. -Peff
On Wed, Jul 21, 2021 at 06:18:46AM -0400, Jeff King wrote: > On Mon, Jun 21, 2021 at 06:25:10PM -0400, Taylor Blau wrote: > > > +An object is uniquely described by its bit position within a bitmap: > > + > > + - If the bitmap belongs to a packfile, the __n__th bit corresponds to > > + the __n__th object in pack order. For a function `offset` which maps > > + objects to their byte offset within a pack, pack order is defined as > > + follows: > > + > > + o1 <= o2 <==> offset(o1) <= offset(o2) > > + > > + - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the > > + __n__th object in MIDX order. With an additional function `pack` which > > + maps objects to the pack they were selected from by the MIDX, MIDX order > > + is defined as follows: > > + > > + o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) > > + > > + The ordering between packs is done lexicographically by the pack name, > > + with the exception of the preferred pack, which sorts ahead of all other > > + packs. > > This doesn't render well as asciidoc (the final paragraph is taken as > more of the code block). But that is a problem through the whole file. I > think we should ignore it for now, and worry about asciidoc-ifying the > whole thing later, if we choose to. Agreed; let's ignore it for now. > > + The ordering between packs is done lexicographically by the pack name, > > + with the exception of the preferred pack, which sorts ahead of all other > > + packs. > > Hmm, I'm not sure if this "lexicographically" part is true. Really we're > building on the midx .rev format here. And that says "defined by the > MIDX's pack list" (though I can't offhand remember if that is > lexicographic, or if it is in the reverse-mtime order). > > At any rate, should we just be referencing the rev documentation? The packs are listed in lex order in the MIDX, but that is so we can binary search that list to determine whether a pack is included in the MIDX or not. I had to check, but we do use the lex order to resolve duplicate objects, too. See (at the tip of this branch): QSORT(ctx.info, ctx.nr, pack_info_compare); from within midx.c:write_midx_internal(). Here, ctx.info contains the list of packs, and pack_info_compare is a thin wrapper around strcmp()-ing the pack_name values of two packed_git structures. Arguably, you'd get better EWAH compression of the bits between packs if we sorted packs in reverse order according to their mtime. But I suspect that it doesn't matter much in practice, since the number of objects vastly outpaces the number of packs (but I haven't measured to be certain, so take that with a grain of salt). In any case, I think that you're right that adding too much detail hurts us here, so we should really be mentioning the MIDX's .rev-file documentation (unfortunately, we can't linkgit it, so mentioning it by name will have to suffice). I plan to reroll with something like this on top: diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index 25221c7ec8..04b3ec2178 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -26,9 +26,8 @@ An object is uniquely described by its bit position within a bitmap: o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) - The ordering between packs is done lexicographically by the pack name, - with the exception of the preferred pack, which sorts ahead of all other - packs. + The ordering between packs is done according to the MIDX's .rev file. + Notably, the preferred pack sorts ahead of all other packs. The on-disk representation (described below) of a bitmap is the same regardless of whether or not that bitmap belongs to a packfile or a MIDX. The only Thanks, Taylor
On Wed, Jul 21, 2021 at 01:53:40PM -0400, Taylor Blau wrote: > > > + The ordering between packs is done lexicographically by the pack name, > > > + with the exception of the preferred pack, which sorts ahead of all other > > > + packs. > > > > Hmm, I'm not sure if this "lexicographically" part is true. Really we're > > building on the midx .rev format here. And that says "defined by the > > MIDX's pack list" (though I can't offhand remember if that is > > lexicographic, or if it is in the reverse-mtime order). > > > > At any rate, should we just be referencing the rev documentation? > > The packs are listed in lex order in the MIDX, but that is so we can > binary search that list to determine whether a pack is included in the > MIDX or not. > > I had to check, but we do use the lex order to resolve duplicate > objects, too. See (at the tip of this branch): > > QSORT(ctx.info, ctx.nr, pack_info_compare); > > from within midx.c:write_midx_internal(). Here, ctx.info contains the > list of packs, and pack_info_compare is a thin wrapper around > strcmp()-ing the pack_name values of two packed_git structures. Ah, OK, thanks for checking. > Arguably, you'd get better EWAH compression of the bits between packs > if we sorted packs in reverse order according to their mtime. But I > suspect that it doesn't matter much in practice, since the number of > objects vastly outpaces the number of packs (but I haven't measured to > be certain, so take that with a grain of salt). Agreed, especially when the intended use is with geometric repacking to keep reasonable-sized packs. Either way, I think heuristics to optimize the pack ordering can easily come on top later. Let's keep this series focused on the fundamentals of having midx bitmaps at all. > In any case, I think that you're right that adding too much detail hurts > us here, so we should really be mentioning the MIDX's .rev-file > documentation (unfortunately, we can't linkgit it, so mentioning it by > name will have to suffice). I plan to reroll with something like this on > top: > > diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt > index 25221c7ec8..04b3ec2178 100644 > --- a/Documentation/technical/bitmap-format.txt > +++ b/Documentation/technical/bitmap-format.txt > @@ -26,9 +26,8 @@ An object is uniquely described by its bit position within a bitmap: > > o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) > > - The ordering between packs is done lexicographically by the pack name, > - with the exception of the preferred pack, which sorts ahead of all other > - packs. > + The ordering between packs is done according to the MIDX's .rev file. > + Notably, the preferred pack sorts ahead of all other packs. > > The on-disk representation (described below) of a bitmap is the same regardless > of whether or not that bitmap belongs to a packfile or a MIDX. The only Thanks, that looks much better. We can't linkgit, but we only build HTML for these. So just a link to pack-format.html would work, as they'd generally be found side-by-side in the filesystem. But since this doesn't even really render as asciidoc, I'm not sure I care either way. (Obviously we could also mention pack-format.txt by name, but it's probably already obvious-ish to a human that this is where you'd find information on the pack .rev format). -Peff
diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index f8c18a0f7a..25221c7ec8 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -1,6 +1,45 @@ GIT bitmap v1 format ==================== +== Pack and multi-pack bitmaps + +Bitmaps store reachability information about the set of objects in a packfile, +or a multi-pack index (MIDX). The former is defined obviously, and the latter is +defined as the union of objects in packs contained in the MIDX. + +A bitmap may belong to either one pack, or the repository's multi-pack index (if +it exists). A repository may have at most one bitmap. + +An object is uniquely described by its bit position within a bitmap: + + - If the bitmap belongs to a packfile, the __n__th bit corresponds to + the __n__th object in pack order. For a function `offset` which maps + objects to their byte offset within a pack, pack order is defined as + follows: + + o1 <= o2 <==> offset(o1) <= offset(o2) + + - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the + __n__th object in MIDX order. With an additional function `pack` which + maps objects to the pack they were selected from by the MIDX, MIDX order + is defined as follows: + + o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) + + The ordering between packs is done lexicographically by the pack name, + with the exception of the preferred pack, which sorts ahead of all other + packs. + +The on-disk representation (described below) of a bitmap is the same regardless +of whether or not that bitmap belongs to a packfile or a MIDX. The only +difference is the interpretation of the bits, which is described above. + +Certain bitmap extensions are supported (see: Appendix B). No extensions are +required for bitmaps corresponding to packfiles. For bitmaps that correspond to +MIDXs, both the bit-cache and rev-cache extensions are required. + +== On-disk format + - A header appears at the beginning: 4-byte signature: {'B', 'I', 'T', 'M'} @@ -14,17 +53,19 @@ GIT bitmap v1 format The following flags are supported: - BITMAP_OPT_FULL_DAG (0x1) REQUIRED - This flag must always be present. It implies that the bitmap - index has been generated for a packfile with full closure - (i.e. where every single object in the packfile can find - its parent links inside the same packfile). This is a - requirement for the bitmap index format, also present in JGit, - that greatly reduces the complexity of the implementation. + This flag must always be present. It implies that the + bitmap index has been generated for a packfile or + multi-pack index (MIDX) with full closure (i.e. where + every single object in the packfile/MIDX can find its + parent links inside the same packfile/MIDX). This is a + requirement for the bitmap index format, also present in + JGit, that greatly reduces the complexity of the + implementation. - BITMAP_OPT_HASH_CACHE (0x4) If present, the end of the bitmap file contains `N` 32-bit name-hash values, one per object in the - pack. The format and meaning of the name-hash is + pack/MIDX. The format and meaning of the name-hash is described below. 4-byte entry count (network byte order) @@ -33,7 +74,8 @@ GIT bitmap v1 format 20-byte checksum - The SHA1 checksum of the pack this bitmap index belongs to. + The SHA1 checksum of the pack/MIDX this bitmap index + belongs to. - 4 EWAH bitmaps that act as type indexes @@ -50,7 +92,7 @@ GIT bitmap v1 format - Tags In each bitmap, the `n`th bit is set to true if the `n`th object - in the packfile is of that type. + in the packfile or multi-pack index is of that type. The obvious consequence is that the OR of all 4 bitmaps will result in a full set (all bits set), and the AND of all 4 bitmaps will @@ -62,8 +104,9 @@ GIT bitmap v1 format Each entry contains the following: - 4-byte object position (network byte order) - The position **in the index for the packfile** where the - bitmap for this commit is found. + The position **in the index for the packfile or + multi-pack index** where the bitmap for this commit is + found. - 1-byte XOR-offset The xor offset used to compress this bitmap. For an entry @@ -146,10 +189,11 @@ Name-hash cache --------------- If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains -a cache of 32-bit values, one per object in the pack. The value at +a cache of 32-bit values, one per object in the pack/MIDX. The value at position `i` is the hash of the pathname at which the `i`th object -(counting in index order) in the pack can be found. This can be fed -into the delta heuristics to compare objects with similar pathnames. +(counting in index or multi-pack index order) in the pack/MIDX can be found. +This can be fed into the delta heuristics to compare objects with similar +pathnames. The hash algorithm used is: diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt index fb688976c4..1a73c3ee20 100644 --- a/Documentation/technical/multi-pack-index.txt +++ b/Documentation/technical/multi-pack-index.txt @@ -71,14 +71,10 @@ Future Work still reducing the number of binary searches required for object lookups. -- The reachability bitmap is currently paired directly with a single - packfile, using the pack-order as the object order to hopefully - compress the bitmaps well using run-length encoding. This could be - extended to pair a reachability bitmap with a multi-pack-index. If - the multi-pack-index is extended to store a "stable object order" +- If the multi-pack-index is extended to store a "stable object order" (a function Order(hash) = integer that is constant for a given hash, - even as the multi-pack-index is updated) then a reachability bitmap - could point to a multi-pack-index and be updated independently. + even as the multi-pack-index is updated) then MIDX bitmaps could be + updated independently of the MIDX. - Packfiles can be marked as "special" using empty files that share the initial name but replace ".pack" with ".keep" or ".promisor".
Update the technical documentation to describe the multi-pack bitmap format. This patch merely introduces the new format, and describes its high-level ideas. Git does not yet know how to read nor write these multi-pack variants, and so the subsequent patches will: - Introduce code to interpret multi-pack bitmaps, according to this document. - Then, introduce code to write multi-pack bitmaps from the 'git multi-pack-index write' sub-command. Finally, the implementation will gain tests in subsequent patches (as opposed to inline with the patch teaching Git how to write multi-pack bitmaps) to avoid a cyclic dependency. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- Documentation/technical/bitmap-format.txt | 72 ++++++++++++++++---- Documentation/technical/multi-pack-index.txt | 10 +-- 2 files changed, 61 insertions(+), 21 deletions(-)