Message ID | d118f1d45e6202925d4efd5435acdd08545bf132.1656249017.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | bitmap: integrate a lookup table extension to the bitmap format | expand |
On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > > The bitmap lookup table extension was documentated by an earlier s/documentated/documented/ > change, but Git does not yet knowhow to write that extension. s/knowhow/know how/ > +static int table_cmp(const void *_va, const void *_vb, void *commit_positions) > +{ > + int8_t result = 0; > + uint32_t *positions = (uint32_t *) commit_positions; nit: drop the space between the cast and commit_positions. > + uint32_t a = positions[*(uint32_t *)_va]; > + uint32_t b = positions[*(uint32_t *)_vb]; > + > + if (a > b) > + result = 1; > + else if (a < b) > + result = -1; > + else > + result = 0; > + > + return result; > +} Ok, here you are sorting by commit OID (indirectly by the order in the [multi-]pack-index). I suppose that I misunderstood in the previous patch, so that could use some more specific language, maybe. > +static void write_lookup_table(struct hashfile *f, > + uint64_t *offsets, > + uint32_t *commit_positions) > +{ > + uint32_t i; > + uint32_t *table, *table_inv; > + > + ALLOC_ARRAY(table, writer.selected_nr); > + ALLOC_ARRAY(table_inv, writer.selected_nr); > + > + for (i = 0; i < writer.selected_nr; i++) > + table[i] = i; > + > + QSORT_S(table, writer.selected_nr, table_cmp, commit_positions); At the end of this sort, table[j] = i means that the ith bitmap corresponds to the jth bitmapped commit in lex order of OIDs. > + for (i = 0; i < writer.selected_nr; i++) > + table_inv[table[i]] = i; And table_inv helps us discover that relationship (ith bitmap to jth commit by j = table_inv[i]). > + for (i = 0; i < writer.selected_nr; i++) { > + struct bitmapped_commit *selected = &writer.selected[table[i]]; > + uint32_t xor_offset = selected->xor_offset; Here, xor_offset is "number of bitmaps in relationship to the current bitmap" > + hashwrite_be32(f, commit_positions[table[i]]); > + hashwrite_be64(f, offsets[table[i]]); > + hashwrite_be32(f, xor_offset ? > + table_inv[table[i] - xor_offset]: 0xffffffff); Which means that if "k = table[i] - xor_offset" that the xor base is the kth bitmap. table_inv[k] gets us the position in this table of that bitmap's commit. (It's also strange to me that the offset is being _subtracted_, but I guess the bitmap format requires the xor base to appear first so the offset does not need to be a negative number ever.) This last line is a bit complex. uint32_t xor_offset = selected->xor_offset; uint32_t xor_row = 0xffffffff; if (xor_offset) { uint32_t xor_order = table[i] - xor_offset; xor_row = table_inf[xor_order]; } ...then we can "hashwrite_be32(f, xor_row);" when necessary. I'm not sure that we need the "uint32_t xor_order" inside the "if (xor_offset)" block, but splitting it helps add clarity to the multi-step computation. > enum pack_bitmap_opts { > - BITMAP_OPT_FULL_DAG = 1, > - BITMAP_OPT_HASH_CACHE = 4, > + BITMAP_OPT_FULL_DAG = 0x1, > + BITMAP_OPT_HASH_CACHE = 0x4, > + BITMAP_OPT_LOOKUP_TABLE = 0x10, > }; Excellent. Thanks, -Stolee
On Sun, Jun 26, 2022 at 01:10:13PM +0000, Abhradeep Chakraborty via GitGitGadget wrote: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > > The bitmap lookup table extension was documentated by an earlier > change, but Git does not yet knowhow to write that extension. > > Teach git to write bitmap lookup table extension. The table contains > the list of `N` <commit pos, offset, xor offset>` triplets. These > triplets are sorted according to their commit pos (ascending order). > The meaning of each data in the i'th triplet is given below: > > - Commit pos is the position of the commit in the pack-index > (or midx) to which the i'th bitmap belongs. It is a 4 byte > network byte order integer. > > - offset is the position of the i'th bitmap. > > - xor offset denotes the position of the triplet with whose > bitmap the current triplet's bitmap need to xor with. > > Co-authored-by: Taylor Blau <me@ttaylorr.com> > Mentored-by: Taylor Blau <me@ttaylorr.com> > Co-mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com> > Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > --- > pack-bitmap-write.c | 72 +++++++++++++++++++++++++++++++++++++++++++-- > pack-bitmap.h | 5 ++-- > 2 files changed, 73 insertions(+), 4 deletions(-) > > diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c > index c43375bd344..899a4a941e1 100644 > --- a/pack-bitmap-write.c > +++ b/pack-bitmap-write.c > @@ -650,7 +650,9 @@ static const struct object_id *oid_access(size_t pos, const void *table) > > static void write_selected_commits_v1(struct hashfile *f, > struct pack_idx_entry **index, > - uint32_t index_nr) > + uint32_t index_nr, > + uint64_t *offsets, We should probably leave this as a pointer to an off_t, since that is a more appropriate type for keeping track of an offset within a file (and indeed it is the return type of hashfile_total()). But since it's platform-dependent, we should make sure to cast it to a uint64_t before writing it as part of the lookup table. > + uint32_t *commit_positions) > { > int i; > > @@ -663,6 +665,11 @@ static void write_selected_commits_v1(struct hashfile *f, > if (commit_pos < 0) > BUG("trying to write commit not in index"); > > + if (offsets) > + offsets[i] = hashfile_total(f); This makes sense to store here, since we can't easily recover this information later on. > + if (commit_positions) > + commit_positions[i] = commit_pos; This one I'm not as sure about. It would be nice to not have write_selected_commits_v1() be responsible for writing this down, too. And I think it's easy enough to recover later on, since we're just doing a search over "index" (see above the "oid_pos" call). I think that oid_pos() call could be hidden behind a function that takes an object_id pointer, an index (double pointer) of pack_idx_entry structs, and a length. Its implementation would be something like: static int commit_bitmap_writer_pos(struct object_id *oid, struct pack_idx_entry **index, uint32_t index_nr) { return oid_pos(oid, index, index_nr, oid_access); } and then we could replace any calls like commit_positions[i] with one that first takes `i` to the appropriate object_id in selected commit order. That would be strictly less efficient, but not in a way that I think matters, and it would definitely be cleaner to not rely on a side-effect of write_selected_commits_v1(). Something in the middle there would be to have write_lookup_table() assemble that list of commit_positions itself, something like: uint32_t *commit_positions; ALLOC_ARRAY(commit_positions, writer.selected_nr); for (i = 0; i < writer.selected_nr; i++) { int pos = oid_pos(&writer.selected[i].commit->object.oid, index, index_nr); if (pos < 0) BUG("trying to write commit not in index"); commit_positions[i] = pos; } ... free(commit_positions); That at least removes a side-effect from the implementation of write_selected_commits_v1() and brings the creation of the commit_positions array closer to where it's being used, while still maintaining the constant-time lookups. So that may be a good alternative, but I'm curious of your thoughts. > +static int table_cmp(const void *_va, const void *_vb, void *commit_positions) OK, so this is sorting the table in order of the commit positions. I would rename the commit_positions parameter to something like "void *_data", and then have commit_positions be the result of the cast, like "uint32_t *commit_positions = _data"; > +{ > + int8_t result = 0; int8_t isn't an often used type in Git's codebase, but we can get rid of this variable altogether and just return immediately from each case, e.g.: if (a < b) return -1; else if (a > b) return 1; return 0; or similar. > + uint32_t *positions = (uint32_t *) commit_positions; Explicit cast isn't need here since you're going up from void*. > +static void write_lookup_table(struct hashfile *f, > + uint64_t *offsets, > + uint32_t *commit_positions) > +{ > + uint32_t i; > + uint32_t *table, *table_inv; > + > + ALLOC_ARRAY(table, writer.selected_nr); > + ALLOC_ARRAY(table_inv, writer.selected_nr); > + > + for (i = 0; i < writer.selected_nr; i++) > + table[i] = i; > + > + QSORT_S(table, writer.selected_nr, table_cmp, commit_positions); I think the construction of table and table_inv could definitely benefit from a comment here indicating what they're used for and what they contain (e.g., "table maps abc to xyz"). > + for (i = 0; i < writer.selected_nr; i++) > + table_inv[table[i]] = i; > + > + for (i = 0; i < writer.selected_nr; i++) { > + struct bitmapped_commit *selected = &writer.selected[table[i]]; > + uint32_t xor_offset = selected->xor_offset; > + > + hashwrite_be32(f, commit_positions[table[i]]); > + hashwrite_be64(f, offsets[table[i]]); > + hashwrite_be32(f, xor_offset ? > + table_inv[table[i] - xor_offset]: 0xffffffff); Nit: missing space before ':'. Thanks, Taylor
On Mon, Jun 27, 2022 at 10:35:25AM -0400, Derrick Stolee wrote: > On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote: > > > + uint32_t a = positions[*(uint32_t *)_va]; > > + uint32_t b = positions[*(uint32_t *)_vb]; > > + > > + if (a > b) > > + result = 1; > > + else if (a < b) > > + result = -1; > > + else > > + result = 0; > > + > > + return result; > > +} > > Ok, here you are sorting by commit OID (indirectly by the order in the > [multi-]pack-index). I suppose that I misunderstood in the previous > patch, so that could use some more specific language, maybe. Yeah, I agree that some more specific language could be used, with the main idea being there that we make it clearer that the list of tuples is still sorted (and can be binary searched). > > + for (i = 0; i < writer.selected_nr; i++) > > + table[i] = i; > > + > > + QSORT_S(table, writer.selected_nr, table_cmp, commit_positions); > > At the end of this sort, table[j] = i means that the ith bitmap corresponds > to the jth bitmapped commit in lex order of OIDs. > > > + for (i = 0; i < writer.selected_nr; i++) > > + table_inv[table[i]] = i; > > And table_inv helps us discover that relationship (ith bitmap to jth commit > by j = table_inv[i]). These are both great descriptions and should give an idea of what sort of information is worth putting into a comment. > > > + for (i = 0; i < writer.selected_nr; i++) { > > + struct bitmapped_commit *selected = &writer.selected[table[i]]; > > + uint32_t xor_offset = selected->xor_offset; > > Here, xor_offset is "number of bitmaps in relationship to the current bitmap" It's an offset to an earlier commit which must be used to XOR-decompress the current one (if any). > > + hashwrite_be32(f, commit_positions[table[i]]); > > + hashwrite_be64(f, offsets[table[i]]); > > + hashwrite_be32(f, xor_offset ? > > + table_inv[table[i] - xor_offset]: 0xffffffff); > > Which means that if "k = table[i] - xor_offset" that the xor base is the kth > bitmap. table_inv[k] gets us the position in this table of that bitmap's > commit. Yes, exactly. Abhradeep: this is also worth commenting ;-). > (It's also strange to me that the offset is being _subtracted_, but I guess > the bitmap format requires the xor base to appear first so the offset does > not need to be a negative number ever.) You're right, this follows from the fact that the XOR bases must come before the commits who must use them to decompress themselves. From Documentation/technical/bitmap-format.txt: This number is always positive, and hence entries are always xor'ed with **previous** bitmaps, not bitmaps that will come afterwards in the index. > This last line is a bit complex. > > uint32_t xor_offset = selected->xor_offset; > uint32_t xor_row = 0xffffffff; > > if (xor_offset) { > uint32_t xor_order = table[i] - xor_offset; > xor_row = table_inf[xor_order]; > } > > ...then we can "hashwrite_be32(f, xor_row);" when necessary. I'm not sure > that we need the "uint32_t xor_order" inside the "if (xor_offset)" block, > but splitting it helps add clarity to the multi-step computation. I had the same thought, though I would also say that xor_row should be declared, not initialized, and the "else" block of "if (xor_offset)" should set it to 0xffffffff to make the relationship between xor_offset and the value written a little clearer. Thanks, Taylor
Derrick Stolee <derrickstolee@github.com> wrote: > Which means that if "k = table[i] - xor_offset" that the xor base is the kth > bitmap. table_inv[k] gets us the position in this table of that bitmap's > commit. > > (It's also strange to me that the offset is being _subtracted_, but I guess > the bitmap format requires the xor base to appear first so the offset does > not need to be a negative number ever.) > > This last line is a bit complex. > > uint32_t xor_offset = selected->xor_offset; > uint32_t xor_row = 0xffffffff; > > if (xor_offset) { > uint32_t xor_order = table[i] - xor_offset; > xor_row = table_inf[xor_order]; > } > > ...then we can "hashwrite_be32(f, xor_row);" when necessary. I'm not sure > that we need the "uint32_t xor_order" inside the "if (xor_offset)" block, > but splitting it helps add clarity to the multi-step computation. Got it. Will add comments too. Thanks :)
Taylor Blau <me@ttaylorr.com> wrote: > We should probably leave this as a pointer to an off_t, since that is a > more appropriate type for keeping track of an offset within a file (and > indeed it is the return type of hashfile_total()). > > But since it's platform-dependent, we should make sure to cast it to a > uint64_t before writing it as part of the lookup table. Hmm, will make the necessary changes. > That at least removes a side-effect from the implementation of > write_selected_commits_v1() and brings the creation of the > commit_positions array closer to where it's being used, while still > maintaining the constant-time lookups. So that may be a good > alternative, but I'm curious of your thoughts. Sounds good to me :) > I think the construction of table and table_inv could definitely benefit > from a comment here indicating what they're used for and what they > contain (e.g., "table maps abc to xyz"). Yeah, true. Will add comments. Thanks for the other suggestions also :)
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index c43375bd344..899a4a941e1 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -650,7 +650,9 @@ static const struct object_id *oid_access(size_t pos, const void *table) static void write_selected_commits_v1(struct hashfile *f, struct pack_idx_entry **index, - uint32_t index_nr) + uint32_t index_nr, + uint64_t *offsets, + uint32_t *commit_positions) { int i; @@ -663,6 +665,11 @@ static void write_selected_commits_v1(struct hashfile *f, if (commit_pos < 0) BUG("trying to write commit not in index"); + if (offsets) + offsets[i] = hashfile_total(f); + if (commit_positions) + commit_positions[i] = commit_pos; + hashwrite_be32(f, commit_pos); hashwrite_u8(f, stored->xor_offset); hashwrite_u8(f, stored->flags); @@ -671,6 +678,55 @@ static void write_selected_commits_v1(struct hashfile *f, } } +static int table_cmp(const void *_va, const void *_vb, void *commit_positions) +{ + int8_t result = 0; + uint32_t *positions = (uint32_t *) commit_positions; + uint32_t a = positions[*(uint32_t *)_va]; + uint32_t b = positions[*(uint32_t *)_vb]; + + if (a > b) + result = 1; + else if (a < b) + result = -1; + else + result = 0; + + return result; +} + +static void write_lookup_table(struct hashfile *f, + uint64_t *offsets, + uint32_t *commit_positions) +{ + uint32_t i; + uint32_t *table, *table_inv; + + ALLOC_ARRAY(table, writer.selected_nr); + ALLOC_ARRAY(table_inv, writer.selected_nr); + + for (i = 0; i < writer.selected_nr; i++) + table[i] = i; + + QSORT_S(table, writer.selected_nr, table_cmp, commit_positions); + + for (i = 0; i < writer.selected_nr; i++) + table_inv[table[i]] = i; + + for (i = 0; i < writer.selected_nr; i++) { + struct bitmapped_commit *selected = &writer.selected[table[i]]; + uint32_t xor_offset = selected->xor_offset; + + hashwrite_be32(f, commit_positions[table[i]]); + hashwrite_be64(f, offsets[table[i]]); + hashwrite_be32(f, xor_offset ? + table_inv[table[i] - xor_offset]: 0xffffffff); + } + + free(table); + free(table_inv); +} + static void write_hash_cache(struct hashfile *f, struct pack_idx_entry **index, uint32_t index_nr) @@ -695,6 +751,8 @@ void bitmap_writer_finish(struct pack_idx_entry **index, { static uint16_t default_version = 1; static uint16_t flags = BITMAP_OPT_FULL_DAG; + uint64_t *offsets = NULL; + uint32_t *commit_positions = NULL; struct strbuf tmp_file = STRBUF_INIT; struct hashfile *f; @@ -715,8 +773,16 @@ void bitmap_writer_finish(struct pack_idx_entry **index, dump_bitmap(f, writer.trees); dump_bitmap(f, writer.blobs); dump_bitmap(f, writer.tags); - write_selected_commits_v1(f, index, index_nr); + if (options & BITMAP_OPT_LOOKUP_TABLE) { + CALLOC_ARRAY(offsets, index_nr); + CALLOC_ARRAY(commit_positions, index_nr); + } + + write_selected_commits_v1(f, index, index_nr, offsets, commit_positions); + + if (options & BITMAP_OPT_LOOKUP_TABLE) + write_lookup_table(f, offsets, commit_positions); if (options & BITMAP_OPT_HASH_CACHE) write_hash_cache(f, index, index_nr); @@ -730,4 +796,6 @@ void bitmap_writer_finish(struct pack_idx_entry **index, die_errno("unable to rename temporary bitmap file to '%s'", filename); strbuf_release(&tmp_file); + free(offsets); + free(commit_positions); } diff --git a/pack-bitmap.h b/pack-bitmap.h index 3d3ddd77345..67a9d0fc303 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -24,8 +24,9 @@ struct bitmap_disk_header { #define NEEDS_BITMAP (1u<<22) enum pack_bitmap_opts { - BITMAP_OPT_FULL_DAG = 1, - BITMAP_OPT_HASH_CACHE = 4, + BITMAP_OPT_FULL_DAG = 0x1, + BITMAP_OPT_HASH_CACHE = 0x4, + BITMAP_OPT_LOOKUP_TABLE = 0x10, }; enum pack_bitmap_flags {