Message ID | 2e22ca5069af617fe23072d78efb08b26d6130be.1655728395.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | bitmap: integrate a lookup table extension to the bitmap format | expand |
On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > > When reading bitmap file, git loads each and every bitmap one by one > even if all the bitmaps are not required. A "bitmap lookup table" > extension to the bitmap format can reduce the overhead of loading > bitmaps which stores a list of bitmapped commit oids, along with their > offset and xor offset. This way git can load only the neccesary bitmaps > without loading the previous bitmaps. > > Add some information for the new "bitmap lookup table" extension in the > bitmap-format documentation. > @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required. > pack/MIDX. The format and meaning of the name-hash is > described below. > > + ** {empty} > + BITMAP_OPT_LOOKUP_TABLE (0xf) : ::: > + If present, the end of the bitmap file contains a table > + containing a list of `N` object ids, a list of pairs of > + offset and xor offset of respective objects, and 4-byte > + integer denoting the flags (currently none). The format > + and meaning of the table is described below. > + Here, you are adding a new flag that indicates that the end of the file contains this extra extension. This works because the size of the extension is predictable. As long as any future extensions are also of a predictable size, then we can continue adding them via flags in this way. This is better than updating the full file format to do something like like use the chunk format API, especially because this format is shared across other tools (JGit being mentioned frequently). It might be worth mentioning in your commit message what happens when an older version of Git (or JGit) notices this flag. Does it refuse to operate on the .bitmap file? Does it give a warning or die? It would be nice if this extension could be ignored (it seems like adding the extra data at the end does not stop the bitmap data from being understood). > + > +Commit lookup table > +------------------- > + > +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap` > +contains a lookup table specifying the positions of commits which have a > +bitmap. Perhaps it would be better to say "the last N * (HASH_LEN + 8) + 4 bytes preceding the trailing hash" or something? This gives us a concrete way to compute the start of the table, while also being clear that the table is included in the trailing hash. > +For a `.bitmap` containing `nr_entries` reachability bitmaps, the format > +is as follows: > + > + - `nr_entries` object names. Could you expand that these objects are commit OIDs, one for each bitmap in the file. Are they sorted in lexicographical order for binary search, or are we expecting to read the entire table into a hashtable in-memory? > + - `nr_entries` pairs of 4-byte integers, each in network order. > + The first holds the offset from which that commit's bitmap can > + be read. The second number holds the position of the commit > + whose bitmap the current bitmap is xor'd with in lexicographic > + order, or 0xffffffff if the current commit is not xor'd with > + anything. Interesting to give the xor chains directions here. You say "position" here for the second commit: do you mean within the list of object names as opposed to the offset? That would make the most sense so we can trace the full list of XORs we need to make all at once. Are .bitmap files already constrained to 4GB, so these 32-bit offsets make sense? Using 64-bit offsets would be a small cost here, I think, without needing to do any fancy "overflow" tables that could introduce a variable-length extension. > + - One 4-byte network byte order integer specifying > + table-specific flags. None exist currently, so this is always > + "0". I'm guessing this is at the end of the extension because a future flag could modify the length of the extension, so we need the flags to be in a predictable location. Could we make that clear somewhere? How does Git react to seeing flags here that it does not recognize? It seems that Git should ignore the lookup table but continue using the rest of the .bitmap file as it did before, yes? Thanks, -Stolee
On Mon, Jun 20, 2022 at 12:56:27PM -0400, Derrick Stolee wrote: > On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote: > > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > > > > When reading bitmap file, git loads each and every bitmap one by one > > even if all the bitmaps are not required. A "bitmap lookup table" > > extension to the bitmap format can reduce the overhead of loading > > bitmaps which stores a list of bitmapped commit oids, along with their > > offset and xor offset. This way git can load only the neccesary bitmaps > > without loading the previous bitmaps. > > > > Add some information for the new "bitmap lookup table" extension in the > > bitmap-format documentation. > > > > @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required. > > pack/MIDX. The format and meaning of the name-hash is > > described below. > > > > + ** {empty} > > + BITMAP_OPT_LOOKUP_TABLE (0xf) : ::: > > + If present, the end of the bitmap file contains a table > > + containing a list of `N` object ids, a list of pairs of > > + offset and xor offset of respective objects, and 4-byte > > + integer denoting the flags (currently none). The format > > + and meaning of the table is described below. > > + > > Here, you are adding a new flag that indicates that the end of the file > contains this extra extension. This works because the size of the > extension is predictable. As long as any future extensions are also of > a predictable size, then we can continue adding them via flags in this > way. Right; any extensions that are added to the existing .bitmap format must have a size that is predictable in order for readers to locate the next extension, if any. > This is better than updating the full file format to do something like > like use the chunk format API, especially because this format is shared > across other tools (JGit being mentioned frequently). Agreed. Abhradeep and I discussed whether or not it was worth exploring a new .bitmap format, and the consensus we reached was that it may be required in the future (if we explored a compression scheme other than EWAH or made some other backwards-incompatible change), but as of yet it isn't necessary. So we avoided it to eliminate unnecessary churn, especially of on-disk formats. > It might be worth mentioning in your commit message what happens when an > older version of Git (or JGit) notices this flag. Does it refuse to > operate on the .bitmap file? Does it give a warning or die? It would be > nice if this extension could be ignored (it seems like adding the extra > data at the end does not stop the bitmap data from being understood). I agree. The bitmap reader does not warn or die when it sees unrecognized extensions, that way new extensions can be added without rendering all previously-written bitmaps useless. But in order to understand an extension on bit N, the reader must also understand extensions N-1, N-2, and so on (in order to locate the end of extension N). > > + - `nr_entries` pairs of 4-byte integers, each in network order. > > + The first holds the offset from which that commit's bitmap can > > + be read. The second number holds the position of the commit > > + whose bitmap the current bitmap is xor'd with in lexicographic > > + order, or 0xffffffff if the current commit is not xor'd with > > + anything. > > Interesting to give the xor chains directions here. You say "position" > here for the second commit: do you mean within the list of object names > as opposed to the offset? That would make the most sense so we can trace > the full list of XORs we need to make all at once. > > Are .bitmap files already constrained to 4GB, so these 32-bit offsets > make sense? Using 64-bit offsets would be a small cost here, I think, > without needing to do any fancy "overflow" tables that could introduce > a variable-length extension. Yeah, we should support >4GB bitmaps here. An overflow table could work, but I agree with Stolee that in practice it won't matter. Most .bitmap files that I've looked at in the wild have around ~500 entries at most, and are usually small. So the cost of widening this section isn't a big deal. But note that the entry count is only one component of the bitmap size: the individual entry lengths obviously matter too. And in repositories whose bitmaps exceed 500 entries, the entries themselves are often several million bits long (before compression) already. So it is certainly possible to exceed 4GB without having an astronomical entry count. So doubling the width of this extension might add an extra 250 KiB or so, which is negligible. I would much rather see us do that in cases where it makes sense (small number of entries, minimal cost to wider records, etc.) than adding unnecessary complexity via an extra lookup table for >4GB offsets. > > + - One 4-byte network byte order integer specifying > > + table-specific flags. None exist currently, so this is always > > + "0". > > I'm guessing this is at the end of the extension because a future flag > could modify the length of the extension, so we need the flags to be > in a predictable location. Could we make that clear somewhere? I can't remember what I had on my mind when I wrote this ;-). Abhradeep -- do you have any thoughts about what this might be used for? I'll try to remember it myself, but I imagine that we could just as easily remove this altogether and avoid the confusion. > How does Git react to seeing flags here that it does not recognize? > It seems that Git should ignore the lookup table but continue using the > rest of the .bitmap file as it did before, yes? (See above). Thanks, Taylor
On Mon, Jun 20, 2022 at 12:33:09PM +0000, Abhradeep Chakraborty via GitGitGadget wrote: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > > When reading bitmap file, git loads each and every bitmap one by one > even if all the bitmaps are not required. A "bitmap lookup table" > extension to the bitmap format can reduce the overhead of loading > bitmaps which stores a list of bitmapped commit oids, along with their > offset and xor offset. This way git can load only the neccesary bitmaps > without loading the previous bitmaps. Well put. It might help to have a concrete example of where we expect this to help and not help. I suspect that some of this will show up in your work updating the perf suite to use this new table, but I imagine that we'll find something like: In cases where the result can be read or computed without significant additional traversal (e.g., all commits of interest already have bitmaps computed), we can save some time loading and parsing a majority of the bitmap file that we will never read. But in cases where the bitmaps are out-of-date, or there is significant traversal required to go from the reference tips to what's contained in the .bitmap file, this table provides minimal benefit (or something). Of course, you should verify that that is actually true before we insert it into the commit message as such ;-). But that sort of information may help readers understand what the purpose of this change is towards the beinning of the series. > Add some information for the new "bitmap lookup table" extension in the > bitmap-format documentation. > > Co-Authored-by: Taylor Blau <ttaylorr@github.com> > Mentored-by: Taylor Blau <ttaylorr@github.com> Here and elsewhere: I typically use my <me@ttaylorr.com> address when contributing to Git. So any trailers that mention my email or commits that you send on my behalf should use that address, too. > Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com> > Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > --- > Documentation/technical/bitmap-format.txt | 31 +++++++++++++++++++++++ > 1 file changed, 31 insertions(+) > > diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt > index 04b3ec21785..34e98787b78 100644 > --- a/Documentation/technical/bitmap-format.txt > +++ b/Documentation/technical/bitmap-format.txt > @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required. > pack/MIDX. The format and meaning of the name-hash is > described below. > > + ** {empty} > + BITMAP_OPT_LOOKUP_TABLE (0xf) : ::: It the space between "(0xf)" and the first ":" intentional? Similarly, should there be two or three colons at the end (either "::" or ":::")? > + If present, the end of the bitmap file contains a table > + containing a list of `N` object ids, a list of pairs of > + offset and xor offset of respective objects, and 4-byte > + integer denoting the flags (currently none). The format > + and meaning of the table is described below. > + I remember we had a brief off-list discussion about whether we should store the full object IDs in the offset table, or whether we could store their pack- or index-relative ordering. Is there a reason to prefer one or the other? I don't think we need to explain the choice fully in the documentation in this patch, but it may be worth thinking about separately nonetheless. We can store either order and convert it to an object ID in constant time. To figure out which is best, I would recommend trying a few different choices here and seeing how they do or don't impact your performance testing. > 4-byte entry count (network byte order) > > The total count of entries (bitmapped commits) in this bitmap index. > @@ -205,3 +213,26 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. > If implementations want to choose a different hashing scheme, they are > free to do so, but MUST allocate a new header flag (because comparing > hashes made under two different schemes would be pointless). > + > +Commit lookup table > +------------------- > + > +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap` > +contains a lookup table specifying the positions of commits which have a > +bitmap. > + > +For a `.bitmap` containing `nr_entries` reachability bitmaps, the format > +is as follows: > + > + - `nr_entries` object names. > + > + - `nr_entries` pairs of 4-byte integers, each in network order. > + The first holds the offset from which that commit's bitmap can > + be read. The second number holds the position of the commit > + whose bitmap the current bitmap is xor'd with in lexicographic > + order, or 0xffffffff if the current commit is not xor'd with > + anything. A couple of small thoughts here. I wonder if we'd get better locality if we made each record look something like: (object_id, offset, xor_pos) Where object_id is either 20- or 4-bytes long (depending if we store the full object ID, or some 4-byte identifier that allows us to discover it), offset is 8 bytes long, and xor_pos is 4-bytes (since in practice we don't support packs or MIDXs which have more than 2^32-1 objects). In the event that this table doesn't fit into a single cache line, I think we'll get better performance out of reading it by not forcing the cache to evict itself whenever we need to refer back to the object_id. > + - One 4-byte network byte order integer specifying > + table-specific flags. None exist currently, so this is always > + "0". I mentioned in my reply to Stolee earlier, but I think that we should either (a) try to remember what this is for and document it, or (b) remove it. Thanks, Taylor
On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > + ** {empty} > + BITMAP_OPT_LOOKUP_TABLE (0xf) : ::: I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111). I noticed when looking at the constant in patch 2. Thanks, -Stolee
Derrick Stole <derrickstolee@github.com> wrote: > It might be worth mentioning in your commit message what happens when an > older version of Git (or JGit) notices this flag. Does it refuse to > operate on the .bitmap file? Does it give a warning or die? It would be > nice if this extension could be ignored (it seems like adding the extra > data at the end does not stop the bitmap data from being understood). No, it doesn't refuse to operate on the .bitmap file. It just ignores the extension. Will update the commit message. > Perhaps it would be better to say "the last N * (HASH_LEN + 8) + 4 bytes > preceding the trailing hash" or something? This gives us a concrete way > to compute the start of the table, while also being clear that the table > is included in the trailing hash. Hmm, well said. Will update it. > Could you expand that these objects are commit OIDs, one for each bitmap > in the file. Are they sorted in lexicographical order for binary search, > or are we expecting to read the entire table into a hashtable in-memory? Yeah, of course! They are sorted in lexicographical order for binary search. > Interesting to give the xor chains directions here. You say "position" > here for the second commit: do you mean within the list of object names > as opposed to the offset? That would make the most sense so we can trace > the full list of XORs we need to make all at once. I think I blundered here. I forgot that the xor-offset is relative to the current bitmap. The current proposed code takes it as ABSOLUTE value and tries to find the commit on that position (in the list of commit ids). So, there are two faults in my code - (1) As the xor-offset have an upper limit (which is 10 probably; not sure), any of the first 10 commits is always selected. (2) As xor-offsets are relative to the current bitmap, it depends On the order of the bitmaps. These bitmaps are ordered by the date of their corresponding commit and commit ids in the lookup table are ordered lexicographically. So, we can't use that xor-offset to find the xor'd commit position. Will fix it. > Are .bitmap files already constrained to 4GB, so these 32-bit offsets > make sense? Using 64-bit offsets would be a small cost here, I think, > without needing to do any fancy "overflow" tables that could introduce > a variable-length extension. I think you're right. I should use 64-bit types here. > I'm guessing this is at the end of the extension because a future flag > could modify the length of the extension, so we need the flags to be > in a predictable location. Could we make that clear somewhere? Flags are at the end of this extension. Thanks :)
Taylor Blau <me@ttaylorr.com> wrote: > Abhradeep -- do you have any thoughts about what this might be used for? > I'll try to remember it myself, but I imagine that we could just as > easily remove this altogether and avoid the confusion. Honestly, I never understood the logic behind adding this flag option. I thought you have a reason to do that. Even I was thinking of curving it to 1 byte. I will remove it then. Thanks :)
Taylor Blau <me@ttaylorr.com> wrote: > In cases where the result can be read or computed without > significant additional traversal (e.g., all commits of interest > already have bitmaps computed), we can save some time loading and > parsing a majority of the bitmap file that we will never read. > > But in cases where the bitmaps are out-of-date, or there is > significant traversal required to go from the reference tips to > what's contained in the .bitmap file, this table provides minimal > benefit (or something). > > Of course, you should verify that that is actually true before we insert > it into the commit message as such ;-). But that sort of information may > help readers understand what the purpose of this change is towards the > beinning of the series. The performance tests cover tests for command like "git rev-list --count --objects --all", "simulated clone", "simulated fetch" etc. And I tested it with both the Git and Linux. In both cases, the average cost of "Without lookup table" is bigger than "with lookup table". The margin of difference is bigger for linux. Though, I need to fix the calculation of xor-offset (see my reply to derrick), the fix will not affect the performance too much. So, what you're saying is true. I think I didn't write the bitmap out-of-date test though. > Here and elsewhere: I typically use my <me@ttaylorr.com> address when > contributing to Git. So any trailers that mention my email or commits > that you send on my behalf should use that address, too. Ohh, sorry! Will fix it. > It the space between "(0xf)" and the first ":" intentional? Similarly, > should there be two or three colons at the end (either "::" or ":::")? Yes, it is intentional. My previous patch (formatting the bitmap-format.txt) uses nested description lists. ":::" means it is the level 3 description list. The space is required else asciidoc will assume that it is level 4 description list. > I remember we had a brief off-list discussion about whether we should > store the full object IDs in the offset table, or whether we could store > their pack- or index-relative ordering. Is there a reason to prefer one > or the other? > > I don't think we need to explain the choice fully in the documentation > in this patch, but it may be worth thinking about separately > nonetheless. We can store either order and convert it to an object ID in > constant time. > > To figure out which is best, I would recommend trying a few different > choices here and seeing how they do or don't impact your performance > testing. I think at that time I thought it would add extra cost of computing the actual commit ids from those index position. So, I didn't go further here. I still have a feeling that there is some way to get rid of this list of commit ids. But at the same time, I do not want to add extra computation to the code. > A couple of small thoughts here. I wonder if we'd get better locality if > we made each record look something like: > > (object_id, offset, xor_pos) > > Where object_id is either 20- or 4-bytes long (depending if we store the > full object ID, or some 4-byte identifier that allows us to discover > it), offset is 8 bytes long, and xor_pos is 4-bytes (since in practice > we don't support packs or MIDXs which have more than 2^32-1 objects). > > In the event that this table doesn't fit into a single cache line, I > think we'll get better performance out of reading it by not forcing the > cache to evict itself whenever we need to refer back to the object_id. Ok, will look into it. > I mentioned in my reply to Stolee earlier, but I think that we should > either (a) try to remember what this is for and document it, or (b) > remove it. Let us for now remove it. Thanks :)
Derrick Stolee <derrickstolee@github.com> wrote: > I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111). > > I noticed when looking at the constant in patch 2. Yes, you're right. It's kind of embarrassment for me :) If the flag was Oxf it would enable all the extensions.
On Tue, Jun 21, 2022 at 02:01:14PM +0530, Abhradeep Chakraborty wrote: > Taylor Blau <me@ttaylorr.com> wrote: > > > Abhradeep -- do you have any thoughts about what this might be used for? > > I'll try to remember it myself, but I imagine that we could just as > > easily remove this altogether and avoid the confusion. > > Honestly, I never understood the logic behind adding this flag option. > I thought you have a reason to do that. Even I was thinking of curving > it to 1 byte. I will remove it then. I think removing it makes more sense. Since many of the other fields are 4-bytes wide, it's important for alignment purposes that those fields have addresses which are a multiple of four (relative to the start of the region, hence the 4-byte wide flags field). But I'd just as soon get rid of it, so I think that makes sense to me. Thanks, Taylor
On Tue, Jun 21, 2022 at 02:52:53PM +0530, Abhradeep Chakraborty wrote: > Taylor Blau <me@ttaylorr.com> wrote: > > I remember we had a brief off-list discussion about whether we should > > store the full object IDs in the offset table, or whether we could store > > their pack- or index-relative ordering. Is there a reason to prefer one > > or the other? > > > > I don't think we need to explain the choice fully in the documentation > > in this patch, but it may be worth thinking about separately > > nonetheless. We can store either order and convert it to an object ID in > > constant time. > > > > To figure out which is best, I would recommend trying a few different > > choices here and seeing how they do or don't impact your performance > > testing. > > I think at that time I thought it would add extra cost of computing > the actual commit ids from those index position. So, I didn't go > further here. It should be negligible relative to everything else, I would imagine. The function that converts an index position into an object ID is `nth_packed_object_id()`. > I still have a feeling that there is some way to get rid of this > list of commit ids. But at the same time, I do not want to add > extra computation to the code. I'm hoping that the additional complexity is minor. And if we can save some extra bytes that aren't necessary in the first place without compromising on performance, I think that's worthwhile to do. Thanks, Taylor
On Tue, Jun 21, 2022 at 03:38:00PM +0530, Abhradeep Chakraborty wrote: > Derrick Stolee <derrickstolee@github.com> wrote: > > > I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111). > > > > I noticed when looking at the constant in patch 2. > > Yes, you're right. It's kind of embarrassment for me :) It happens ;). Let's use 0x10 instead. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> wrote: > It should be negligible relative to everything else, I would imagine. > The function that converts an index position into an object ID is > `nth_packed_object_id()`. > > > I still have a feeling that there is some way to get rid of this > > list of commit ids. But at the same time, I do not want to add > > extra computation to the code. > > I'm hoping that the additional complexity is minor. And if we can save > some extra bytes that aren't necessary in the first place without > compromising on performance, I think that's worthwhile to do. Ok. I will look into it then. Most of the reviews has been addressed. Hope I will be able to submit it soon. Thanks :)
diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index 04b3ec21785..34e98787b78 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required. pack/MIDX. The format and meaning of the name-hash is described below. + ** {empty} + BITMAP_OPT_LOOKUP_TABLE (0xf) : ::: + If present, the end of the bitmap file contains a table + containing a list of `N` object ids, a list of pairs of + offset and xor offset of respective objects, and 4-byte + integer denoting the flags (currently none). The format + and meaning of the table is described below. + 4-byte entry count (network byte order) The total count of entries (bitmapped commits) in this bitmap index. @@ -205,3 +213,26 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. If implementations want to choose a different hashing scheme, they are free to do so, but MUST allocate a new header flag (because comparing hashes made under two different schemes would be pointless). + +Commit lookup table +------------------- + +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap` +contains a lookup table specifying the positions of commits which have a +bitmap. + +For a `.bitmap` containing `nr_entries` reachability bitmaps, the format +is as follows: + + - `nr_entries` object names. + + - `nr_entries` pairs of 4-byte integers, each in network order. + The first holds the offset from which that commit's bitmap can + be read. The second number holds the position of the commit + whose bitmap the current bitmap is xor'd with in lexicographic + order, or 0xffffffff if the current commit is not xor'd with + anything. + + - One 4-byte network byte order integer specifying + table-specific flags. None exist currently, so this is always + "0".