Message ID | f72bf11e6efb4690ae808c0b56c3991c2b1ef266.1656924376.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | bitmap: integrate a lookup table extension to the bitmap format | expand |
Hi Abhradeep, On 04/07/2022 09:46, Abhradeep Chakraborty via GitGitGadget wrote: > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > > When reading bitmap file, Git loads each and every bitmap one by one > even if all the bitmaps are not required. A "bitmap lookup table" > extension to the bitmap format can reduce the overhead of loading > bitmaps which stores a list of bitmapped commit id pos (in the midx > or pack, along with their offset and xor offset. This way git can > load only the necessary bitmaps without loading the previous bitmaps. > > Older versions of Git ignore the lookup table extension and don't > throw any kind of warning or error while parsing the bitmap file. > > Add some information for the new "bitmap lookup table" extension in the > bitmap-format documentation. Not sure if this is new in this extension, but should there be a link or two to the basics of XOR compression and some of the bitmap look up techniques? It's not always obvious if these techniques are 'heuristic' and only have partial commit data, or they have all the commits listed, Nor how/why they work. My point is more about giving new readers a hand-up in their understanding, rather than simple implementation details for those who already know what is going on. For example, are there any external articles that you found helpful in getting started that could be referenced somewhere in the docs? Separately I'm preparing a short series on adding 'reachability bitmap' and 'commit graph' (among other stuff) to the glossary as part of giving folks [0] stepping stones to cross the chasm of understanding Philip [0] me included;-) > > Mentored-by: Taylor Blau <me@ttaylorr.com> > Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com> > Co-Authored-by: Taylor Blau <me@ttaylorr.com> > Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com> > --- > Documentation/technical/bitmap-format.txt | 39 +++++++++++++++++++++++ > 1 file changed, 39 insertions(+) > > diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt > index 04b3ec21785..c30dc177643 100644 > --- a/Documentation/technical/bitmap-format.txt > +++ b/Documentation/technical/bitmap-format.txt > @@ -67,6 +67,17 @@ MIDXs, both the bit-cache and rev-cache extensions are required. > pack/MIDX. The format and meaning of the name-hash is > described below. > > + ** {empty} > + BITMAP_OPT_LOOKUP_TABLE (0x10): ::: > + If present, the end of the bitmap file contains a table > + containing a list of `N` <commit_pos, offset, xor_row> > + triplets. The format and meaning of the table is described > + below. > ++ > +NOTE: Unlike the xor_offset used to compress an individual bitmap, > +`xor_row` stores an *absolute* index into the lookup table, not a location > +relative to the current entry. > + > 4-byte entry count (network byte order) > > The total count of entries (bitmapped commits) in this bitmap index. > @@ -205,3 +216,31 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. > If implementations want to choose a different hashing scheme, they are > free to do so, but MUST allocate a new header flag (because comparing > hashes made under two different schemes would be pointless). > + > +Commit lookup table > +------------------- > + > +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)` > +bytes (preceding the name-hash cache and trailing hash) of the `.bitmap` > +file contains a lookup table specifying the information needed to get > +the desired bitmap from the entries without parsing previous unnecessary > +bitmaps. > + > +For a `.bitmap` containing `nr_entries` reachability bitmaps, the table > +contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets > +(sorted in the ascending order of `commit_pos`). The content of i'th > +triplet is - > + > + * {empty} > + commit_pos (4 byte integer, network byte order): :: > + It stores the object position of a commit (in the midx or pack > + index). > + > + * {empty} > + offset (8 byte integer, network byte order): :: > + The offset from which that commit's bitmap can be read. > + > + * {empty} > + xor_row (4 byte integer, network byte order): :: > + The position of the triplet whose bitmap is used to compress > + this one, or `0xffffffff` if no such bitmap exists.
Hello Philip, Philip Oakley <philipoakley@iee.email> wrote: > Not sure if this is new in this extension, but should there be a link or > two to the basics of XOR compression and some of the bitmap look up > techniques? > > It's not always obvious if these techniques are 'heuristic' and only > have partial commit data, or they have all the commits listed, Nor > how/why they work. My point is more about giving new readers a hand-up > in their understanding, rather than simple implementation details for > those who already know what is going on. For example, are there any > external articles that you found helpful in getting started that could > be referenced somewhere in the docs? As this series is only about adding a lookup-table extension (and not about bitmap itself), I am not sure whether it's good to include those things in this series. But I agree with your point that it should be able build a logical understanding among the new readers. There are some external articles[1] which talk about bitmap internals. But I think it would be better if we can make a new doc file (may be `Documentation/technical/reachability-bitmaps.txt` or similar) rather than putting those details in the `bitmap-format.txt` (As the name suggests, this file should only contain format details of bitmaps). That file would provide the answers of "Why bitmaps", "how they are stored", "How they are fetched", "how they work with pack-objects, git-fetch, midx etc.", "Detailed explanation of each bitmap extension" , and lastly "what are the future works" (if any). What do you think? > Separately I'm preparing a short series on adding 'reachability bitmap' > and 'commit graph' (among other stuff) to the glossary as part of giving > folks [0] stepping stones to cross the chasm of understanding Great! Thanks :) [1] https://github.blog/2015-09-22-counting-objects/, https://github.blog/2021-04-29-scaling-monorepo-maintenance/
On 09/07/2022 08:53, Abhradeep Chakraborty wrote: > Hello Philip, > > Philip Oakley <philipoakley@iee.email> wrote: > >> Not sure if this is new in this extension, but should there be a link or >> two to the basics of XOR compression and some of the bitmap look up >> techniques? >> >> It's not always obvious if these techniques are 'heuristic' and only >> have partial commit data, or they have all the commits listed, Nor >> how/why they work. My point is more about giving new readers a hand-up >> in their understanding, rather than simple implementation details for >> those who already know what is going on. For example, are there any >> external articles that you found helpful in getting started that could >> be referenced somewhere in the docs? > As this series is only about adding a lookup-table extension (and not > about bitmap itself), I am not sure whether it's good to include those > things in this series. Thanks for the clarification. I must have slight misread some of the discussions and falsely thought it was the XOR compression (which is a technique I wasn't really aware of), that was being provided by the extension - Where would it be best for me to look up the background to your "extension" project? > But I agree with your point that it should be > able build a logical understanding among the new readers. *nod* > > There are some external articles[1] which talk about bitmap internals. > But I think it would be better if we can make a new doc file (may be > `Documentation/technical/reachability-bitmaps.txt` or similar) rather > than putting those details in the `bitmap-format.txt` Thanks for the two links. In general I agree about the format document. > (As the name > suggests, this file should only contain format details of bitmaps). > That file would provide the answers of "Why bitmaps", "how they are > stored", "How they are fetched", "how they work with pack-objects, > git-fetch, midx etc.", "Detailed explanation of each bitmap extension" > , and lastly "what are the future works" (if any). One thing I've realised on reflection is that I'm unclear how the 'reachability bitmaps' and the 'commit-graph file' techniques relate to each other (and to the ODB DAG), and what features they pick out within their heuristic, explained at just enough level to allow folks to appreciate what the options that select them will do for their use case. > > What do you think? I'd be happy to collate contributions, suggestions and thoughts. Trying to create these good introductory descriptions can be really difficult, as you can only step into the same river once (the 'reading for the first time problem' of not being able to un-hear the explanations of others when reading a 2nd draft...) > >> Separately I'm preparing a short series on adding 'reachability bitmap' >> and 'commit graph' (among other stuff) to the glossary as part of giving >> folks [0] stepping stones to cross the chasm of understanding > Great! > > Thanks :) > > [1] https://github.blog/2015-09-22-counting-objects/, https://github.blog/2021-04-29-scaling-monorepo-maintenance/ Thank you.
On Sun, Jul 10, 2022 at 04:01:11PM +0100, Philip Oakley wrote: > >> Not sure if this is new in this extension, but should there be a link or > >> two to the basics of XOR compression and some of the bitmap look up > >> techniques? > >> > >> It's not always obvious if these techniques are 'heuristic' and only > >> have partial commit data, or they have all the commits listed, Nor > >> how/why they work. My point is more about giving new readers a hand-up > >> in their understanding, rather than simple implementation details for > >> those who already know what is going on. For example, are there any > >> external articles that you found helpful in getting started that could > >> be referenced somewhere in the docs? > > As this series is only about adding a lookup-table extension (and not > > about bitmap itself), I am not sure whether it's good to include those > > things in this series. > > Thanks for the clarification. I must have slight misread some of the > discussions and falsely thought it was the XOR compression (which is a > technique I wasn't really aware of), that was being provided by the > extension - Where would it be best for me to look up the background to > your "extension" project? Yeah, Abhradeep is right that the XOR compression isn't new, we already serialize bitmaps with optional XOR offsets. The gist is that we give an offset of some previous bitmap that is used to compress the current one by XORing the bits in the current bitamp with the previous one. These XOR-compressed bitmaps are often sparse, so they compress well and reduce the overall size of the .bitmap. A slightly more detailed overview can be found in Documentation/technical/bitmap-format.txt under the bullet point reading "1-byte XOR-offset". Thanks, Taylor
On 15/07/2022 00:15, Taylor Blau wrote: > On Sun, Jul 10, 2022 at 04:01:11PM +0100, Philip Oakley wrote: >>>> Not sure if this is new in this extension, but should there be a link or >>>> two to the basics of XOR compression and some of the bitmap look up >>>> techniques? >>>> >>>> It's not always obvious if these techniques are 'heuristic' and only >>>> have partial commit data, or they have all the commits listed, Nor >>>> how/why they work. My point is more about giving new readers a hand-up >>>> in their understanding, rather than simple implementation details for >>>> those who already know what is going on. For example, are there any >>>> external articles that you found helpful in getting started that could >>>> be referenced somewhere in the docs? >>> As this series is only about adding a lookup-table extension (and not >>> about bitmap itself), I am not sure whether it's good to include those >>> things in this series. >> Thanks for the clarification. I must have slight misread some of the >> discussions and falsely thought it was the XOR compression (which is a >> technique I wasn't really aware of), that was being provided by the >> extension - Where would it be best for me to look up the background to >> your "extension" project? > Yeah, Abhradeep is right that the XOR compression isn't new, we already > serialize bitmaps with optional XOR offsets. The gist is that we give an > offset of some previous bitmap that is used to compress the current one > by XORing the bits in the current bitamp with the previous one. These > XOR-compressed bitmaps are often sparse, so they compress well and > reduce the overall size of the .bitmap. I was thinking of a short paragraph that covers the broader 'why' aspects, rather than the what/how. For me, XOR is a 'new' compression method that (IIUC) takes advantage of certain features of the way the data is arranged, such that the XOR has lots of leading zeros, leading to the compression mentioned. I think it's that we sort on oid name, so that despite the oid being long, we have (typically) sufficient oids that the leading XOR bits of adjacent pairs allows effective compression. But I could have guessed wildly wrong. I'd been looking at https://www.timescale.com/blog/time-series-compression-algorithms-explained/ which gave an overview for sorted floats. I'd not had time to review the paper "Gorilla: A Fast, Scalable, In-Memory Time Series Database" http://www.vldb.org/pvldb/vol8/p1816-teller.pdf > > A slightly more detailed overview can be found in > Documentation/technical/bitmap-format.txt under the bullet point reading > "1-byte XOR-offset". > A separate point is the linkage (or not) between the older reachability bit maps, and the commit graph, which sound to be independent options and features, yet appear rather interrelated. Thanks Philip
On Sun, Jul 10, 2022 at 8:31 PM Philip Oakley <philipoakley@iee.email> wrote: > > On 09/07/2022 08:53, Abhradeep Chakraborty wrote: > > Hello Philip, > > > > Philip Oakley <philipoakley@iee.email> wrote: > > > >> Not sure if this is new in this extension, but should there be a link or > >> two to the basics of XOR compression and some of the bitmap look up > >> techniques? > >> > >> It's not always obvious if these techniques are 'heuristic' and only > >> have partial commit data, or they have all the commits listed, Nor > >> how/why they work. My point is more about giving new readers a hand-up > >> in their understanding, rather than simple implementation details for > >> those who already know what is going on. For example, are there any > >> external articles that you found helpful in getting started that could > >> be referenced somewhere in the docs? > > As this series is only about adding a lookup-table extension (and not > > about bitmap itself), I am not sure whether it's good to include those > > things in this series. > > Thanks for the clarification. I must have slight misread some of the > discussions and falsely thought it was the XOR compression (which is a > technique I wasn't really aware of), that was being provided by the > extension - Where would it be best for me to look up the background to > your "extension" project? Sorry that I missed this message. I got the information related to this project from the gsoc project ideas[1] page, additionally you can see the comments[2]. [1] https://git.github.io/SoC-2022-Ideas/ [2] https://lore.kernel.org/git/YNovuzAsaEb2uIaa@nand.local/ > > (As the name > > suggests, this file should only contain format details of bitmaps). > > That file would provide the answers of "Why bitmaps", "how they are > > stored", "How they are fetched", "how they work with pack-objects, > > git-fetch, midx etc.", "Detailed explanation of each bitmap extension" > > , and lastly "what are the future works" (if any). > > One thing I've realised on reflection is that I'm unclear how the > 'reachability bitmaps' and the 'commit-graph file' techniques relate to > each other (and to the ODB DAG), and what features they pick out within > their heuristic, explained at just enough level to allow folks to > appreciate what the options that select them will do for their use case. I am not familiar with 'commit-graph file', so I can't tell you about that. But for bitmaps, you can look at the introductory patches[3]. After that, if you wish, you can also inspect the code related to bitmaps. [3] https://github.com/gitster/git/commit/e127310 > > What do you think? > > I'd be happy to collate contributions, suggestions and thoughts. > > Trying to create these good introductory descriptions can be really > difficult, as you can only step into the same river once (the 'reading > for the first time problem' of not being able to un-hear the > explanations of others when reading a 2nd draft...) I agree ;-) Thanks :)
diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index 04b3ec21785..c30dc177643 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -67,6 +67,17 @@ MIDXs, both the bit-cache and rev-cache extensions are required. pack/MIDX. The format and meaning of the name-hash is described below. + ** {empty} + BITMAP_OPT_LOOKUP_TABLE (0x10): ::: + If present, the end of the bitmap file contains a table + containing a list of `N` <commit_pos, offset, xor_row> + triplets. The format and meaning of the table is described + below. ++ +NOTE: Unlike the xor_offset used to compress an individual bitmap, +`xor_row` stores an *absolute* index into the lookup table, not a location +relative to the current entry. + 4-byte entry count (network byte order) The total count of entries (bitmapped commits) in this bitmap index. @@ -205,3 +216,31 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. If implementations want to choose a different hashing scheme, they are free to do so, but MUST allocate a new header flag (because comparing hashes made under two different schemes would be pointless). + +Commit lookup table +------------------- + +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)` +bytes (preceding the name-hash cache and trailing hash) of the `.bitmap` +file contains a lookup table specifying the information needed to get +the desired bitmap from the entries without parsing previous unnecessary +bitmaps. + +For a `.bitmap` containing `nr_entries` reachability bitmaps, the table +contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets +(sorted in the ascending order of `commit_pos`). The content of i'th +triplet is - + + * {empty} + commit_pos (4 byte integer, network byte order): :: + It stores the object position of a commit (in the midx or pack + index). + + * {empty} + offset (8 byte integer, network byte order): :: + The offset from which that commit's bitmap can be read. + + * {empty} + xor_row (4 byte integer, network byte order): :: + The position of the triplet whose bitmap is used to compress + this one, or `0xffffffff` if no such bitmap exists.