diff mbox series

[1/6] Documentation/technical: describe bitmap lookup table extension

Message ID 2e22ca5069af617fe23072d78efb08b26d6130be.1655728395.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series bitmap: integrate a lookup table extension to the bitmap format | expand

Commit Message

Abhradeep Chakraborty June 20, 2022, 12:33 p.m. UTC
From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

When reading bitmap file, git loads each and every bitmap one by one
even if all the bitmaps are not required. A "bitmap lookup table"
extension to the bitmap format can reduce the overhead of loading
bitmaps which stores a list of bitmapped commit oids, along with their
offset and xor offset. This way git can load only the neccesary bitmaps
without loading the previous bitmaps.

Add some information for the new "bitmap lookup table" extension in the
bitmap-format documentation.

Co-Authored-by: Taylor Blau <ttaylorr@github.com>
Mentored-by: Taylor Blau <ttaylorr@github.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 Documentation/technical/bitmap-format.txt | 31 +++++++++++++++++++++++
 1 file changed, 31 insertions(+)

Comments

Derrick Stolee June 20, 2022, 4:56 p.m. UTC | #1
On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> 
> When reading bitmap file, git loads each and every bitmap one by one
> even if all the bitmaps are not required. A "bitmap lookup table"
> extension to the bitmap format can reduce the overhead of loading
> bitmaps which stores a list of bitmapped commit oids, along with their
> offset and xor offset. This way git can load only the neccesary bitmaps
> without loading the previous bitmaps.
> 
> Add some information for the new "bitmap lookup table" extension in the
> bitmap-format documentation.


> @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
>  			pack/MIDX. The format and meaning of the name-hash is
>  			described below.
>  
> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::
> +			If present, the end of the bitmap file contains a table
> +			containing a list of `N` object ids, a list of pairs of
> +			offset and xor offset of respective objects, and 4-byte
> +			integer denoting the flags (currently none). The format
> +			and meaning of the table is described below.
> +

Here, you are adding a new flag that indicates that the end of the file
contains this extra extension. This works because the size of the
extension is predictable. As long as any future extensions are also of
a predictable size, then we can continue adding them via flags in this
way.

This is better than updating the full file format to do something like
like use the chunk format API, especially because this format is shared
across other tools (JGit being mentioned frequently).

It might be worth mentioning in your commit message what happens when an
older version of Git (or JGit) notices this flag. Does it refuse to
operate on the .bitmap file? Does it give a warning or die? It would be
nice if this extension could be ignored (it seems like adding the extra
data at the end does not stop the bitmap data from being understood).

> +
> +Commit lookup table
> +-------------------
> +
> +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap`
> +contains a lookup table specifying the positions of commits which have a
> +bitmap.

Perhaps it would be better to say "the last N * (HASH_LEN + 8) + 4 bytes
preceding the trailing hash" or something? This gives us a concrete way
to compute the start of the table, while also being clear that the table
is included in the trailing hash.

> +For a `.bitmap` containing `nr_entries` reachability bitmaps, the format
> +is as follows:
> +
> +	- `nr_entries` object names.

Could you expand that these objects are commit OIDs, one for each bitmap
in the file. Are they sorted in lexicographical order for binary search,
or are we expecting to read the entire table into a hashtable in-memory?

> +	- `nr_entries` pairs of 4-byte integers, each in network order.
> +	  The first holds the offset from which that commit's bitmap can
> +	  be read. The second number holds the position of the commit
> +	  whose bitmap the current bitmap is xor'd with in lexicographic
> +	  order, or 0xffffffff if the current commit is not xor'd with
> +	  anything.

Interesting to give the xor chains directions here. You say "position"
here for the second commit: do you mean within the list of object names
as opposed to the offset? That would make the most sense so we can trace
the full list of XORs we need to make all at once.

Are .bitmap files already constrained to 4GB, so these 32-bit offsets
make sense? Using 64-bit offsets would be a small cost here, I think,
without needing to do any fancy "overflow" tables that could introduce
a variable-length extension.

> +	- One 4-byte network byte order integer specifying
> +	  table-specific flags. None exist currently, so this is always
> +	  "0".

I'm guessing this is at the end of the extension because a future flag
could modify the length of the extension, so we need the flags to be
in a predictable location. Could we make that clear somewhere?

How does Git react to seeing flags here that it does not recognize?
It seems that Git should ignore the lookup table but continue using the
rest of the .bitmap file as it did before, yes?

Thanks,
-Stolee
Taylor Blau June 20, 2022, 5:09 p.m. UTC | #2
On Mon, Jun 20, 2022 at 12:56:27PM -0400, Derrick Stolee wrote:
> On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> > From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> >
> > When reading bitmap file, git loads each and every bitmap one by one
> > even if all the bitmaps are not required. A "bitmap lookup table"
> > extension to the bitmap format can reduce the overhead of loading
> > bitmaps which stores a list of bitmapped commit oids, along with their
> > offset and xor offset. This way git can load only the neccesary bitmaps
> > without loading the previous bitmaps.
> >
> > Add some information for the new "bitmap lookup table" extension in the
> > bitmap-format documentation.
>
>
> > @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
> >  			pack/MIDX. The format and meaning of the name-hash is
> >  			described below.
> >
> > +			** {empty}
> > +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::
> > +			If present, the end of the bitmap file contains a table
> > +			containing a list of `N` object ids, a list of pairs of
> > +			offset and xor offset of respective objects, and 4-byte
> > +			integer denoting the flags (currently none). The format
> > +			and meaning of the table is described below.
> > +
>
> Here, you are adding a new flag that indicates that the end of the file
> contains this extra extension. This works because the size of the
> extension is predictable. As long as any future extensions are also of
> a predictable size, then we can continue adding them via flags in this
> way.

Right; any extensions that are added to the existing .bitmap format must
have a size that is predictable in order for readers to locate the next
extension, if any.

> This is better than updating the full file format to do something like
> like use the chunk format API, especially because this format is shared
> across other tools (JGit being mentioned frequently).

Agreed. Abhradeep and I discussed whether or not it was worth exploring
a new .bitmap format, and the consensus we reached was that it may be
required in the future (if we explored a compression scheme other than
EWAH or made some other backwards-incompatible change), but as of yet it
isn't necessary. So we avoided it to eliminate unnecessary churn,
especially of on-disk formats.

> It might be worth mentioning in your commit message what happens when an
> older version of Git (or JGit) notices this flag. Does it refuse to
> operate on the .bitmap file? Does it give a warning or die? It would be
> nice if this extension could be ignored (it seems like adding the extra
> data at the end does not stop the bitmap data from being understood).

I agree. The bitmap reader does not warn or die when it sees
unrecognized extensions, that way new extensions can be added without
rendering all previously-written bitmaps useless. But in order to
understand an extension on bit N, the reader must also understand
extensions N-1, N-2, and so on (in order to locate the end of
extension N).

> > +	- `nr_entries` pairs of 4-byte integers, each in network order.
> > +	  The first holds the offset from which that commit's bitmap can
> > +	  be read. The second number holds the position of the commit
> > +	  whose bitmap the current bitmap is xor'd with in lexicographic
> > +	  order, or 0xffffffff if the current commit is not xor'd with
> > +	  anything.
>
> Interesting to give the xor chains directions here. You say "position"
> here for the second commit: do you mean within the list of object names
> as opposed to the offset? That would make the most sense so we can trace
> the full list of XORs we need to make all at once.
>
> Are .bitmap files already constrained to 4GB, so these 32-bit offsets
> make sense? Using 64-bit offsets would be a small cost here, I think,
> without needing to do any fancy "overflow" tables that could introduce
> a variable-length extension.

Yeah, we should support >4GB bitmaps here. An overflow table could work,
but I agree with Stolee that in practice it won't matter. Most .bitmap
files that I've looked at in the wild have around ~500 entries at most,
and are usually small. So the cost of widening this section isn't a big
deal.

But note that the entry count is only one component of the bitmap size:
the individual entry lengths obviously matter too. And in repositories
whose bitmaps exceed 500 entries, the entries themselves are often
several million bits long (before compression) already. So it is
certainly possible to exceed 4GB without having an astronomical entry
count.

So doubling the width of this extension might add an extra 250 KiB or
so, which is negligible.

I would much rather see us do that in cases where it makes sense (small
number of entries, minimal cost to wider records, etc.) than adding
unnecessary complexity via an extra lookup table for >4GB offsets.

> > +	- One 4-byte network byte order integer specifying
> > +	  table-specific flags. None exist currently, so this is always
> > +	  "0".
>
> I'm guessing this is at the end of the extension because a future flag
> could modify the length of the extension, so we need the flags to be
> in a predictable location. Could we make that clear somewhere?

I can't remember what I had on my mind when I wrote this ;-).

Abhradeep -- do you have any thoughts about what this might be used for?
I'll try to remember it myself, but I imagine that we could just as
easily remove this altogether and avoid the confusion.

> How does Git react to seeing flags here that it does not recognize?
> It seems that Git should ignore the lookup table but continue using the
> rest of the .bitmap file as it did before, yes?

(See above).

Thanks,
Taylor
Taylor Blau June 20, 2022, 5:21 p.m. UTC | #3
On Mon, Jun 20, 2022 at 12:33:09PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> When reading bitmap file, git loads each and every bitmap one by one
> even if all the bitmaps are not required. A "bitmap lookup table"
> extension to the bitmap format can reduce the overhead of loading
> bitmaps which stores a list of bitmapped commit oids, along with their
> offset and xor offset. This way git can load only the neccesary bitmaps
> without loading the previous bitmaps.

Well put. It might help to have a concrete example of where we expect
this to help and not help. I suspect that some of this will show up in
your work updating the perf suite to use this new table, but I imagine
that we'll find something like:

    In cases where the result can be read or computed without
    significant additional traversal (e.g., all commits of interest
    already have bitmaps computed), we can save some time loading and
    parsing a majority of the bitmap file that we will never read.

    But in cases where the bitmaps are out-of-date, or there is
    significant traversal required to go from the reference tips to
    what's contained in the .bitmap file, this table provides minimal
    benefit (or something).

Of course, you should verify that that is actually true before we insert
it into the commit message as such ;-). But that sort of information may
help readers understand what the purpose of this change is towards the
beinning of the series.

> Add some information for the new "bitmap lookup table" extension in the
> bitmap-format documentation.
>
> Co-Authored-by: Taylor Blau <ttaylorr@github.com>
> Mentored-by: Taylor Blau <ttaylorr@github.com>

Here and elsewhere: I typically use my <me@ttaylorr.com> address when
contributing to Git. So any trailers that mention my email or commits
that you send on my behalf should use that address, too.

> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  Documentation/technical/bitmap-format.txt | 31 +++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
> index 04b3ec21785..34e98787b78 100644
> --- a/Documentation/technical/bitmap-format.txt
> +++ b/Documentation/technical/bitmap-format.txt
> @@ -67,6 +67,14 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
>  			pack/MIDX. The format and meaning of the name-hash is
>  			described below.
>
> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::

It the space between "(0xf)" and the first ":" intentional? Similarly,
should there be two or three colons at the end (either "::" or ":::")?

> +			If present, the end of the bitmap file contains a table
> +			containing a list of `N` object ids, a list of pairs of
> +			offset and xor offset of respective objects, and 4-byte
> +			integer denoting the flags (currently none). The format
> +			and meaning of the table is described below.
> +

I remember we had a brief off-list discussion about whether we should
store the full object IDs in the offset table, or whether we could store
their pack- or index-relative ordering. Is there a reason to prefer one
or the other?

I don't think we need to explain the choice fully in the documentation
in this patch, but it may be worth thinking about separately
nonetheless. We can store either order and convert it to an object ID in
constant time.

To figure out which is best, I would recommend trying a few different
choices here and seeing how they do or don't impact your performance
testing.

>  		4-byte entry count (network byte order)
>
>  			The total count of entries (bitmapped commits) in this bitmap index.
> @@ -205,3 +213,26 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
>  If implementations want to choose a different hashing scheme, they are
>  free to do so, but MUST allocate a new header flag (because comparing
>  hashes made under two different schemes would be pointless).
> +
> +Commit lookup table
> +-------------------
> +
> +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap`
> +contains a lookup table specifying the positions of commits which have a
> +bitmap.
> +
> +For a `.bitmap` containing `nr_entries` reachability bitmaps, the format
> +is as follows:
> +
> +	- `nr_entries` object names.
> +
> +	- `nr_entries` pairs of 4-byte integers, each in network order.
> +	  The first holds the offset from which that commit's bitmap can
> +	  be read. The second number holds the position of the commit
> +	  whose bitmap the current bitmap is xor'd with in lexicographic
> +	  order, or 0xffffffff if the current commit is not xor'd with
> +	  anything.

A couple of small thoughts here. I wonder if we'd get better locality if
we made each record look something like:

    (object_id, offset, xor_pos)

Where object_id is either 20- or 4-bytes long (depending if we store the
full object ID, or some 4-byte identifier that allows us to discover
it), offset is 8 bytes long, and xor_pos is 4-bytes (since in practice
we don't support packs or MIDXs which have more than 2^32-1 objects).

In the event that this table doesn't fit into a single cache line, I
think we'll get better performance out of reading it by not forcing the
cache to evict itself whenever we need to refer back to the object_id.

> +	- One 4-byte network byte order integer specifying
> +	  table-specific flags. None exist currently, so this is always
> +	  "0".

I mentioned in my reply to Stolee earlier, but I think that we should
either (a) try to remember what this is for and document it, or (b)
remove it.

Thanks,
Taylor
Derrick Stolee June 20, 2022, 8:21 p.m. UTC | #4
On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::

I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111).

I noticed when looking at the constant in patch 2.

Thanks,
-Stolee
Abhradeep Chakraborty June 21, 2022, 8:23 a.m. UTC | #5
Derrick Stole <derrickstolee@github.com> wrote:

> It might be worth mentioning in your commit message what happens when an
> older version of Git (or JGit) notices this flag. Does it refuse to
> operate on the .bitmap file? Does it give a warning or die? It would be
> nice if this extension could be ignored (it seems like adding the extra
> data at the end does not stop the bitmap data from being understood).

No, it doesn't refuse to operate on the .bitmap file. It just ignores the
extension. Will update the commit message.

> Perhaps it would be better to say "the last N * (HASH_LEN + 8) + 4 bytes
> preceding the trailing hash" or something? This gives us a concrete way
> to compute the start of the table, while also being clear that the table
> is included in the trailing hash.

Hmm, well said. Will update it.

> Could you expand that these objects are commit OIDs, one for each bitmap
> in the file. Are they sorted in lexicographical order for binary search,
> or are we expecting to read the entire table into a hashtable in-memory?

Yeah, of course! They are sorted in lexicographical order for binary search.


> Interesting to give the xor chains directions here. You say "position"
> here for the second commit: do you mean within the list of object names
> as opposed to the offset? That would make the most sense so we can trace
> the full list of XORs we need to make all at once.

I think I blundered here. I forgot that the xor-offset is relative to the
current bitmap. The current proposed code takes it as ABSOLUTE value and
tries to find the commit on that position (in the list of commit ids). So,
there are two faults in my code - (1) As the xor-offset have an upper limit
(which is 10 probably; not sure), any of the first 10 commits is always
selected. (2) As xor-offsets are relative to the current bitmap, it depends
On the order of the bitmaps. These bitmaps are ordered by the date of their
corresponding commit and commit ids in the lookup table are ordered
lexicographically. So, we can't use that xor-offset to find the xor'd
commit position.

Will fix it.

> Are .bitmap files already constrained to 4GB, so these 32-bit offsets
> make sense? Using 64-bit offsets would be a small cost here, I think,
> without needing to do any fancy "overflow" tables that could introduce
> a variable-length extension.

I think you're right. I should use 64-bit types here.

> I'm guessing this is at the end of the extension because a future flag
> could modify the length of the extension, so we need the flags to be
> in a predictable location. Could we make that clear somewhere?

Flags are at the end of this extension.

Thanks :)
Abhradeep Chakraborty June 21, 2022, 8:31 a.m. UTC | #6
Taylor Blau <me@ttaylorr.com> wrote:

> Abhradeep -- do you have any thoughts about what this might be used for?
> I'll try to remember it myself, but I imagine that we could just as
> easily remove this altogether and avoid the confusion.

Honestly, I never understood the logic behind adding this flag option.
I thought you have a reason to do that. Even I was thinking of curving
it to 1 byte. I will remove it then.

Thanks :)
Abhradeep Chakraborty June 21, 2022, 9:22 a.m. UTC | #7
Taylor Blau <me@ttaylorr.com> wrote:

>     In cases where the result can be read or computed without
>     significant additional traversal (e.g., all commits of interest
>     already have bitmaps computed), we can save some time loading and
>     parsing a majority of the bitmap file that we will never read.
>
>     But in cases where the bitmaps are out-of-date, or there is
>     significant traversal required to go from the reference tips to
>     what's contained in the .bitmap file, this table provides minimal
>     benefit (or something).
>
> Of course, you should verify that that is actually true before we insert
> it into the commit message as such ;-). But that sort of information may
> help readers understand what the purpose of this change is towards the
> beinning of the series.

The performance tests cover tests for command like "git rev-list --count
--objects --all", "simulated clone", "simulated fetch" etc. And I tested
it with both the Git and Linux. In both cases, the average cost of
"Without lookup table" is bigger than "with lookup table". The margin of
difference is bigger for linux. Though, I need to fix the calculation
of xor-offset (see my reply to derrick), the fix will not affect the
performance too much. So, what you're saying is true. I think I didn't
write the bitmap out-of-date test though.

> Here and elsewhere: I typically use my <me@ttaylorr.com> address when
> contributing to Git. So any trailers that mention my email or commits
> that you send on my behalf should use that address, too.

Ohh, sorry! Will fix it.

> It the space between "(0xf)" and the first ":" intentional? Similarly,
> should there be two or three colons at the end (either "::" or ":::")?

Yes, it is intentional. My previous patch (formatting the bitmap-format.txt)
uses nested description lists. ":::" means it is the level 3 description list.
The space is required else asciidoc will assume that it is level 4 description
list.

> I remember we had a brief off-list discussion about whether we should
> store the full object IDs in the offset table, or whether we could store
> their pack- or index-relative ordering. Is there a reason to prefer one
> or the other?
>
> I don't think we need to explain the choice fully in the documentation
> in this patch, but it may be worth thinking about separately
> nonetheless. We can store either order and convert it to an object ID in
> constant time.
>
> To figure out which is best, I would recommend trying a few different
> choices here and seeing how they do or don't impact your performance
> testing.

I think at that time I thought it would add extra cost of computing
the actual commit ids from those index position. So, I didn't go 
further here.

I still have a feeling that there is some way to get rid of this
list of commit ids. But at the same time, I do not want to add
extra computation to the code.

> A couple of small thoughts here. I wonder if we'd get better locality if
> we made each record look something like:
>
>     (object_id, offset, xor_pos)
>
> Where object_id is either 20- or 4-bytes long (depending if we store the
> full object ID, or some 4-byte identifier that allows us to discover
> it), offset is 8 bytes long, and xor_pos is 4-bytes (since in practice
> we don't support packs or MIDXs which have more than 2^32-1 objects).
>
> In the event that this table doesn't fit into a single cache line, I
> think we'll get better performance out of reading it by not forcing the
> cache to evict itself whenever we need to refer back to the object_id.

Ok, will look into it.

> I mentioned in my reply to Stolee earlier, but I think that we should
> either (a) try to remember what this is for and document it, or (b)
> remove it.

Let us for now remove it.

Thanks :)
Abhradeep Chakraborty June 21, 2022, 10:08 a.m. UTC | #8
Derrick Stolee <derrickstolee@github.com> wrote:

> I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111).
>
> I noticed when looking at the constant in patch 2.

Yes, you're right. It's kind of embarrassment for me :)

If the flag was Oxf it would enable all the extensions.
Taylor Blau June 22, 2022, 4:26 p.m. UTC | #9
On Tue, Jun 21, 2022 at 02:01:14PM +0530, Abhradeep Chakraborty wrote:
> Taylor Blau <me@ttaylorr.com> wrote:
>
> > Abhradeep -- do you have any thoughts about what this might be used for?
> > I'll try to remember it myself, but I imagine that we could just as
> > easily remove this altogether and avoid the confusion.
>
> Honestly, I never understood the logic behind adding this flag option.
> I thought you have a reason to do that. Even I was thinking of curving
> it to 1 byte. I will remove it then.

I think removing it makes more sense. Since many of the other fields are
4-bytes wide, it's important for alignment purposes that those fields
have addresses which are a multiple of four (relative to the start of
the region, hence the 4-byte wide flags field).

But I'd just as soon get rid of it, so I think that makes sense to me.

Thanks,
Taylor
Taylor Blau June 22, 2022, 4:29 p.m. UTC | #10
On Tue, Jun 21, 2022 at 02:52:53PM +0530, Abhradeep Chakraborty wrote:
> Taylor Blau <me@ttaylorr.com> wrote:
> > I remember we had a brief off-list discussion about whether we should
> > store the full object IDs in the offset table, or whether we could store
> > their pack- or index-relative ordering. Is there a reason to prefer one
> > or the other?
> >
> > I don't think we need to explain the choice fully in the documentation
> > in this patch, but it may be worth thinking about separately
> > nonetheless. We can store either order and convert it to an object ID in
> > constant time.
> >
> > To figure out which is best, I would recommend trying a few different
> > choices here and seeing how they do or don't impact your performance
> > testing.
>
> I think at that time I thought it would add extra cost of computing
> the actual commit ids from those index position. So, I didn't go
> further here.

It should be negligible relative to everything else, I would imagine.
The function that converts an index position into an object ID is
`nth_packed_object_id()`.

> I still have a feeling that there is some way to get rid of this
> list of commit ids. But at the same time, I do not want to add
> extra computation to the code.

I'm hoping that the additional complexity is minor. And if we can save
some extra bytes that aren't necessary in the first place without
compromising on performance, I think that's worthwhile to do.

Thanks,
Taylor
Taylor Blau June 22, 2022, 4:30 p.m. UTC | #11
On Tue, Jun 21, 2022 at 03:38:00PM +0530, Abhradeep Chakraborty wrote:
> Derrick Stolee <derrickstolee@github.com> wrote:
>
> > I think you mean 0x10 (b_1_0000) instead of 0xf (b_1111).
> >
> > I noticed when looking at the constant in patch 2.
>
> Yes, you're right. It's kind of embarrassment for me :)

It happens ;). Let's use 0x10 instead.

Thanks,
Taylor
Abhradeep Chakraborty June 22, 2022, 4:45 p.m. UTC | #12
Taylor Blau <me@ttaylorr.com> wrote:

> It should be negligible relative to everything else, I would imagine.
> The function that converts an index position into an object ID is
> `nth_packed_object_id()`.
>
> > I still have a feeling that there is some way to get rid of this
> > list of commit ids. But at the same time, I do not want to add
> > extra computation to the code.
>
> I'm hoping that the additional complexity is minor. And if we can save
> some extra bytes that aren't necessary in the first place without
> compromising on performance, I think that's worthwhile to do.

Ok. I will look into it then.

Most of the reviews has been addressed. Hope I will be able to submit
it soon.

Thanks :)
diff mbox series

Patch

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index 04b3ec21785..34e98787b78 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -67,6 +67,14 @@  MIDXs, both the bit-cache and rev-cache extensions are required.
 			pack/MIDX. The format and meaning of the name-hash is
 			described below.
 
+			** {empty}
+			BITMAP_OPT_LOOKUP_TABLE (0xf) : :::
+			If present, the end of the bitmap file contains a table
+			containing a list of `N` object ids, a list of pairs of
+			offset and xor offset of respective objects, and 4-byte
+			integer denoting the flags (currently none). The format
+			and meaning of the table is described below.
+
 		4-byte entry count (network byte order)
 
 			The total count of entries (bitmapped commits) in this bitmap index.
@@ -205,3 +213,26 @@  Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
 If implementations want to choose a different hashing scheme, they are
 free to do so, but MUST allocate a new header flag (because comparing
 hashes made under two different schemes would be pointless).
+
+Commit lookup table
+-------------------
+
+If the BITMAP_OPT_LOOKUP_TABLE flag is set, the end of the `.bitmap`
+contains a lookup table specifying the positions of commits which have a
+bitmap.
+
+For a `.bitmap` containing `nr_entries` reachability bitmaps, the format
+is as follows:
+
+	- `nr_entries` object names.
+
+	- `nr_entries` pairs of 4-byte integers, each in network order.
+	  The first holds the offset from which that commit's bitmap can
+	  be read. The second number holds the position of the commit
+	  whose bitmap the current bitmap is xor'd with in lexicographic
+	  order, or 0xffffffff if the current commit is not xor'd with
+	  anything.
+
+	- One 4-byte network byte order integer specifying
+	  table-specific flags. None exist currently, so this is always
+	  "0".