diff mbox series

[01/32] doc hash-file-transition: A map file for mapping between sha1 and sha256

Message ID 20230908231049.2035003-1-ebiederm@xmission.com (mailing list archive)
State New, archived
Headers show
Series SHA256 and SHA1 interoperability | expand

Commit Message

Eric W. Biederman Sept. 8, 2023, 11:10 p.m. UTC
The v3 pack index file as documented has a lot of complexity making it
difficult to implement correctly.  I worked with bryan's preliminary
implementation and it took several passes to get the bugs out.

The complexity also requires multiple table look-ups to find all of
the information that is needed to translate from one kind of oid to
another.  Which can't be good for cache locality.

Even worse coming up with a new index file version requires making
changes that have the potentialy to break anything that uses the index
of a pack file.

Instead of continuing to deal with the chance of braking things
besides the oid mapping functionality, the additional complexity in
the file format, and worry if the performance would be reasonable I
stripped down the problem to it's fundamental complexity and came up
with a file format that is exactly about mapping one kind of oid to
another, and only supports two kinds of oids.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 .../technical/hash-function-transition.txt    | 40 +++++++++++++++++++
 1 file changed, 40 insertions(+)

Comments

brian m. carlson Sept. 10, 2023, 2:24 p.m. UTC | #1
On 2023-09-08 at 23:10:18, Eric W. Biederman wrote:
> The v3 pack index file as documented has a lot of complexity making it
> difficult to implement correctly.  I worked with bryan's preliminary
> implementation and it took several passes to get the bugs out.
> 
> The complexity also requires multiple table look-ups to find all of
> the information that is needed to translate from one kind of oid to
> another.  Which can't be good for cache locality.
> 
> Even worse coming up with a new index file version requires making
> changes that have the potentialy to break anything that uses the index
> of a pack file.
> 
> Instead of continuing to deal with the chance of braking things
> besides the oid mapping functionality, the additional complexity in
> the file format, and worry if the performance would be reasonable I
> stripped down the problem to it's fundamental complexity and came up
> with a file format that is exactly about mapping one kind of oid to
> another, and only supports two kinds of oids.

I think this is a fine approach, and as I'm sure you noticed from my
series, it's a lot more robust than trying to implement pack v3.  I'd be
fine with going with this approach instead of pack v3.
Eric W. Biederman Sept. 10, 2023, 6:07 p.m. UTC | #2
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2023-09-08 at 23:10:18, Eric W. Biederman wrote:
>> The v3 pack index file as documented has a lot of complexity making it
>> difficult to implement correctly.  I worked with bryan's preliminary
>> implementation and it took several passes to get the bugs out.
>> 
>> The complexity also requires multiple table look-ups to find all of
>> the information that is needed to translate from one kind of oid to
>> another.  Which can't be good for cache locality.
>> 
>> Even worse coming up with a new index file version requires making
>> changes that have the potentialy to break anything that uses the index
>> of a pack file.
>> 
>> Instead of continuing to deal with the chance of braking things
>> besides the oid mapping functionality, the additional complexity in
>> the file format, and worry if the performance would be reasonable I
>> stripped down the problem to it's fundamental complexity and came up
>> with a file format that is exactly about mapping one kind of oid to
>> another, and only supports two kinds of oids.
>
> I think this is a fine approach, and as I'm sure you noticed from my
> series, it's a lot more robust than trying to implement pack v3.  I'd be
> fine with going with this approach instead of pack v3.

I think I got your pack v3 working but it was at a minimum a serious
distraction.

I worry a little bit that this might leave some performance on the
table, with something like a 256 way jump table like we have in the
index file.

Still I figure we can start simple and when we start optimizing and
profiling we can revisit the format if it shows up as a performance
issue.

Eric
brian m. carlson Sept. 12, 2023, 12:14 a.m. UTC | #3
On 2023-09-08 at 23:10:18, Eric W. Biederman wrote:
> The v3 pack index file as documented has a lot of complexity making it
> difficult to implement correctly.  I worked with bryan's preliminary
> implementation and it took several passes to get the bugs out.
> 
> The complexity also requires multiple table look-ups to find all of
> the information that is needed to translate from one kind of oid to
> another.  Which can't be good for cache locality.
> 
> Even worse coming up with a new index file version requires making
> changes that have the potentialy to break anything that uses the index
> of a pack file.
> 
> Instead of continuing to deal with the chance of braking things
> besides the oid mapping functionality, the additional complexity in
> the file format, and worry if the performance would be reasonable I
> stripped down the problem to it's fundamental complexity and came up
> with a file format that is exactly about mapping one kind of oid to
> another, and only supports two kinds of oids.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  .../technical/hash-function-transition.txt    | 40 +++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
> index ed574810891c..4b937480848a 100644
> --- a/Documentation/technical/hash-function-transition.txt
> +++ b/Documentation/technical/hash-function-transition.txt
> @@ -209,6 +209,46 @@ format described in linkgit:gitformat-pack[5], just like
>  today. The content that is compressed and stored uses SHA-256 content
>  instead of SHA-1 content.
>  
> +Per Pack Mapping Table
> +~~~~~~~~~~~~~~~~~~~~~~
> +A pack compat map file (.compat) files have the following format:
> +
> +HEADER:
> +	4-byte signature:
> +	    The signature is: {'C', 'M', 'A', 'P'}
> +	1-byte version number:
> +	    Git only writes or recognizes version 1.
> +	1-byte First Object Id Version
> +	    We infer the length of object IDs (OIDs) from this value:
> +		1 => SHA-1
> +		2 => SHA-256

One thing I forgot to mention here, is that we have 32-bit format IDs
for these in the structure, so we should use them here and below.  These
are GIT_SHA1_FORMAT_ID and GIT_SHA256_FORMAT_ID.

Not that I would encourage distributing such software, but it makes it
much easier for people to experiment with additional hash algorithms (in
terms of performance, etc.) if we make the space a little sparser.

> +	1-byte Second Object Id Version
> +	    We infer the length of object IDs (OIDs) from this value:
> +		1 => SHA-1
> +		2 => SHA-256

In your new patch for the next part, you consider that there might be
multiple compatibility hash algorithms.  I had anticipated only one at
a time in my series, but I'm not opposed to multiple if you want to
support that.

However, here you're making the assumption that there are only two.  If
you want to support multiple values, we need to explicitly consider that
both here (where we need a count of object ID version and multiple
tables, one for each algorithm), and in the follow-up series.

I had not considered more than two algorithms because it substantially
complicates the code and requires us to develop n*(n-1) tables, but I'm
not the one volunteering to do most of the work here, so I'll defer to
your preference.  (I do intend to send a patch or two, though.)

It's also possible we could be somewhat provident and define the on-disk
formats for multiple algorithms and then punt on the code until later if
you prefer that.
Eric W. Biederman Sept. 12, 2023, 1:36 p.m. UTC | #4
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2023-09-08 at 23:10:18, Eric W. Biederman wrote:
>> The v3 pack index file as documented has a lot of complexity making it
>> difficult to implement correctly.  I worked with bryan's preliminary
>> implementation and it took several passes to get the bugs out.
>> 
>> The complexity also requires multiple table look-ups to find all of
>> the information that is needed to translate from one kind of oid to
>> another.  Which can't be good for cache locality.
>> 
>> Even worse coming up with a new index file version requires making
>> changes that have the potentialy to break anything that uses the index
>> of a pack file.
>> 
>> Instead of continuing to deal with the chance of braking things
>> besides the oid mapping functionality, the additional complexity in
>> the file format, and worry if the performance would be reasonable I
>> stripped down the problem to it's fundamental complexity and came up
>> with a file format that is exactly about mapping one kind of oid to
>> another, and only supports two kinds of oids.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  .../technical/hash-function-transition.txt    | 40 +++++++++++++++++++
>>  1 file changed, 40 insertions(+)
>> 
>> diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
>> index ed574810891c..4b937480848a 100644
>> --- a/Documentation/technical/hash-function-transition.txt
>> +++ b/Documentation/technical/hash-function-transition.txt
>> @@ -209,6 +209,46 @@ format described in linkgit:gitformat-pack[5], just like
>>  today. The content that is compressed and stored uses SHA-256 content
>>  instead of SHA-1 content.
>>  
>> +Per Pack Mapping Table
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +A pack compat map file (.compat) files have the following format:
>> +
>> +HEADER:
>> +	4-byte signature:
>> +	    The signature is: {'C', 'M', 'A', 'P'}
>> +	1-byte version number:
>> +	    Git only writes or recognizes version 1.
>> +	1-byte First Object Id Version
>> +	    We infer the length of object IDs (OIDs) from this value:
>> +		1 => SHA-1
>> +		2 => SHA-256
>
> One thing I forgot to mention here, is that we have 32-bit format IDs
> for these in the structure, so we should use them here and below.  These
> are GIT_SHA1_FORMAT_ID and GIT_SHA256_FORMAT_ID.
>
> Not that I would encourage distributing such software, but it makes it
> much easier for people to experiment with additional hash algorithms (in
> terms of performance, etc.) if we make the space a little sparser.

Unfortunately that ship has already sailed. If you look at pack reverse
indices, pack mtime files, multi-pack-index files, they all use an
oid_version field.  So to experiment with a new hash function a new
number has to be picked.

The only use I can find of your 4 byte format_id's is in the reftable
code.

Using a 4 byte magic number in this case also conflicts with basic
simplicity.  With a one byte field I can specify it easily, and read it
back with no special tools, and understand what it means at a glance.

I admit I can only understand what a oid version field means at a glance
because the variation of object id's is low, but that is fundamental.
We require global agreement on names.  Fundamentally git can not
support many object id transitions.  Names are just too expensive.

When I come to how the map file is specified a single byte has real
advantages.  A single byte never needs byte swapping.  So it won't
be misread.  Using a single byte for each format allows me to
keep the header for the file at 16 bytes.  Which guarantees good
alignment of everything in the file without having to be clever.

All of this is for a file that is strictly local and the entire function
of the bytes is a sanity check to make certain that something weird is
not going on, or to assist recover if something bad happens.

So in this case I don't see any additional agility provided by longer
names helping.

>> +	1-byte Second Object Id Version
>> +	    We infer the length of object IDs (OIDs) from this value:
>> +		1 => SHA-1
>> +		2 => SHA-256
>
> In your new patch for the next part, you consider that there might be
> multiple compatibility hash algorithms.  I had anticipated only one at
> a time in my series, but I'm not opposed to multiple if you want to
> support that.
>
> However, here you're making the assumption that there are only two.  If
> you want to support multiple values, we need to explicitly consider that
> both here (where we need a count of object ID version and multiple
> tables, one for each algorithm), and in the follow-up series.
>
> I had not considered more than two algorithms because it substantially
> complicates the code and requires us to develop n*(n-1) tables, but I'm
> not the one volunteering to do most of the work here, so I'll defer to
> your preference.  (I do intend to send a patch or two, though.)
>
> It's also possible we could be somewhat provident and define the on-disk
> formats for multiple algorithms and then punt on the code until later if
> you prefer that.

In the long term I anticipate people disabling compatObjectFormat and
switching to readCompatMap so they still have access to their old
objects by their original names, but they don't have the over head
of computing a compatibility hash.

In a world where there is a transition to futureHash I anticipate
the files associated with an old pack looking something like:
pack-abcdefg.compat12
pack-abcdefg.compat32

For a repository still using hash version sha256 for storage, with a
mapping to some sha1 names, and a mapping of everything to new
names for compatibility with futureHash.

After transitioning to the futureHash those files would look like:
pack-abcdefg.compat13
pack-abcdefg.compat23

I deeply and fundamentally care about having some way to look up
old names because I do that all of the time.

In my work on the linux-kernel I have found myself frequently digging
into old issues.  I have on my hard drive tglx's git import of the
old bitkeeper tree.  I also have an import of all of the old kernel
releases into git from before the code was stored in bitkeeper.  I find
myself actually using all of those trees when digging into issues.

So I think the idea that we will ever be able to get rid of the mapping
for old converted repositories is unlikely.  We have entirely too many
references out there.

Which means that for every hash format conversion a repository goes
through I am going to have another collection of old names.


I don't honestly anticipate ever needing to have multiple
compatObjectFormat entries specified for a single repository.  I do
agree that if we are going to worry about forward and backward
compatibility we should be robust and have a configuration file syntax
that can handle the possibility.

I do very much anticipate needing to have multiple readCompatMap
entries, and pretty much only using them in get_short_oid in
object-name.c.  It will make the loop in find_short_packed_compat_object
a little longer but that is about all that will need to be implemented
and maintained long term.


I view this compat map format a lot like the loose objects.  It is
simple and good enough to get us started.  If it turns out we need
to optimize it's simplicity means all of the interfaces in the code
to use it have already been built, and we can just concentrate on
optimizing.

Eric
diff mbox series

Patch

diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
index ed574810891c..4b937480848a 100644
--- a/Documentation/technical/hash-function-transition.txt
+++ b/Documentation/technical/hash-function-transition.txt
@@ -209,6 +209,46 @@  format described in linkgit:gitformat-pack[5], just like
 today. The content that is compressed and stored uses SHA-256 content
 instead of SHA-1 content.
 
+Per Pack Mapping Table
+~~~~~~~~~~~~~~~~~~~~~~
+A pack compat map file (.compat) files have the following format:
+
+HEADER:
+	4-byte signature:
+	    The signature is: {'C', 'M', 'A', 'P'}
+	1-byte version number:
+	    Git only writes or recognizes version 1.
+	1-byte First Object Id Version
+	    We infer the length of object IDs (OIDs) from this value:
+		1 => SHA-1
+		2 => SHA-256
+	1-byte Second Object Id Version
+	    We infer the length of object IDs (OIDs) from this value:
+		1 => SHA-1
+		2 => SHA-256
+	1-byte reserved (must be zero)
+	4-byte number of objects names contained in this mapping
+	1-byte length in bytes of shorted object names for the first object id.
+	       This is the shortest possible length needed to make the
+	       first object names unambigious.
+	1-byte reserved (must be zero)
+	1-byte length in bytes of shorted object names for the second object id.
+	       This is the shortest possible length needed to make the
+	       second object names unambigious.
+	1-byte reserved (must be zero)
+
+OBJECT NAME TABLES:
+	[Object name raw length + 4]*Number of object names
+	   This table is sorted by object name
+	   Each entry in the table is formated as:
+		[20 or 32 byte] Object name
+		4-byte index into the other object name table
+
+TRAILER:
+	checksum of the corresponding packfile, and
+
+	checksum of all of the above.
+
 Pack index
 ~~~~~~~~~~
 Pack index (.idx) files use a new v3 format that supports multiple