diff mbox series

[02/32] doc hash-function-transition: Replace compatObjectFormat with compatMap

Message ID 20230908231049.2035003-2-ebiederm@xmission.com (mailing list archive)
State New, archived
Headers show
Series SHA256 and SHA1 interoperability | expand

Commit Message

Eric W. Biederman Sept. 8, 2023, 11:10 p.m. UTC
Ir makes a lot of sense for the hash algorithm that determines how all
of the objects in the repostiory be an extension so that versions of
git that don't know about it won't even try.

For implementing the compatiblity maps that really is not the case.
An version of git that does not recognizes the won't care and continue
to use the repository as is.  The mapping functionality simply won't be
present.

Similarly if all of the objects are not mapped this could cause
some practical difficulties but it will not cause anything to perform
the wrong actions to the repository.  Some commands just won't work.
In the worst case all that needs to happen is for the compatibilty
maps to be rebuilt.

So let's use an option that forces unnecessary breakage of existing
tools.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 .../technical/hash-function-transition.txt       | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Comments

brian m. carlson Sept. 10, 2023, 2:34 p.m. UTC | #1
On 2023-09-08 at 23:10:19, Eric W. Biederman wrote:
> Ir makes a lot of sense for the hash algorithm that determines how all

Minor nit: "It".

> diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
> index 4b937480848a..10572c5794f9 100644
> --- a/Documentation/technical/hash-function-transition.txt
> +++ b/Documentation/technical/hash-function-transition.txt
> @@ -148,14 +148,14 @@ Detailed Design
>  Repository format extension
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  A SHA-256 repository uses repository format version `1` (see
> -Documentation/technical/repository-version.txt) with extensions
> -`objectFormat` and `compatObjectFormat`:
> +Documentation/technical/repository-version.txt) with the extension
> +`objectFormat`, and an optional core.compatMap configuration.
>  
>  	[core]
>  		repositoryFormatVersion = 1
> +		compatMap = on
>  	[extensions]
>  		objectFormat = sha256
> -		compatObjectFormat = sha1

While I'm in favour of an approach that uses the compat map, the
situation we've implemented here doesn't specify the extra hash
algorithm.  We want this approach to work just as well for moving from
SHA-1 to SHA-256 as it might for a future transition from SHA-256 to,
say, SHA-3-512, if that becomes necessary.

Making a future transition easier has been a goal of my SHA-256 work
(because who wants to write several hundred patches in such a case?), so
my hope is we can keep that here as well by explicitly naming the
algorithm we're using.

I also wonder if an approach that doesn't use an extension is going to
be helpful.  Say, that I have a repository that is using Git 3.x, which
supports interop, but I also need to use Git 2.x, which does not.  While
it's true that Git 2.x can read my SHA-256 repository, it won't write
the appropriate objects into the map, and thus it will be practically
very difficult to actually use Git 3.x to push data to a repository of a
different hash function.  We might well prefer to have Git 2.x not work
with the repository at all rather than have incomplete data preventing
us from, well, interoperating.
Eric W. Biederman Sept. 10, 2023, 6 p.m. UTC | #2
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2023-09-08 at 23:10:19, Eric W. Biederman wrote:
>> Ir makes a lot of sense for the hash algorithm that determines how all
>
> Minor nit: "It".
>
>> diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
>> index 4b937480848a..10572c5794f9 100644
>> --- a/Documentation/technical/hash-function-transition.txt
>> +++ b/Documentation/technical/hash-function-transition.txt
>> @@ -148,14 +148,14 @@ Detailed Design
>>  Repository format extension
>>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>  A SHA-256 repository uses repository format version `1` (see
>> -Documentation/technical/repository-version.txt) with extensions
>> -`objectFormat` and `compatObjectFormat`:
>> +Documentation/technical/repository-version.txt) with the extension
>> +`objectFormat`, and an optional core.compatMap configuration.
>>  
>>  	[core]
>>  		repositoryFormatVersion = 1
>> +		compatMap = on
>>  	[extensions]
>>  		objectFormat = sha256
>> -		compatObjectFormat = sha1
>
> While I'm in favour of an approach that uses the compat map, the
> situation we've implemented here doesn't specify the extra hash
> algorithm.  We want this approach to work just as well for moving from
> SHA-1 to SHA-256 as it might for a future transition from SHA-256 to,
> say, SHA-3-512, if that becomes necessary.
>
> Making a future transition easier has been a goal of my SHA-256 work
> (because who wants to write several hundred patches in such a case?), so
> my hope is we can keep that here as well by explicitly naming the
> algorithm we're using.
>
> I also wonder if an approach that doesn't use an extension is going to
> be helpful.  Say, that I have a repository that is using Git 3.x, which
> supports interop, but I also need to use Git 2.x, which does not.  While
> it's true that Git 2.x can read my SHA-256 repository, it won't write
> the appropriate objects into the map, and thus it will be practically
> very difficult to actually use Git 3.x to push data to a repository of a
> different hash function.  We might well prefer to have Git 2.x not work
> with the repository at all rather than have incomplete data preventing
> us from, well, interoperating.

First it is my hope that we can get a command such as "git gc" to scan
the repository and fill in all of the missing compatibility hashes.

Not so much for day to day work, but for people able to enable
compatibility hashes on an existing repository.  Enabling compatibility
hashes on a sha1 repository is going to be necessary to create a sha256
repository from it.  A depth first walk, or a topological sort of the
objects pretty much has to happen as a separate pass.  So it makes sense
just to require all of the objects have their compatibility hash
computed before attempting to generate a pack in the compatibility
format.

I say all of that and I feel silly.

The core and optimized path is what whatever receive pack does to deal
with a pack in the repositories compatibility format.  Once that is
built we can create a sha256 repository from a sha1 repository just
by cloning it, and letting receive-pack figure out the details.

Before we can generate a sha256 pack from a sha1 pack we still need
to compute the sha256 hash of every object, but that can be very
optimized and local to the case of receiving a non-native pack.  So a
repository that generates a compatibility hash for all of it's objects
is not necessary to transition to another hash algorithm.  All we need
is another repository in the other format.


That said there is value in being able to add compatibility hashes
to an existing repository.  The upstream repository can just convert
to the new hash function and all of the downstream repositories
can compute their compatibility hashes and convert when they are ready.

Basically once a git with transition support exists any repository can
convert at any time without creating a problem for other repositories.

In my head it seems cheaper/safer to compute the compatibility hash of
every object in an existing repository than it does to convert a
repository.  Is it?

I think that if the first pull from a repository in another format can
trigger the initial computation of the compatibility hash (like the
first use of a reverse index triggers the creation of the reverse
index), then it will definitely be easier to just enable compatibility
hashes in an existing repository.

The additional hash computation step every pull from upstream (even when
well optimized) should be an incentive for people to fully convert their
repositories after the upstream has converted.


That is when things get tricky and the transition plan has not talked
about.  There are references to existing oid's in email, bug trackers,
and commit comments.  Digging through the history and dealing with those
references is something that developers are going to need to do for the
rest of the life of a project.

Which means eventually we will need to support a mode where we have some
packs with a ``.compat'' index but we no longer compute or generate the
old hash for new objects.

In summary.  I agree that compatMap is likely insufficient. So far I
think it is too cheap/easy to generate the missing mappings to make it a
mandatory requirement that all operations always generate them.

I also agree that making the configuration resilient foreseeable future
demands is a good idea.

So I will push this change farther out in the patch series.

Eric
Junio C Hamano Sept. 11, 2023, 6:11 a.m. UTC | #3
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>> +Documentation/technical/repository-version.txt) with the extension
>> +`objectFormat`, and an optional core.compatMap configuration.
>>  
>>  	[core]
>>  		repositoryFormatVersion = 1
>> +		compatMap = on
>>  	[extensions]
>>  		objectFormat = sha256
>> -		compatObjectFormat = sha1
>
> While I'm in favour of an approach that uses the compat map, the
> situation we've implemented here doesn't specify the extra hash
> algorithm.  We want this approach to work just as well for moving from
> SHA-1 to SHA-256 as it might for a future transition from SHA-256 to,
> say, SHA-3-512, if that becomes necessary.
>
> Making a future transition easier has been a goal of my SHA-256 work
> (because who wants to write several hundred patches in such a case?), so
> my hope is we can keep that here as well by explicitly naming the
> algorithm we're using.
>
> I also wonder if an approach that doesn't use an extension is going to
> be helpful.  Say, that I have a repository that is using Git 3.x, which
> supports interop, but I also need to use Git 2.x, which does not.  While
> it's true that Git 2.x can read my SHA-256 repository, it won't write
> the appropriate objects into the map, and thus it will be practically
> very difficult to actually use Git 3.x to push data to a repository of a
> different hash function.  We might well prefer to have Git 2.x not work
> with the repository at all rather than have incomplete data preventing
> us from, well, interoperating.

Very sensible line of thought and suggestion to move the topic
forward.  Very much appreciated.
diff mbox series

Patch

diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
index 4b937480848a..10572c5794f9 100644
--- a/Documentation/technical/hash-function-transition.txt
+++ b/Documentation/technical/hash-function-transition.txt
@@ -148,14 +148,14 @@  Detailed Design
 Repository format extension
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 A SHA-256 repository uses repository format version `1` (see
-Documentation/technical/repository-version.txt) with extensions
-`objectFormat` and `compatObjectFormat`:
+Documentation/technical/repository-version.txt) with the extension
+`objectFormat`, and an optional core.compatMap configuration.
 
 	[core]
 		repositoryFormatVersion = 1
+		compatMap = on
 	[extensions]
 		objectFormat = sha256
-		compatObjectFormat = sha1
 
 The combination of setting `core.repositoryFormatVersion=1` and
 populating `extensions.*` ensures that all versions of Git later than
@@ -682,7 +682,7 @@  Some initial steps can be implemented independently of one another:
 - adding support for the PSRC field and safer object pruning
 
 The first user-visible change is the introduction of the objectFormat
-extension (without compatObjectFormat). This requires:
+extension. This requires:
 
 - teaching fsck about this mode of operation
 - using the hash function API (vtable) when computing object names
@@ -690,7 +690,7 @@  extension (without compatObjectFormat). This requires:
 - rejecting attempts to fetch from or push to an incompatible
   repository
 
-Next comes introduction of compatObjectFormat:
+Next comes introduction of compatMap:
 
 - implementing the loose-object-idx
 - translating object names between object formats
@@ -724,9 +724,9 @@  Over time projects would encourage their users to adopt the "early
 transition" and then "late transition" modes to take advantage of the
 new, more futureproof SHA-256 object names.
 
-When objectFormat and compatObjectFormat are both set, commands
-generating signatures would generate both SHA-1 and SHA-256 signatures
-by default to support both new and old users.
+When objectFormat and compatMap are both set, commands generating
+signatures would generate both SHA-1 and SHA-256 signatures by default
+to support both new and old users.
 
 In projects using SHA-256 heavily, users could be encouraged to adopt
 the "post-transition" mode to avoid accidentally making implicit use