Message ID | 20230908231049.2035003-1-ebiederm@xmission.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | SHA256 and SHA1 interoperability | expand |
On 2023-09-08 at 23:10:18, Eric W. Biederman wrote: > The v3 pack index file as documented has a lot of complexity making it > difficult to implement correctly. I worked with bryan's preliminary > implementation and it took several passes to get the bugs out. > > The complexity also requires multiple table look-ups to find all of > the information that is needed to translate from one kind of oid to > another. Which can't be good for cache locality. > > Even worse coming up with a new index file version requires making > changes that have the potentialy to break anything that uses the index > of a pack file. > > Instead of continuing to deal with the chance of braking things > besides the oid mapping functionality, the additional complexity in > the file format, and worry if the performance would be reasonable I > stripped down the problem to it's fundamental complexity and came up > with a file format that is exactly about mapping one kind of oid to > another, and only supports two kinds of oids. I think this is a fine approach, and as I'm sure you noticed from my series, it's a lot more robust than trying to implement pack v3. I'd be fine with going with this approach instead of pack v3.
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > On 2023-09-08 at 23:10:18, Eric W. Biederman wrote: >> The v3 pack index file as documented has a lot of complexity making it >> difficult to implement correctly. I worked with bryan's preliminary >> implementation and it took several passes to get the bugs out. >> >> The complexity also requires multiple table look-ups to find all of >> the information that is needed to translate from one kind of oid to >> another. Which can't be good for cache locality. >> >> Even worse coming up with a new index file version requires making >> changes that have the potentialy to break anything that uses the index >> of a pack file. >> >> Instead of continuing to deal with the chance of braking things >> besides the oid mapping functionality, the additional complexity in >> the file format, and worry if the performance would be reasonable I >> stripped down the problem to it's fundamental complexity and came up >> with a file format that is exactly about mapping one kind of oid to >> another, and only supports two kinds of oids. > > I think this is a fine approach, and as I'm sure you noticed from my > series, it's a lot more robust than trying to implement pack v3. I'd be > fine with going with this approach instead of pack v3. I think I got your pack v3 working but it was at a minimum a serious distraction. I worry a little bit that this might leave some performance on the table, with something like a 256 way jump table like we have in the index file. Still I figure we can start simple and when we start optimizing and profiling we can revisit the format if it shows up as a performance issue. Eric
On 2023-09-08 at 23:10:18, Eric W. Biederman wrote: > The v3 pack index file as documented has a lot of complexity making it > difficult to implement correctly. I worked with bryan's preliminary > implementation and it took several passes to get the bugs out. > > The complexity also requires multiple table look-ups to find all of > the information that is needed to translate from one kind of oid to > another. Which can't be good for cache locality. > > Even worse coming up with a new index file version requires making > changes that have the potentialy to break anything that uses the index > of a pack file. > > Instead of continuing to deal with the chance of braking things > besides the oid mapping functionality, the additional complexity in > the file format, and worry if the performance would be reasonable I > stripped down the problem to it's fundamental complexity and came up > with a file format that is exactly about mapping one kind of oid to > another, and only supports two kinds of oids. > > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > .../technical/hash-function-transition.txt | 40 +++++++++++++++++++ > 1 file changed, 40 insertions(+) > > diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt > index ed574810891c..4b937480848a 100644 > --- a/Documentation/technical/hash-function-transition.txt > +++ b/Documentation/technical/hash-function-transition.txt > @@ -209,6 +209,46 @@ format described in linkgit:gitformat-pack[5], just like > today. The content that is compressed and stored uses SHA-256 content > instead of SHA-1 content. > > +Per Pack Mapping Table > +~~~~~~~~~~~~~~~~~~~~~~ > +A pack compat map file (.compat) files have the following format: > + > +HEADER: > + 4-byte signature: > + The signature is: {'C', 'M', 'A', 'P'} > + 1-byte version number: > + Git only writes or recognizes version 1. > + 1-byte First Object Id Version > + We infer the length of object IDs (OIDs) from this value: > + 1 => SHA-1 > + 2 => SHA-256 One thing I forgot to mention here, is that we have 32-bit format IDs for these in the structure, so we should use them here and below. These are GIT_SHA1_FORMAT_ID and GIT_SHA256_FORMAT_ID. Not that I would encourage distributing such software, but it makes it much easier for people to experiment with additional hash algorithms (in terms of performance, etc.) if we make the space a little sparser. > + 1-byte Second Object Id Version > + We infer the length of object IDs (OIDs) from this value: > + 1 => SHA-1 > + 2 => SHA-256 In your new patch for the next part, you consider that there might be multiple compatibility hash algorithms. I had anticipated only one at a time in my series, but I'm not opposed to multiple if you want to support that. However, here you're making the assumption that there are only two. If you want to support multiple values, we need to explicitly consider that both here (where we need a count of object ID version and multiple tables, one for each algorithm), and in the follow-up series. I had not considered more than two algorithms because it substantially complicates the code and requires us to develop n*(n-1) tables, but I'm not the one volunteering to do most of the work here, so I'll defer to your preference. (I do intend to send a patch or two, though.) It's also possible we could be somewhat provident and define the on-disk formats for multiple algorithms and then punt on the code until later if you prefer that.
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > On 2023-09-08 at 23:10:18, Eric W. Biederman wrote: >> The v3 pack index file as documented has a lot of complexity making it >> difficult to implement correctly. I worked with bryan's preliminary >> implementation and it took several passes to get the bugs out. >> >> The complexity also requires multiple table look-ups to find all of >> the information that is needed to translate from one kind of oid to >> another. Which can't be good for cache locality. >> >> Even worse coming up with a new index file version requires making >> changes that have the potentialy to break anything that uses the index >> of a pack file. >> >> Instead of continuing to deal with the chance of braking things >> besides the oid mapping functionality, the additional complexity in >> the file format, and worry if the performance would be reasonable I >> stripped down the problem to it's fundamental complexity and came up >> with a file format that is exactly about mapping one kind of oid to >> another, and only supports two kinds of oids. >> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> >> --- >> .../technical/hash-function-transition.txt | 40 +++++++++++++++++++ >> 1 file changed, 40 insertions(+) >> >> diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt >> index ed574810891c..4b937480848a 100644 >> --- a/Documentation/technical/hash-function-transition.txt >> +++ b/Documentation/technical/hash-function-transition.txt >> @@ -209,6 +209,46 @@ format described in linkgit:gitformat-pack[5], just like >> today. The content that is compressed and stored uses SHA-256 content >> instead of SHA-1 content. >> >> +Per Pack Mapping Table >> +~~~~~~~~~~~~~~~~~~~~~~ >> +A pack compat map file (.compat) files have the following format: >> + >> +HEADER: >> + 4-byte signature: >> + The signature is: {'C', 'M', 'A', 'P'} >> + 1-byte version number: >> + Git only writes or recognizes version 1. >> + 1-byte First Object Id Version >> + We infer the length of object IDs (OIDs) from this value: >> + 1 => SHA-1 >> + 2 => SHA-256 > > One thing I forgot to mention here, is that we have 32-bit format IDs > for these in the structure, so we should use them here and below. These > are GIT_SHA1_FORMAT_ID and GIT_SHA256_FORMAT_ID. > > Not that I would encourage distributing such software, but it makes it > much easier for people to experiment with additional hash algorithms (in > terms of performance, etc.) if we make the space a little sparser. Unfortunately that ship has already sailed. If you look at pack reverse indices, pack mtime files, multi-pack-index files, they all use an oid_version field. So to experiment with a new hash function a new number has to be picked. The only use I can find of your 4 byte format_id's is in the reftable code. Using a 4 byte magic number in this case also conflicts with basic simplicity. With a one byte field I can specify it easily, and read it back with no special tools, and understand what it means at a glance. I admit I can only understand what a oid version field means at a glance because the variation of object id's is low, but that is fundamental. We require global agreement on names. Fundamentally git can not support many object id transitions. Names are just too expensive. When I come to how the map file is specified a single byte has real advantages. A single byte never needs byte swapping. So it won't be misread. Using a single byte for each format allows me to keep the header for the file at 16 bytes. Which guarantees good alignment of everything in the file without having to be clever. All of this is for a file that is strictly local and the entire function of the bytes is a sanity check to make certain that something weird is not going on, or to assist recover if something bad happens. So in this case I don't see any additional agility provided by longer names helping. >> + 1-byte Second Object Id Version >> + We infer the length of object IDs (OIDs) from this value: >> + 1 => SHA-1 >> + 2 => SHA-256 > > In your new patch for the next part, you consider that there might be > multiple compatibility hash algorithms. I had anticipated only one at > a time in my series, but I'm not opposed to multiple if you want to > support that. > > However, here you're making the assumption that there are only two. If > you want to support multiple values, we need to explicitly consider that > both here (where we need a count of object ID version and multiple > tables, one for each algorithm), and in the follow-up series. > > I had not considered more than two algorithms because it substantially > complicates the code and requires us to develop n*(n-1) tables, but I'm > not the one volunteering to do most of the work here, so I'll defer to > your preference. (I do intend to send a patch or two, though.) > > It's also possible we could be somewhat provident and define the on-disk > formats for multiple algorithms and then punt on the code until later if > you prefer that. In the long term I anticipate people disabling compatObjectFormat and switching to readCompatMap so they still have access to their old objects by their original names, but they don't have the over head of computing a compatibility hash. In a world where there is a transition to futureHash I anticipate the files associated with an old pack looking something like: pack-abcdefg.compat12 pack-abcdefg.compat32 For a repository still using hash version sha256 for storage, with a mapping to some sha1 names, and a mapping of everything to new names for compatibility with futureHash. After transitioning to the futureHash those files would look like: pack-abcdefg.compat13 pack-abcdefg.compat23 I deeply and fundamentally care about having some way to look up old names because I do that all of the time. In my work on the linux-kernel I have found myself frequently digging into old issues. I have on my hard drive tglx's git import of the old bitkeeper tree. I also have an import of all of the old kernel releases into git from before the code was stored in bitkeeper. I find myself actually using all of those trees when digging into issues. So I think the idea that we will ever be able to get rid of the mapping for old converted repositories is unlikely. We have entirely too many references out there. Which means that for every hash format conversion a repository goes through I am going to have another collection of old names. I don't honestly anticipate ever needing to have multiple compatObjectFormat entries specified for a single repository. I do agree that if we are going to worry about forward and backward compatibility we should be robust and have a configuration file syntax that can handle the possibility. I do very much anticipate needing to have multiple readCompatMap entries, and pretty much only using them in get_short_oid in object-name.c. It will make the loop in find_short_packed_compat_object a little longer but that is about all that will need to be implemented and maintained long term. I view this compat map format a lot like the loose objects. It is simple and good enough to get us started. If it turns out we need to optimize it's simplicity means all of the interfaces in the code to use it have already been built, and we can just concentrate on optimizing. Eric
diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt index ed574810891c..4b937480848a 100644 --- a/Documentation/technical/hash-function-transition.txt +++ b/Documentation/technical/hash-function-transition.txt @@ -209,6 +209,46 @@ format described in linkgit:gitformat-pack[5], just like today. The content that is compressed and stored uses SHA-256 content instead of SHA-1 content. +Per Pack Mapping Table +~~~~~~~~~~~~~~~~~~~~~~ +A pack compat map file (.compat) files have the following format: + +HEADER: + 4-byte signature: + The signature is: {'C', 'M', 'A', 'P'} + 1-byte version number: + Git only writes or recognizes version 1. + 1-byte First Object Id Version + We infer the length of object IDs (OIDs) from this value: + 1 => SHA-1 + 2 => SHA-256 + 1-byte Second Object Id Version + We infer the length of object IDs (OIDs) from this value: + 1 => SHA-1 + 2 => SHA-256 + 1-byte reserved (must be zero) + 4-byte number of objects names contained in this mapping + 1-byte length in bytes of shorted object names for the first object id. + This is the shortest possible length needed to make the + first object names unambigious. + 1-byte reserved (must be zero) + 1-byte length in bytes of shorted object names for the second object id. + This is the shortest possible length needed to make the + second object names unambigious. + 1-byte reserved (must be zero) + +OBJECT NAME TABLES: + [Object name raw length + 4]*Number of object names + This table is sorted by object name + Each entry in the table is formated as: + [20 or 32 byte] Object name + 4-byte index into the other object name table + +TRAILER: + checksum of the corresponding packfile, and + + checksum of all of the above. + Pack index ~~~~~~~~~~ Pack index (.idx) files use a new v3 format that supports multiple
The v3 pack index file as documented has a lot of complexity making it difficult to implement correctly. I worked with bryan's preliminary implementation and it took several passes to get the bugs out. The complexity also requires multiple table look-ups to find all of the information that is needed to translate from one kind of oid to another. Which can't be good for cache locality. Even worse coming up with a new index file version requires making changes that have the potentialy to break anything that uses the index of a pack file. Instead of continuing to deal with the chance of braking things besides the oid mapping functionality, the additional complexity in the file format, and worry if the performance would be reasonable I stripped down the problem to it's fundamental complexity and came up with a file format that is exactly about mapping one kind of oid to another, and only supports two kinds of oids. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- .../technical/hash-function-transition.txt | 40 +++++++++++++++++++ 1 file changed, 40 insertions(+)