Message ID | 20230426205324.326501-3-sandals@crustytoothpaste.net (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Fix empty SHA-256 clones with v0 and v1 | expand |
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > However, up until 8b214c2e9d ("clone: propagate object-format when > cloning from void", 2023-04-05), we honored GIT_DEFAULT_HASH in this > case, so let's continue to do that. Let's not. Once we identified the bug of mistakenly honoring a wrong variable, let's fix it and keep it fixed. The local optimization, if necessary, can be taught to peek the source repository and propagating the object format selection to the destination repository.
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > `GIT_DEFAULT_HASH`:: > If this variable is set, the default hash algorithm for new > repositories will be set to this value. This value is currently > + ignored when cloning if the remote value can be definitively > + determined; the setting of the remote repository is used > + instead. The value is honored if the remote repository's > + algorithm cannot be determined, such as some cases when > + the remote repository is empty. The default is "sha1". > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format` > + in linkgit:git-init[1]. We'd need to evantually cover all the transports (and non-transport like the "--local" optimization) so that the object-format and other choices are communicated from the origin to a new clone anyway, so this extra complexity "until X is fixed, it behaves this way, but otherwise the variable is read in the meantime" may be a disservice to the end users, even though it may make it easier in the shorter term for maintainers of programs that rely on the buggy "git clone" that partially honored this environment variable. In short, I am still not convinced that the above is a good design choice in the longer term. Thanks.
On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote: > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > `GIT_DEFAULT_HASH`:: > > If this variable is set, the default hash algorithm for new > > repositories will be set to this value. This value is currently > > + ignored when cloning if the remote value can be definitively > > + determined; the setting of the remote repository is used > > + instead. The value is honored if the remote repository's > > + algorithm cannot be determined, such as some cases when > > + the remote repository is empty. The default is "sha1". > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format` > > + in linkgit:git-init[1]. > > We'd need to evantually cover all the transports (and non-transport > like the "--local" optimization) so that the object-format and other > choices are communicated from the origin to a new clone anyway, so > this extra complexity "until X is fixed, it behaves this way, but > otherwise the variable is read in the meantime" may be a disservice > to the end users, even though it may make it easier in the shorter > term for maintainers of programs that rely on the buggy "git clone" > that partially honored this environment variable. > > In short, I am still not convinced that the above is a good design > choice in the longer term. I also think it is working against the backwards-compatible design of the hash function transition. If we do not see an object-format line from the remote, then either: 1. They sent us capabilities, but it did not include object-format. So if we are in GIT_DEFAULT_HASH=sha256 mode locally, but the other side is an older version of Git (or even a current version of other implementations, like Dulwich) that do not send object-format at all, then we will not correctly fall back to assuming they are sha1. In a non-empty repo, this means we'll fail to parse their ref advertisement (we'll expect sha256 hashes but get sha1), and cloning will be broken. 2. They did not send us capabilities, because the repo is empty (and the server does not have brian's patch 1). The hash transition doc says we're supposed to assume they're sha1. It's _sort of_ academic, in that they also are not telling us about any refs on their side. But we may end up mis-matched with the server (again, this is the 50/50 thing; we don't know what their format is). Presumably that bites us later when we try to push up new objects (but would eventually work when we support interop). I think handling (2) is iffy as a goal, but the collateral damage of (1) is a complete show-stopper for this patch. If we wanted to do (2) by itself, we'd have to distinguish "did they even send us a capabilities line" as a separate case (but I tend to agree with you that it is not worth doing for now). -Peff
Hi, Changing the subject as this message seems like a different topic. Jeff King wrote: > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote: > > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > > > `GIT_DEFAULT_HASH`:: > > > If this variable is set, the default hash algorithm for new > > > repositories will be set to this value. This value is currently > > > + ignored when cloning if the remote value can be definitively > > > + determined; the setting of the remote repository is used > > > + instead. The value is honored if the remote repository's > > > + algorithm cannot be determined, such as some cases when > > > + the remote repository is empty. The default is "sha1". > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format` > > > + in linkgit:git-init[1]. > > > > We'd need to evantually cover all the transports (and non-transport > > like the "--local" optimization) so that the object-format and other > > choices are communicated from the origin to a new clone anyway, so > > this extra complexity "until X is fixed, it behaves this way, but > > otherwise the variable is read in the meantime" may be a disservice > > to the end users, even though it may make it easier in the shorter > > term for maintainers of programs that rely on the buggy "git clone" > > that partially honored this environment variable. > > > > In short, I am still not convinced that the above is a good design > > choice in the longer term. > > I also think it is working against the backwards-compatible design of > the hash function transition. To be honest this whole approach seems to be completely flawed to me and against the whole design of git in the first place. In a recent email Linus Torvalds explained why object ids were calculated based {type, size, data} [1], and he explained very clearly that two objects with exactly the same data are not supposed to have the same id if the type is different. If even the tiniest change such as adding a period to a commit messange changes the object id (and thus semantically makes it a different object), then it makes sense that changing the type of an object also changes the object id (and thus it's also a different object). And because the id of the parent is included in the content of every commit, the top-level id ensures the integrity of the whole graph. But then comes this notion that the hash algorithm is a property of the repository, and not part of the object storage, which means changing the whole hash algorithm of a repository is considered less of a change than adding a period to the commit message, worse: not a change at all. I am reminded of the warning Sam Smith gave to the Git project [2] which seemed to be unheard, but the notion of cryptographic algorithm agility makes complete sense to me. In my view one repository should be able to have part SHA-1 history, part SHA3-256 history, and part BLAKE2b history. Changing the hash algorithm of one commit should change the object id of that commit, and thus make it semantically a different commit. In other words: an object of type "blob" should never be confused with an object of type "blob:sha-256", even if the content is exactly the same. The fact that apparently it's so easy to clone a repository with the wrong hash algorithm should give developers pause, as it means the whole point of using cryptographic hash algorithms to ensure the integrity of the commit history is completely gone. I have not been following the SHA-1 -> OID discussions, but I distinctively recall Linus Torvalds mentioning that the choice of using SHA-1 wasn't even for security purposes, it was to ensure integrity. When I do a `git fetch` as long as the new commits have the same SHA-1 as parent as the SHA-1s I have in my repository I can be relatively certain the repository has not been tampered with. Which means that if I do a `git fetch` that suddenly brings SHA-256 commits, some of them must have SHA-1 parents that match the ones I currently have. Otherwise how do I know it's the same history? Maybe that's one of the reasons people don't seem particularly eager to move away from SHA-1: Better the SHA-1 you know, than the SHA-256 you don't. Cheers. [1] https://lore.kernel.org/git/CAHk-=wjr-CMLX2Jo2++rwcv0VNr+HmZqXEVXNsJGiPRUwNxzBQ@mail.gmail.com/ [2] https://lore.kernel.org/git/D433038A-2643-4F63-8677-CA8AB6904AE1@samuelsmith.org/
On 5/3/23 01:46, Felipe Contreras wrote: > To be honest this whole approach seems to be completely flawed to me and > against the whole design of git in the first place. The discussion above is mostly moot now since this has been fixed in later patches in this thread, AFAIK. It's also moot for other reasons, like the hash function transition plan is not really implemented, yet. Also, this was about corner-case, like it often is. > In a recent email Linus Torvalds explained why object ids were > calculated based {type, size, data} [1], and he explained very clearly > that two objects with exactly the same data are not supposed to have the > same id if the type is different. This is different. But aside, type + size + data are not really much different from just having data in a hash function. There are plenty of hash collisions where HASH(type + size + data) == HASH(type + size + data') by definition of how these functions work. The problem is always in finding these collisions. But anyway... > In my view one repository should be able to have part SHA-1 history, > part SHA3-256 history, and part BLAKE2b history. Yes, that would be great. Please provide patch series for this :-) > I have not been following the SHA-1 -> OID discussions, but I > distinctively recall Linus Torvalds mentioning that the choice of using > SHA-1 wasn't even for security purposes, it was to ensure integrity. These are different sides of the same coin. Hashes are used to provide integrity. Hashes like MD4, MD5, SHA1, SHA256 are there for integrity. Some of these are no longer recommended and some are completely broken. > Better the SHA-1 you know, than the SHA-256 you don't. Wrong conclusion ;) Also, we know SHA-256 The problem in git-core and virtually all clients and other implementations is/was that SHA1 was hardcoded and assumed to be THE ONE and ONLY hash. It will take quite a bit of work outside of git-core to remove this one assumption (remember two digit year and 2000? - yes I'm old). Once this hash assumption is removed, you can start talking about adding other hashes and interop. Keep in mind -- hashes are there for object reference. They are the glue in git. But there is really nothing stopping us from recalculating them "on the fly". If you have SHA1 repo, you can calculate a SHA256 or whatever hash for any type object. That's not the problem, conceptually speaking. Finally, let not have a "bike shed" discussion about this. The GIT_DEFAULT_HASH is meant to be used by `git init` in-lieu of --object-format parameter, so it's not flawed. When used in other applications, it probably indicates a bug. But we can't fix all the bugs at once :-) Cheers, - Adam
On Wed, 3 May 2023 at 02:17, Felipe Contreras <felipe.contreras@gmail.com> wrote: > > Hi, > > Changing the subject as this message seems like a different topic. > > Jeff King wrote: > > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote: > > > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > > > > > `GIT_DEFAULT_HASH`:: > > > > If this variable is set, the default hash algorithm for new > > > > repositories will be set to this value. This value is currently > > > > + ignored when cloning if the remote value can be definitively > > > > + determined; the setting of the remote repository is used > > > > + instead. The value is honored if the remote repository's > > > > + algorithm cannot be determined, such as some cases when > > > > + the remote repository is empty. The default is "sha1". > > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format` > > > > + in linkgit:git-init[1]. > > > > > > We'd need to evantually cover all the transports (and non-transport > > > like the "--local" optimization) so that the object-format and other > > > choices are communicated from the origin to a new clone anyway, so > > > this extra complexity "until X is fixed, it behaves this way, but > > > otherwise the variable is read in the meantime" may be a disservice > > > to the end users, even though it may make it easier in the shorter > > > term for maintainers of programs that rely on the buggy "git clone" > > > that partially honored this environment variable. > > > > > > In short, I am still not convinced that the above is a good design > > > choice in the longer term. > > > > I also think it is working against the backwards-compatible design of > > the hash function transition. > > To be honest this whole approach seems to be completely flawed to me and > against the whole design of git in the first place. > > In a recent email Linus Torvalds explained why object ids were > calculated based {type, size, data} [1], and he explained very clearly > that two objects with exactly the same data are not supposed to have the > same id if the type is different. He said: --- quote-begin --- The "no aliasing" means that no two distinct pointers can point to the same data. So a tagged pointer of type "commit" can not point to the same object as a tagged pointer of type "blob". They are distinct pointers, even if (maybe) the commit object encoding ends up then being identical to a blob object. --- quote-end --- As far as I could tell he didn't really explain *why* he wanted this, and IMO it is non-obvious why he would care if a blob and a commit had the same text, and thus the same ID. He just said he didnt want it to happen, not why. I can imagine some aesthetic reasons why you might want to ensure that no blob has the same ID as a commit, and I can imagine it might make debugging easier at certain points, but it seems unnecessary given the data is write once. > If even the tiniest change such as adding a period to a commit messange > changes the object id (and thus semantically makes it a different > object), then it makes sense that changing the type of an object also > changes the object id (and thus it's also a different object). > > And because the id of the parent is included in the content of every > commit, the top-level id ensures the integrity of the whole graph. > > But then comes this notion that the hash algorithm is a property of the > repository, and not part of the object storage, which means changing the > whole hash algorithm of a repository is considered less of a change than > adding a period to the commit message, worse: not a change at all. I really dont understand why you think having two hash functions producing different results for the same data is comparable to a single hash producing different results for different data. In one case you have two different continuum of identifiers, with one ID per continuum, and in the other you have two different identifiers in the same continuum, and if you a continuum you would have 4 different identifiers right? Eg, the two cases are really quite different at a fundamental level. > I am reminded of the warning Sam Smith gave to the Git project [2] which > seemed to be unheard, but the notion of cryptographic algorithm agility > makes complete sense to me. > > In my view one repository should be able to have part SHA-1 history, > part SHA3-256 history, and part BLAKE2b history. Isn't this orthagonal to your other points? > Changing the hash algorithm of one commit should change the object id of > that commit, and thus make it semantically a different commit. > > In other words: an object of type "blob" should never be confused with > an object of type "blob:sha-256", even if the content is exactly the > same. This doesn't make sense to me. As long as we can distinguish the hashes produced by the different hash functions in use we can create a mapping of the data that is hashed such that we have a 1:1 mapping of identifiers of each type at which point it really doesn't matter which hash function is used. > The fact that apparently it's so easy to clone a repository with > the wrong hash algorithm should give developers pause, as it means the > whole point of using cryptographic hash algorithms to ensure the > integrity of the commit history is completely gone. This is a leap too far. The fact that it is "so easy to clone a repo with the wrong hash algorithm" is completely orthogonal to the fundamental principles of hash identifiers from strong hash functions. You seem to be deriving grand conclusions from what sounds to me like a simple bug/design-oversight. > I have not been following the SHA-1 -> OID discussions, but I > distinctively recall Linus Torvalds mentioning that the choice of using > SHA-1 wasn't even for security purposes, it was to ensure integrity. > When I do a `git fetch` as long as the new commits have the same SHA-1 > as parent as the SHA-1s I have in my repository I can be relatively > certain the repository has not been tampered with. Which means that if I > do a `git fetch` that suddenly brings SHA-256 commits, some of them must > have SHA-1 parents that match the ones I currently have. Otherwise how > do I know it's the same history? So consider what /could/ happen here. You fetch a commit which uses SHA-256 into a repo where all of your local commits use SHA-1. The commit you fetched says its parent is some SHA-256 ID you don't know about as all your ID's are SHA-1. So git then could go and construct an index, hashing each item using SHA-256 instead of SHA-1, and using the result to build a bi-directional mapping from SHA-1 to SHA-256 and back. All it has to do then is look into the mapping to find if the SHA-256 parent id is present in your repo. If it is then you know it's the same history. The key point here is that if you ignore SHAttered artifacts (which seems reasonable as you can detect the attack during hashing) you can build a 1:1 map of SHA-1 and SHA-256 ids. Once you have that mapping it doesn't matter which ID is used. > Maybe that's one of the reasons people don't seem particularly eager to > move away from SHA-1: Maybe, but it doesn't make sense to me. You seem to be putting undue weight on an unnecessary aspect of the git design: there doesn't seem to be a reason for Linuses "no aliasing" policy, and it seems like one could build a git-a-like without it without suffering any significant penalties. Regardless, provided that the hash functions allow a 1:1 mapping of ID's (which is assumed by using "collision free hash functions"), it seems like it really doesn't matter which hash is used at any given time. cheers, Yves
Adam Majer wrote: > On 5/3/23 01:46, Felipe Contreras wrote: > > To be honest this whole approach seems to be completely flawed to me and > > against the whole design of git in the first place. > > The discussion above is mostly moot now since this has been fixed in > later patches in this thread, AFAIK. That particular isssue might be fixed, but that issue should never have happened in the first place if the design was correct. A bad design makes certain errors prone to happen, a good design makes the same errors happen rarely, a great design makes those errors impossible. Git was designed to make it *impossible* to confuse two commits with similar data. The symptom might have been fixed, that doesn't mean there's no underlying problem. > It's also moot for other reasons, like the hash function transition plan is > not really implemented, yet. The implemention of the plan isn't the problem, it's the plan itself. > Also, this was about corner-case, like it often is. A corner-case that should be impossible. > > In a recent email Linus Torvalds explained why object ids were > > calculated based {type, size, data} [1], and he explained very clearly > > that two objects with exactly the same data are not supposed to have the > > same id if the type is different. > > This is different. But aside, type + size + data are not really much > different from just having data in a hash function. It's completely different. > There are plenty of hash collisions where > > HASH(type + size + data) == HASH(type + size + data') > > by definition of how these functions work. The problem is always in > finding these collisions. But anyway... I don't think you understand why Linus Torvalds chose to hash objects. > > In my view one repository should be able to have part SHA-1 history, > > part SHA3-256 history, and part BLAKE2b history. > > Yes, that would be great. Please provide patch series for this :-) I have hundreds of patches being ignored, why would I write yet another patch series that will be ignored? > > I have not been following the SHA-1 -> OID discussions, but I > > distinctively recall Linus Torvalds mentioning that the choice of using > > SHA-1 wasn't even for security purposes, it was to ensure integrity. > > These are different sides of the same coin. Hashes are used to provide > integrity. Hashes like MD4, MD5, SHA1, SHA256 are there for integrity. > Some of these are no longer recommended and some are completely broken. There are different philosophical views of what "security" means, and it seems pretty clear to me that your view does not align with the view of Linus Torvalds. > > Better the SHA-1 you know, than the SHA-256 you don't. > > Wrong conclusion ;) Also, we know SHA-256 You don't understand what is being said. Which hash is more trustworthy? a. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8 b. d891b12414e1d9331f8cbb15acfe690671974f27ba76e2b423294cfb7a055f2f If you answer b just beacuse it's SHA-256 you don't understand security. b is a random commit I generated, a is the current git.git master. A SHA-1 hash from a source you trust is inifinitely more trustworthy than a random SHA-256 hash. Even a known MD5 hash is better in this respect. > Keep in mind -- hashes are there for object reference. No. I don't think you understand why Linus Torvalds used hashes. > If you have SHA1 repo, you can calculate a SHA256 or whatever hash for any > type object. I know it *can* be done, I understand how hash algorithms work, but just because something *can* be done doesn't mean it *should*. You *can* generate a SHA-1 of a blob's data, instead of a SHA-1 of a blob's `type + size + data`, does that mean we should? No. > Finally, let not have a "bike shed" discussion about this. Discussing the original design of git's object storage which has withstood the test of time for 18 years is not "bike sheding". I don't even think you understand what I'm trying to say. --- Why do you think these commands generate different hashes? git hash-object -t blob /dev/null git hash-object -t tree /dev/null
On May 3, 2023 5:44:24 p.m. GMT+02:00, Felipe Contreras <felipe.contreras@gmail.com> wrote: >Git was designed to make it *impossible* to confuse two commits with similar >data. That was never ever the problem here. >> This is different. But aside, type + size + data are not really much >> different from just having data in a hash function. > >It's completely different. How so? Type and size are just about 2 and a dozen bits of data, respectfully. >There are different philosophical views of what "security" means, and it seems >pretty clear to me that your view does not align with the view of Linus >Torvalds. I'm not sure why you are name dropping Linus everywhere or assuming you know more than anyone here about hash functions. Your explanation is quite clear to me (and probably everyone else here). But I'll just leave it at that. Cheers, Adam
demerphq wrote: > On Wed, 3 May 2023 at 02:17, Felipe Contreras > <felipe.contreras@gmail.com> wrote: > > Changing the subject as this message seems like a different topic. > > Jeff King wrote: > > > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote: > > > > "brian m. carlson" <sandals@crustytoothpaste.net> writes: > > > > > > > > > `GIT_DEFAULT_HASH`:: > > > > > If this variable is set, the default hash algorithm for new > > > > > repositories will be set to this value. This value is currently > > > > > + ignored when cloning if the remote value can be definitively > > > > > + determined; the setting of the remote repository is used > > > > > + instead. The value is honored if the remote repository's > > > > > + algorithm cannot be determined, such as some cases when > > > > > + the remote repository is empty. The default is "sha1". > > > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format` > > > > > + in linkgit:git-init[1]. > > > > > > > > We'd need to evantually cover all the transports (and non-transport > > > > like the "--local" optimization) so that the object-format and other > > > > choices are communicated from the origin to a new clone anyway, so > > > > this extra complexity "until X is fixed, it behaves this way, but > > > > otherwise the variable is read in the meantime" may be a disservice > > > > to the end users, even though it may make it easier in the shorter > > > > term for maintainers of programs that rely on the buggy "git clone" > > > > that partially honored this environment variable. > > > > > > > > In short, I am still not convinced that the above is a good design > > > > choice in the longer term. > > > > > > I also think it is working against the backwards-compatible design of > > > the hash function transition. > > > > To be honest this whole approach seems to be completely flawed to me and > > against the whole design of git in the first place. > > > > In a recent email Linus Torvalds explained why object ids were > > calculated based {type, size, data} [1], and he explained very clearly > > that two objects with exactly the same data are not supposed to have the > > same id if the type is different. > > He said: > > --- quote-begin --- > The "no aliasing" means that no two distinct pointers can point to the > same data. So a tagged pointer of type "commit" can not point to the > same object as a tagged pointer of type "blob". They are distinct > pointers, even if (maybe) the commit object encoding ends up then > being identical to a blob object. > --- quote-end --- > > As far as I could tell he didn't really explain *why* he wanted this, > and IMO it is non-obvious why he would care if a blob and a commit had > the same text, and thus the same ID. He just said he didnt want it to > happen, not why. But we don't need to understand why to know it's part of the core design. If something is part of the core design as a rule it's better to not mess with it. > I can imagine some aesthetic reasons why you might want to ensure that > no blob has the same ID as a commit, and I can imagine it might make > debugging easier at certain points, but it seems unnecessary given the > data is write once. I don't know, but to me separating objects makes sense not just conceptually, but in practice there's a whole class of potential errors that could be avoided. For example, I can think of an implementation of `git prune` that would check commits first, then trees, then blobs, and the blobs that not reachable from any trees are removed. But if a commit can have the same id as a blob, you have to think of a different implementation. If that's not possible, then you just forget about those potential issues. > > If even the tiniest change such as adding a period to a commit messange > > changes the object id (and thus semantically makes it a different > > object), then it makes sense that changing the type of an object also > > changes the object id (and thus it's also a different object). > > > > And because the id of the parent is included in the content of every > > commit, the top-level id ensures the integrity of the whole graph. > > > > But then comes this notion that the hash algorithm is a property of the > > repository, and not part of the object storage, which means changing the > > whole hash algorithm of a repository is considered less of a change than > > adding a period to the commit message, worse: not a change at all. > > I really dont understand why you think having two hash functions > producing different results for the same data is comparable to a > single hash producing different results for different data. That depends on what you consider the "data" to be. If you consider the content of a blob to be the data, then you wouldn't have different results if a commit has the same data: it would be the same id. If instead you consider the data to be `type+content`, then you would have different results. > In one case you have two different continuum of identifiers, with one > ID per continuum, and in the other you have two different identifiers > in the same continuum, and if you a continuum you would have 4 > different identifiers right? Eg, the two cases are really quite > different at a fundamental level. That entirely depends on what data you hash. If you hash `algo+type+size+data` there's only one id per object. Period. > > I am reminded of the warning Sam Smith gave to the Git project [2] which > > seemed to be unheard, but the notion of cryptographic algorithm agility > > makes complete sense to me. > > > > In my view one repository should be able to have part SHA-1 history, > > part SHA3-256 history, and part BLAKE2b history. > > Isn't this orthagonal to your other points? Not if you consider changing the hash algorithm of a repository to be an important part of its history (more important than adding a period to a commit). > > Changing the hash algorithm of one commit should change the object id of > > that commit, and thus make it semantically a different commit. > > > > In other words: an object of type "blob" should never be confused with > > an object of type "blob:sha-256", even if the content is exactly the > > same. > > This doesn't make sense to me. As long as we can distinguish the > hashes produced by the different hash functions in use we can create a > mapping of the data that is hashed such that we have a 1:1 mapping of > identifiers of each type at which point it really doesn't matter which > hash function is used. Yes, we *can*, that doesn't mean we *should*. If you do `git commit --ammend --signoff` to add your `Signed-off-by` to a commit, there's a 1:1 mapping from the original commit, to the new one, but conceptually in git they are different objects. I recall Linus Torvalds mentioned he used Monotone as a guideline of what *not* to do. In Monotone you could add the equivalent of `Signed-off-by` without changing the hash of the commit, in fact, you could add any metadata if I recall correctly. But this opens a whole can of worms because now how do you know you have all the metadata relevant to the commit? Making all the metadata of a commit part of the commit solves the integrity problem Monotone had at the cost of making git commits essentially immutable: any change means it's a different commit. If making *any* change in the object, makes it conceptually a different object, including the type of the object, how on Earth is changing the hash algorithm not considered a change? This object: ❯ git hash-object -t blob /dev/null e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 Is considered different from this object: ❯ git hash-object -t tree /dev/null 4b825dc642cb6eb9a060e54bf8d69288fbee4904 That's why they have a different hash. Why would these objects be considered the same? ❯ git hash-object -t blob /dev/null e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 ❯ git hash-object -t blob /dev/null 473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813 It makes *zero* sense that adding a period changes the object, adding a s-o-b changes the object, changing the type changes the object, but changing the hash algorithm does not. > > The fact that apparently it's so easy to clone a repository with > > the wrong hash algorithm should give developers pause, as it means the > > whole point of using cryptographic hash algorithms to ensure the > > integrity of the commit history is completely gone. > > This is a leap too far. The fact that it is "so easy to clone a repo > with the wrong hash algorithm" is completely orthogonal to the > fundamental principles of hash identifiers from strong hash functions. Only if you think changing the hash algorithm is a less important part of an object than adding a period. Do you honestly think these two should be considered the same object? a) tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904 author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600 committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600 Initial commit b) tree 6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321 author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600 committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600 Initial commit > You seem to be deriving grand conclusions from what sounds to me like > a simple bug/design-oversight. I think you are dismissing the brilliant idea that made git's object storage model so successful. > > I have not been following the SHA-1 -> OID discussions, but I > > distinctively recall Linus Torvalds mentioning that the choice of using > > SHA-1 wasn't even for security purposes, it was to ensure integrity. > > When I do a `git fetch` as long as the new commits have the same SHA-1 > > as parent as the SHA-1s I have in my repository I can be relatively > > certain the repository has not been tampered with. Which means that if I > > do a `git fetch` that suddenly brings SHA-256 commits, some of them must > > have SHA-1 parents that match the ones I currently have. Otherwise how > > do I know it's the same history? > > So consider what /could/ happen here. You fetch a commit which uses > SHA-256 into a repo where all of your local commits use SHA-1. The > commit you fetched says its parent is some SHA-256 ID you don't know > about as all your ID's are SHA-1. So git then could go and construct > an index, hashing each item using SHA-256 instead of SHA-1, and using > the result to build a bi-directional mapping from SHA-1 to SHA-256 and > back. All it has to do then is look into the mapping to find if the > SHA-256 parent id is present in your repo. If it is then you know it's > the same history. Yeah, that *could* happen. Doesn't mean it *should*. > The key point here is that if you ignore SHAttered artifacts (which > seems reasonable as you can detect the attack during hashing) you can > build a 1:1 map of SHA-1 and SHA-256 ids. Once you have that mapping > it doesn't matter which ID is used. It may not matter to you. It matters to me. > > Maybe that's one of the reasons people don't seem particularly eager to > > move away from SHA-1: > > Maybe, but it doesn't make sense to me. You seem to be putting undue > weight on an unnecessary aspect of the git design: there doesn't seem > to be a reason for Linuses "no aliasing" policy, and it seems like one > could build a git-a-like without it without suffering any significant > penalties. The fact you don't see a reason doesn't mean there isn't one. This is an argument from ignorance fallacy. The appendix was considered a vestigial organ because nobody could see a reason for it. Did that mean it served no puprpose? No. > Regardless, provided that the hash functions allow a 1:1 mapping of > ID's (which is assumed by using "collision free hash functions"), it > seems like it really doesn't matter which hash is used at any given > time. People who designed CVS, Subversion, Monotone, and Mercurial didn't see a reason for many of Git's design choices either. I'd argue they were wrong. I think changing the hash algorithm of a commit matters. Cheers.
On 2023-05-02 at 23:46:02, Felipe Contreras wrote: > In my view one repository should be able to have part SHA-1 history, > part SHA3-256 history, and part BLAKE2b history. That is practically very difficult and it means that it's hard to have confidence in the later history because SHA-1 is weak and you have to rely on it to verify the SHA-256 history later. Since attacks always get better, SHA-1 will eventually be so weak that collisions can be computed in the amount of time we now take for MD4 or MD5 collisions (i.e., seconds), and with your plan, we'd have to retain that history forever with the resulting lack of confidence in part of the history. This also doesn't work with various structures like trees, the index, and pack and index formats, which have no indication of the algorithm used and simply rely on fixed-size, often 4-byte aligned object IDs without any metadata. In addition, the internals of the code often don't pass around enough data to make these values variable and thus this approach would substantially complicate the code in many ways. Also, we've already decided on the current design a long time ago with the transition plan after extensive, thoughtful discussion by many people. Very few people other than me have worked on sending patches to work on the hash function transition, and that work up to now has all been done on my personal time, without compensation of any sort, out of a desire to improve the project. Lots of people have opined on how it should have been different without sending any patches. If you would like to propose patches for the extensive amount of work to implement your solution, then we could consider them, although I will warn you that your approach will likely require at least several hundred patches. However, I refer you to the list archives to determine why your approach is not the one we chose and is not, in my view, the best path forward. I should also be clear that I have no intention of submitting patches to change our approach now or in the future, or redoing the patches I've already sent. > The fact that apparently it's so easy to clone a repository with > the wrong hash algorithm should give developers pause, as it means the > whole point of using cryptographic hash algorithms to ensure the > integrity of the commit history is completely gone. No, it doesn't. It means that our empty repositories until recently lacked any indication of the algorithm or other capabilities, which was a mistake in our original protocol design that has now been corrected. If you interact with the repository later on when it has data, then if you're using the wrong hash algorithm, you'll find that you get a helpful error message that that's not yet supported. If you patched Git to ignore that check, you'd find that your repository would just be very broken in many ways with lots of random crashing and seemingly unrelated error messages instead of subtly using the wrong algorithm.
Adam Majer wrote: > On May 3, 2023 5:44:24 p.m. GMT+02:00, Felipe Contreras <felipe.contreras@gmail.com> wrote: > >Git was designed to make it *impossible* to confuse two commits with similar > >data. > > That was never ever the problem here. But it will be. > >> This is different. But aside, type + size + data are not really much > >> different from just having data in a hash function. > > > >It's completely different. > > How so? Type and size are just about 2 and a dozen bits of data, respectfully. Do you understand how checksums work? Compare these two objects: 1. 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 2. 6ba62a7c5e3e9a260c5a30adf2756882c02f12a6 Are they a) "not much different", or b) "completely different"? Answer: doesn't matter, they are *different*. Period. > >There are different philosophical views of what "security" means, and it seems > >pretty clear to me that your view does not align with the view of Linus > >Torvalds. > > > I'm not sure why you are name dropping Linus everywhere I don't know if you are aware, but Linus Torvalds is the author of git. He also happens to be the author of the most successful software project in history: Linux. So generally his design choices are considered to be good. > or assuming you know more than anyone here about hash functions. I don't assume such a thing. But I'm pretty certain not many people are aware of the integrity issues VCSs presented circa 2004, that git hashes solved in 2005, because if they did, they could have created an object model storage similar to git's, and no one did (except Linus Torvalds). > Your explanation is quite clear to me (and probably everyone else > here). But I'll just leave it at that. Is it? Then you would have no trouble steel manning my argument, which you haven't done.
brian m. carlson wrote: > On 2023-05-02 at 23:46:02, Felipe Contreras wrote: > > In my view one repository should be able to have part SHA-1 history, > > part SHA3-256 history, and part BLAKE2b history. > > That is practically very difficult and it means that it's hard to have > confidence in the later history because SHA-1 is weak and you have to > rely on it to verify the SHA-256 history later. Why would I have to rely on SHA-1 to verify the SHA-256 history later on? > Since attacks always get better, SHA-1 will eventually be so weak that > collisions can be computed in the amount of time we now take for MD4 > or MD5 collisions (i.e., seconds), and with your plan, we'd have to > retain that history forever with the resulting lack of confidence in > part of the history. We have to do the same with your plan as well. Your plan relies on SHA-256 being interchangeable with SHA-1, so if the Git project decided to switch *today* to SHA-256, we would have two object ids: 1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8 2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116 This object would refer to a tree and a parent object with SHA-1 ids, which would be OK, because they would be interchangeable with some corresponding SHA-256 ids. Isn't that your plan? Therefore the SHA-1 of the parent of the commit, and the tree of the commit would be trusted and retained forever. > This also doesn't work with various structures like trees, the index, > and pack and index formats, which have no indication of the algorithm > used and simply rely on fixed-size, often 4-byte aligned object IDs > without any metadata. So? The index and pack objects can be regenerated, so at any point in time they could be regenerated for SHA-1 or SHA-256. The tree object is a no-brainer. For an object of type "commit:256" you require a tree of type "tree:256". Easy. > In addition, the internals of the code often don't pass around enough > data to make these values variable and thus this approach would > substantially complicate the code in many ways. Really? `enum object_type` is not passed around? > Also, we've already decided on the current design a long time ago with > the transition plan after extensive, thoughtful discussion by many > people. Who is "we"? I've participated in many discussions in the git mailing list where the consensus is that 99% of people decide to do something, and that something never happens. The fact that "we" have decided something doesn't carry as much weight as you seem to think it does. Moreover, haven't "we" decided that this transitioning plan is *tentantive*, and the SHA-256 feature is *experimental*? > Very few people other than me have worked on sending patches to > work on the hash function transition, and that work up to now has all > been done on my personal time, without compensation of any sort, out of > a desire to improve the project. Which seems to suggest if there is a need, it's not very pressing. Doesn't it? > Lots of people have opined on how it should have been different > without sending any patches. As is typical. > If you would like to propose patches for the extensive amount of work > to implement your solution, then we could consider them, although I > will warn you that your approach will likely require at least several > hundred patches. That's not an issue. I've started projects with several hundred patches just to prove that something is possible. > However, I refer you to the list archives to determine why > your approach is not the one we chose and is not, in my view, the best > path forward. Yeah? Provide me with *one* mail proposing my approach. > I should also be clear that I have no intention of submitting patches > to change our approach now or in the future, or redoing the patches > I've already sent. You don't have to. (and it's not really necessary as it's typically the case that people don't provide patches for designs that compete against their own). > > The fact that apparently it's so easy to clone a repository with > > the wrong hash algorithm should give developers pause, as it means the > > whole point of using cryptographic hash algorithms to ensure the > > integrity of the commit history is completely gone. > > No, it doesn't. It means that our empty repositories until recently > lacked any indication of the algorithm or other capabilities, which was > a mistake in our original protocol design that has now been corrected. Yes it does. Can I clone a repository that already transitioned to SHA-256, and then push a SHA-1 commit? Well, of course I can't because that's not currently implemented, but if we followed the current plan that apparently "we" have decided on, it should be. > If you interact with the repository later on when it has data, then if > you're using the wrong hash algorithm, you'll find that you get a > helpful error message that that's not yet supported. That isn't true according to your plan. SHA-1 would be interchangeable with SHA-256, would it not? So according to the current plan, I would be able to push a SHA-1 commit on a SHA-256 repository. --- Is it not the case that the current plan aims to have support for SHA-1 and SHA-256 object ids at the same time? In other words: in your ideal world, the following object ids would *both* refer to the same git object: 1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8 2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116 Would they not?
On 2023-05-08 at 02:00:56, Felipe Contreras wrote: > brian m. carlson wrote: > > On 2023-05-02 at 23:46:02, Felipe Contreras wrote: > > > In my view one repository should be able to have part SHA-1 history, > > > part SHA3-256 history, and part BLAKE2b history. > > > > That is practically very difficult and it means that it's hard to have > > confidence in the later history because SHA-1 is weak and you have to > > rely on it to verify the SHA-256 history later. > > Why would I have to rely on SHA-1 to verify the SHA-256 history later > on? If your history contains mixed and matched hash algorithms, you'll need to be able to verify those commits to the root to have any confidence in a signed commit or tag, which means trusting SHA-1 if you have any SHA-1 commits in the repository. > > Since attacks always get better, SHA-1 will eventually be so weak that > > collisions can be computed in the amount of time we now take for MD4 > > or MD5 collisions (i.e., seconds), and with your plan, we'd have to > > retain that history forever with the resulting lack of confidence in > > part of the history. > > We have to do the same with your plan as well. > > Your plan relies on SHA-256 being interchangeable with SHA-1, so if the > Git project decided to switch *today* to SHA-256, we would have two > object ids: > > 1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8 > 2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116 > > This object would refer to a tree and a parent object with SHA-1 ids, > which would be OK, because they would be interchangeable with some > corresponding SHA-256 ids. > > Isn't that your plan? For a period of time, yes. At some point, people will abandon SHA-1 and won't use it anymore for a particular repository, and then its security doesn't matter. > Therefore the SHA-1 of the parent of the commit, and the tree of the > commit would be trusted and retained forever. Nope. At some point, we just turn off SHA-1. > > This also doesn't work with various structures like trees, the index, > > and pack and index formats, which have no indication of the algorithm > > used and simply rely on fixed-size, often 4-byte aligned object IDs > > without any metadata. > > So? The index and pack objects can be regenerated, so at any point in > time they could be regenerated for SHA-1 or SHA-256. Right, and that point you've basically converted the repository over. > The tree object is a no-brainer. For an object of type "commit:256" you > require a tree of type "tree:256". Easy. That doesn't work with the pack format because there are only seven valid types of objects, and five of them are used. > > Also, we've already decided on the current design a long time ago with > > the transition plan after extensive, thoughtful discussion by many > > people. > > Who is "we"? The list members. > I've participated in many discussions in the git mailing list where the > consensus is that 99% of people decide to do something, and that > something never happens. > > The fact that "we" have decided something doesn't carry as much weight > as you seem to think it does. Jonathan Nieder decided to propose an approach for how we'd go about this, and it was discussed extensively on the list and the parts that I've implemented have almost completely conformed to that documentation. I think his approach was very thoughtful and addressed many questions about how the project was to proceed, and I appreciate that he sent it and that others contributed in a helpful way. The project wouldn't have been possible unless we had a clear decision on how to implement things. There was an opportunity at the time to comment on the approach and propose alternatives, and we didn't choose to adopt any. > Moreover, haven't "we" decided that this transitioning plan is > *tentantive*, and the SHA-256 feature is *experimental*? We've documented that it's experimental because it's not seeing wide use. I expect that will change, at which point it will no longer be experimental. The design is not tentative and there are no plans to change it. > > Very few people other than me have worked on sending patches to > > work on the hash function transition, and that work up to now has all > > been done on my personal time, without compensation of any sort, out of > > a desire to improve the project. > > Which seems to suggest if there is a need, it's not very pressing. > > Doesn't it? No, I don't agree. An approach to continue to use SHA-1 indefinitely means that Git will not be viable at many major organizations and in many major governments which have restrictions on using insecure cryptography. I think this ends the point at which I'd like to respond to your proposal further. I continue to feel that it's argumentative, unproductive, and not in the best interests of the project, and I don't think continuing here is going to shed any more light on the topic. Ultimately, I feel that my previous statement that we're not going to see eye to eye on most things continues to be correct. I also don't think arguing with you is a productive use of my time or a positive contribution to the community, and in general, I'm going to avoid commenting on your patches or proposals because I can't see a way that we can interact in a useful and helpful way on the list or elsewhere. As a result, I'll kindly ask you to refrain from CCing me on further emails to the list or otherwise emailing me for the indefinite future, and I'll do the same for you after this email.
On Mon, May 08, 2023 at 09:38:56PM +0000, brian m. carlson wrote: >On 2023-05-08 at 02:00:56, Felipe Contreras wrote: >> brian m. carlson wrote: >> > On 2023-05-02 at 23:46:02, Felipe Contreras wrote: >> > > In my view one repository should be able to have part SHA-1 history, >> > > part SHA3-256 history, and part BLAKE2b history. >> > >> > That is practically very difficult and it means that it's hard to have >> > confidence in the later history because SHA-1 is weak and you have to >> > rely on it to verify the SHA-256 history later. >> >> Why would I have to rely on SHA-1 to verify the SHA-256 history later >> on? > >If your history contains mixed and matched hash algorithms, you'll need >to be able to verify those commits to the root to have any confidence in >a signed commit or tag, which means trusting SHA-1 if you have any SHA-1 >commits in the repository. > the history is traversed from the end anyway, so having sha-1 in the history is entirely irrelevant for verifying sha-256 commits, assuming one may only upgrade the algorithm. the transition plan implies the intent to ultimately get rid of old algos, but this is a non-starter, because old histories need to remain accessible indefinitely (you can't rewrite all external references, and even for in-history references this would be unreliable and would falsify historical builds). i won't try making an argument for mixed histories, as i'm assuming i wouldn't add anything that hasn't already been written. -- ossi
Oswald Buddenhagen <oswald.buddenhagen@gmx.de> writes: >>If your history contains mixed and matched hash algorithms, you'll need >>to be able to verify those commits to the root to have any confidence in >>a signed commit or tag, which means trusting SHA-1 if you have any SHA-1 >>commits in the repository. >> > the history is traversed from the end anyway, so having sha-1 in the > history is entirely irrelevant for verifying sha-256 commits, assuming > one may only upgrade the algorithm. That depends on what is meant by "verify a commit". If we are only interested in the tree contents, then the newer commits whose trees are hashed with a more secure hash algorithm recursively down to the blobs would lack any weak link hashed by a less secure algorithm. Most people do not care as deeply how their project tree came to the shape it has today as they care about what is in the recent trees, so this is an acceptable stance to take. If the less secure algorithm becomes so weak that the history, up to the last commit that was signed by it, can be rewritten arbitrarily, however, an attacker can lie about why the code that survives to this day looks the way it does by forging old parts of the history, and mislead today's developers to do wrong things. The only way to prevent that kind of attack is to verify a commit by recursively making sure not just the commit and its tree, but its parents are all authentic, and less secure algorithm may make it impossible. The old history hashed with the old algorithm needs to be kept to help external references, but I think as a mitigation, those who care about that part of history can create a copy of the history hashed with the new algorithm and publish the correspondence between the two parallel histories to assure the integrity of that part of the old history.
diff --git a/Documentation/git.txt b/Documentation/git.txt index 74973d3cc4..48eda9f883 100644 --- a/Documentation/git.txt +++ b/Documentation/git.txt @@ -547,9 +547,13 @@ double-quotes and respecting backslash escapes. E.g., the value `GIT_DEFAULT_HASH`:: If this variable is set, the default hash algorithm for new repositories will be set to this value. This value is currently - ignored when cloning; the setting of the remote repository - is used instead. The default is "sha1". THIS VARIABLE IS - EXPERIMENTAL! See `--object-format` in linkgit:git-init[1]. + ignored when cloning if the remote value can be definitively + determined; the setting of the remote repository is used + instead. The value is honored if the remote repository's + algorithm cannot be determined, such as some cases when + the remote repository is empty. The default is "sha1". + THIS VARIABLE IS EXPERIMENTAL! See `--object-format` + in linkgit:git-init[1]. Git Commits ~~~~~~~~~~~ diff --git a/builtin/clone.c b/builtin/clone.c index 186845ef0b..c207798de9 100644 --- a/builtin/clone.c +++ b/builtin/clone.c @@ -1316,13 +1316,15 @@ int cmd_clone(int argc, const char **argv, const char *prefix) } } + if (transport_get_hash_algo_explicit(transport)) { /* * Now that we know what algorithm the remote side is using, * let's set ours to the same thing. */ - hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport)); - initialize_repository_version(hash_algo, 1); - repo_set_hash_algo(the_repository, hash_algo); + hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport)); + initialize_repository_version(hash_algo, 1); + repo_set_hash_algo(the_repository, hash_algo); + } if (mapped_refs) { /* diff --git a/connect.c b/connect.c index 3a0186280c..40cb9bf261 100644 --- a/connect.c +++ b/connect.c @@ -243,8 +243,10 @@ static void process_capabilities(struct packet_reader *reader, int *linelen) if (feat_val) { char *hash_name = xstrndup(feat_val, feat_len); int hash_algo = hash_algo_by_name(hash_name); - if (hash_algo != GIT_HASH_UNKNOWN) + if (hash_algo != GIT_HASH_UNKNOWN) { reader->hash_algo = &hash_algos[hash_algo]; + reader->hash_algo_explicit = 1; + } free(hash_name); } else { reader->hash_algo = &hash_algos[GIT_HASH_SHA1]; @@ -493,6 +495,7 @@ static void send_capabilities(int fd_out, struct packet_reader *reader) if (hash_algo == GIT_HASH_UNKNOWN) die(_("unknown object format '%s' specified by server"), hash_name); reader->hash_algo = &hash_algos[hash_algo]; + reader->hash_algo_explicit = 1; packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name); } else { reader->hash_algo = &hash_algos[GIT_HASH_SHA1]; diff --git a/pkt-line.h b/pkt-line.h index 8e9846f315..10700a9d8c 100644 --- a/pkt-line.h +++ b/pkt-line.h @@ -190,6 +190,8 @@ struct packet_reader { int line_peeked; unsigned use_sideband : 1; + /* indicates if we saw an explicit capability */ + unsigned hash_algo_explicit : 1; const char *me; /* hash algorithm in use */ diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh index 3cd9db9012..ad24c7fe64 100755 --- a/t/t5700-protocol-v1.sh +++ b/t/t5700-protocol-v1.sh @@ -244,6 +244,17 @@ test_expect_success 'push with ssh:// using protocol v1' ' grep "push< version 1" log ' +test_expect_success 'clone propagates object-format from empty repo' ' + test_when_finished "rm -fr src256 dst256" && + + echo sha256 >expect && + git init --object-format=sha256 src256 && + GIT_DEFAULT_HASH=sha256 git -c protocol.version=1 clone --no-local src256 dst256 && + git -C dst256 rev-parse --show-object-format >actual && + + test_cmp expect actual +' + # Test protocol v1 with 'http://' transport # . "$TEST_DIRECTORY"/lib-httpd.sh diff --git a/transport-helper.c b/transport-helper.c index 6b816940dc..c65cf7c620 100644 --- a/transport-helper.c +++ b/transport-helper.c @@ -1236,6 +1236,7 @@ static struct ref *get_refs_list_using_list(struct transport *transport, die(_("unsupported object format '%s'"), value); transport->hash_algo = &hash_algos[algo]; + transport->hash_algo_explicit = 1; } continue; } diff --git a/transport.c b/transport.c index 67afdae57c..7774487e8d 100644 --- a/transport.c +++ b/transport.c @@ -147,6 +147,12 @@ static void get_refs_from_bundle_inner(struct transport *transport) die(_("could not read bundle '%s'"), transport->url); transport->hash_algo = data->header.hash_algo; + /* + * This is always set, even if we didn't get an explicit object-format + * capability, since we know that a missing capability or a v2 bundle + * definitively indicates SHA-1. + */ + transport->hash_algo_explicit = 1; } static struct ref *get_refs_from_bundle(struct transport *transport, @@ -190,6 +196,7 @@ static int fetch_refs_from_bundle(struct transport *transport, ret = unbundle(the_repository, &data->header, data->fd, &extra_index_pack_args, 0); transport->hash_algo = data->header.hash_algo; + transport->hash_algo_explicit = 1; return ret; } @@ -360,6 +367,7 @@ static struct ref *handshake(struct transport *transport, int for_push, } data->finished_handshake = 1; transport->hash_algo = reader.hash_algo; + transport->hash_algo_explicit = reader.hash_algo_explicit; if (reader.line_peeked) BUG("buffer must be empty at the end of handshake()"); @@ -1190,6 +1198,7 @@ struct transport *transport_get(struct remote *remote, const char *url) } ret->hash_algo = &hash_algos[GIT_HASH_SHA1]; + ret->hash_algo_explicit = 0; return ret; } @@ -1199,6 +1208,11 @@ const struct git_hash_algo *transport_get_hash_algo(struct transport *transport) return transport->hash_algo; } +int transport_get_hash_algo_explicit(struct transport *transport) +{ + return transport->hash_algo_explicit; +} + int transport_set_option(struct transport *transport, const char *name, const char *value) { diff --git a/transport.h b/transport.h index 6393cd9823..ce67eefc58 100644 --- a/transport.h +++ b/transport.h @@ -128,6 +128,11 @@ struct transport { * in transport_set_verbosity(). **/ unsigned progress : 1; + /* + * Indicates whether the hash algorithm was initialized explicitly as + * opposed to using a fallback. + */ + unsigned hash_algo_explicit : 1; /* * If transport is at least potentially smart, this points to * git_transport_options structure to use in case transport @@ -305,6 +310,15 @@ int transport_get_remote_bundle_uri(struct transport *transport); * This can only be called after fetching the remote refs. */ const struct git_hash_algo *transport_get_hash_algo(struct transport *transport); +/* + * Fetch whether the hash algorithm provided was explicitly set. + * + * If this value is false, "transport_get_hash_algo" will always return a value + * of SHA-1, which is the default algorithm if none is specified. + * + * This can only be called after fetching the remote refs. + */ +int transport_get_hash_algo_explicit(struct transport *transport); int transport_fetch_refs(struct transport *transport, struct ref *refs); /*
From: "brian m. carlson" <bk2204@github.com> The previous commit introduced a change that allows HTTP v0 and v1 operations to determine the hash of an empty remote repository by sending capabilities. However, there are still some cases, such as when cloning locally, where the capabilities are not sent. This is because for local operations, we don't strip out the fake "capabilities^{}" ref, and thus "git ls-remote" would produce incorrect values if we did. However, up until 8b214c2e9d ("clone: propagate object-format when cloning from void", 2023-04-05), we honored GIT_DEFAULT_HASH in this case, so let's continue to do that. Check whether the hash algorithm was explicitly set, and if so, continue to use that value. If not, use the default value for GIT_DEFAULT_HASH to ensure that we can at least properly configure an empty clone whose hash algorithm we know. Note that without this patch, git clone cannot create a SHA-256 repository from an empty remote without protocol v2 (except over HTTP, as in the previous patch). --- Documentation/git.txt | 10 +++++++--- builtin/clone.c | 8 +++++--- connect.c | 5 ++++- pkt-line.h | 2 ++ t/t5700-protocol-v1.sh | 11 +++++++++++ transport-helper.c | 1 + transport.c | 14 ++++++++++++++ transport.h | 14 ++++++++++++++ 8 files changed, 58 insertions(+), 7 deletions(-)