Message ID | CAFjaU5sAVaNHZ0amPXJcbSvsnaijo+3X5Otg_Mntkx2GbikZMA@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | `git diff`/`git apply` can generate/apply ambiguous hunks (ie. in the wrong place) (just like gnu diff/patch) | expand |
On Wednesday, July 3, 2024 11:25 AM, Emanuel Czirai wrote: >This doesn't affect `git rebase` as it's way more robust than simply extracting the >commits as patches and re-applying them. (I haven't looked into `git merge` though, >but I doubt it's affected) > >It seems to me that no matter which algorithm `git diff` uses (of the 4), it generates >the same hunk, because it's really the context length that matters (which by default >is 3), which is the same one that gnu `diff` generates, and its application via `git >apply` is the same as gnu `patch`. > >side note: >`diffy` is a simple rust-written library that behaves (at version 0.4.0) the same as >normal diff and patch apply(with regards to generated diff contents and patch >application; except it doesn't set the original/modified filenames in the patch), and >since my limited experience, I've found it simpler to modify it to make it so that it >generates unambiguous hunks, as a proof of concept that it can be done this way, >here: >https://github.com/bmwill/diffy/issues/31 , whereby increasing the context >length(lines) of the whole patch(though ideally only of the affected hunks) the >initially ambiguous hunk(s) cannot be applied anymore in more than 1 place inside >the original file, thus avoiding both the diff creation and the patch application from >generating and applying ambiguous hunks. > >But forgetting about that for a moment, I'm gonna show you about `git diff` and `git >apply` below: >1. clone cargo's repo: >cd /tmp >git clone https://github.com/rust-lang/cargo.git >cd cargo >2. checkout tag 0.76.0, or branch origin/rust-1.75.0 git checkout 0.76.0 3. manually >edit this file ./src/cargo/core/workspace.rs at line 1118 or manually go to function: >pub fn emit_warnings(&self) -> CargoResult<()> { right before that function's end >which looks like: >Ok(()) >} >so there at line 1118, insert above that Ok(()) something, doesn't matter what, >doesn't have to make sense, like: >if seen_any_warnings { > //comment > bail!("reasons"); >} >save the file >4. try to generate a diff from it: >git diff > /tmp/foo.patch >you get this: >```diff >diff --git a/src/cargo/core/workspace.rs b/src/cargo/core/workspace.rs index >4667c8029..8f0a74473 100644 >--- a/src/cargo/core/workspace.rs >+++ b/src/cargo/core/workspace.rs >@@ -1115,6 +1115,10 @@ impl<'cfg> Workspace<'cfg> { > } > } > } >+ if seen_any_warnings { >+ //comment >+ bail!("reasons"); >+ } > Ok(()) > } > >``` >Now this is an ambiguous patch/hunk because if you try to apply this patch on the >same original file, cumulatively, it applies successfully in 3 different places, but we >won't do that now. >5. now let's discard it(we already have it saved in /tmp/foo.patch) and pretend that >something changed in the original code: >git checkout . >git checkout 0.80.0 >6. reapply that patch on the new changes: >git apply /tmp/foo.patch >(this shows no errors) >7. look at the diff, for no good reason, just to see that it's the same(kind of): >git diff > /tmp/foo2.patch >contents: >```diff >diff --git a/src/cargo/core/workspace.rs b/src/cargo/core/workspace.rs index >55bfb7a10..92709f224 100644 >--- a/src/cargo/core/workspace.rs >+++ b/src/cargo/core/workspace.rs >@@ -1099,6 +1099,10 @@ impl<'gctx> Workspace<'gctx> { > } > } > } >+ if seen_any_warnings { >+ //comment >+ bail!("reasons"); >+ } > Ok(()) > } > >``` >8. now look at the file, where was the patch applied? that's right, it's at the end of >the wrong function: >fn validate_manifest(&mut self) -> CargoResult<()> { vim >src/cargo/core/workspace.rs +1040 jump at the end of it at line 1107, you see it's >our patch there, applied in the wrong spot! > >Hopefully, depending on the change, this kind of patch which applied in the wrong >place, will be caught at (rust)compile time (in my case, it was this) or worse, at >(binary)runtime. > >With the aforementioned `diffy` patch, the generated diff would actually be with a >context of 4, to make it unambiguous, so it would've been this: >```diff >--- original >+++ modified >@@ -1186,8 +1186,12 @@ > self.gctx.shell().warn(msg)? > } > } > } >+ if seen_any_warnings { >+ //use anyhow::bail; >+ bail!("Compilation failed due to cargo warnings! Manually >+ done >this(via cargo patch) so that things like the following (ie. dep key packages= and >using rust pre 1.26.0 which ignores it, downloads squatted >package) will be avoided in the future: >https://github.com/rust-lang/rust/security/advisories/GHSA-phjm-8x66-qw4r"); >+ } > Ok(()) > } > > pub fn emit_lints(&self, pkg: &Package, path: &Path) -> CargoResult<()> { ``` this >hunk is now unambiguous because it cannot be applied in more than 1 place in the >original file, furthermore patching a modified future file will fail(with that `diffy` >patch, and ideally, with `git apply` if any changes are implemented to "fix" this issue) >if any of the hunks can be independently(and during the full patch application too) >applied in more than 1 place. > >I consider that I don't know enough to understand how `git diff`/`git apply` works >internally (and similarly, gnu `diff`/`patch`) to actually change them and make them >generate unambiguous hunks where only the hunks that would've been ambiguous >have increased context size, instead of the whole patch have increased context size >for all hunks(which is what I did for `diffy` too so far, in that proof of concept patch), >therefore if a "fix" is deemed necessary(it may not be, as I might've missed >something and I'm unaware of it, so a fix may be messing other things up, who >knows?!) then I hope someone much more knowledgeable could implement >it(maybe even for gnu diff/patch too), and while I don't think that a "please" would >be enough, I'm still gonna say it: please do so, if so inclined. > >Thank you for your time and consideration. You make good points, but Rust code should not be put into the main git code base as it will break many non-GNU platforms. Perhaps rewriting it is C to be compatible with the git code-base. --Randall
> >I consider that I don't know enough to understand how `git diff`/`git apply` works > >internally (and similarly, gnu `diff`/`patch`) to actually change them and make them > >generate unambiguous hunks where only the hunks that would've been ambiguous > >have increased context size, instead of the whole patch have increased context size > >for all hunks(which is what I did for `diffy` too so far, in that proof of concept patch), > >therefore if a "fix" is deemed necessary(it may not be, as I might've missed > >something and I'm unaware of it, so a fix may be messing other things up, who > >knows?!) then I hope someone much more knowledgeable could implement > >it(maybe even for gnu diff/patch too), and while I don't think that a "please" would > >be enough, I'm still gonna say it: please do so, if so inclined. > > > >Thank you for your time and consideration. > > You make good points, but Rust code should not be put into the main git code base as it will break many non-GNU platforms. Perhaps rewriting it is C to be compatible with the git code-base. > --Randall > Ah, definitely whoever writes the fix would do it in C for the git code base, I didn't mean to imply it would be or should be done in rust, therefore please excuse my failure to communicate that clearly. The `diffy` proof-of-concept patch, is just for `diffy`, in rust, and it's just to show a way this could be done and that "it works" that way. It was easier for me to do it for `diffy` in rust, than in C for git diff/apply or gnu diff/patch. If a fix is to be implemented for `git diff/apply`, it would definitely not be in rust by any means, but C, as you mentioned. Thank you for your reply. Also, I notice that I made a mistake when pasting the patch with the context length of 4, it was a real patch not the one I used in the examples, here's the corrected unambiguous patch: ```diff --- original +++ modified @@ -1114,8 +1114,12 @@ self.config.shell().warn(msg)? } } } + if seen_any_warnings { + //comment + bail!("reasons"); + } Ok(()) } pub fn set_target_dir(&mut self, target_dir: Filesystem) { ``` Cheers, have a great day everyone!
Am 03.07.24 um 17:24 schrieb Emanuel Czirai: > With the aforementioned `diffy` patch, the generated diff would actually be > with a context of 4, to make it unambiguous, so it would've been this: > ```diff > --- original > +++ modified > @@ -1186,8 +1186,12 @@ > self.gctx.shell().warn(msg)? > } > } > } > + if seen_any_warnings { > + //use anyhow::bail; > + bail!("reasons"); > + } > Ok(()) > } > > pub fn emit_lints(&self, pkg: &Package, path: &Path) -> > CargoResult<()> { > ``` > this hunk is now unambiguous because it cannot be applied in more than 1 > place in the original file, This assertion is wrong, assuming that the patch is to be applied to a modified version of 'original'. There is nothing that can be done at the time when a patch is generated to make it unambiguous, not even if the entire file becomes context. The reason is that the modified 'original' could now have the part duplicated that is the context in the patch, and it would be possible to apply the patch any one of the duplicates. Which one? -- Hannes
On Wed, Jul 3, 2024 at 11:01 PM Johannes Sixt <j6t@kdbg.org> wrote: > > Am 03.07.24 um 17:24 schrieb Emanuel Czirai: > > With the aforementioned `diffy` patch, the generated diff would actually be > > with a context of 4, to make it unambiguous, so it would've been this: > > ```diff > > --- original > > +++ modified > > @@ -1186,8 +1186,12 @@ > > self.gctx.shell().warn(msg)? > > } > > } > > } > > + if seen_any_warnings { > > + //use anyhow::bail; > > + bail!("reasons"); > > + } > > Ok(()) > > } > > > > pub fn emit_lints(&self, pkg: &Package, path: &Path) -> > > CargoResult<()> { > > ``` > > this hunk is now unambiguous because it cannot be applied in more than 1 > > place in the original file, > > This assertion is wrong, assuming that the patch is to be applied to a > modified version of 'original'. There is nothing that can be done at the > time when a patch is generated to make it unambiguous, not even if the > entire file becomes context. The reason is that the modified 'original' > could now have the part duplicated that is the context in the patch, and > it would be possible to apply the patch any one of the duplicates. Which > one? > > -- Hannes > tl;dr: you're correct indeed, a hunk is truly unambiguous only when `git apply` determines that, and it thus depends on the target file to apply to; `git diff` can generate unambiguous hunks only with respect to the unchanged original file, but applying that hunk to a changed original file only determines if a hunk is relatively unambiguous, that is, unambiguous with respect to that changed original file. Both diff-ing and patch application are needed to ensure unambiguity. That is correct, which is why it's necessary to have `git apply` also modified so that it tries to apply each hunk independently to the now-modified original to see if it can be applied in more than 1 place, and if so, fail with the proper error message; in addition, after applying it, it would try to apply it again, just in case applying it created a new spot for it to be reapplied, and if it succeeds to re-apply it, it would fail again. I kind of mentioned this I believe, if not, my bad. Both `git diff` and `git apply` would have to each ensure that the hunks are unambiguous for the hunks to be considered unambiguous, therefore you're correct in your statement that my stating that the hunk is unambiguous right after generation via a fixed `git diff` is wrong. It's merely unambiguous only in the context of applying it to the original file, but as soon as a new modified original file is presented, it can be ambiguous as it is, unless `git apply` can determine that indeed it cannot be applied in more than 1 place in this new modified original file, but both: applied independently(just to be sure) and during the application of all previous hunks until then, maybe even more than these 2, because the following hunks can create new spots for it, so a reapply could make that hunk succeed even if all others would fail. Not entirely sure here, how to properly ensure this, that's why someone more knowledgeable might. Ah, I see that I actually tried to say this(the above) after that statement but I used wrong wording "will fail" instead of "should fail" or "would fail" (if fixed, that is): > this hunk is now unambiguous because it cannot be applied in more than 1 > place in the original file, furthermore patching a modified future file > will fail(with that `diffy` patch, and ideally, with `git apply` if any > changes are implemented to "fix" this issue) if any of the hunks can be > independently(and during the full patch application too) applied in more > than 1 place. For what's worth, barring my lack of explaining, the `diffy` patch I've mentioned already does this: makes sure the generated hunks are unambiguous at diff generation, and unambiguous at the time of the application of the patch. But definitely both are required. I guess, I call the hunks unambiguous at diff generation but it implies they're only unambiguous on the original file, and because patching is also "fixed", patching can ensure the hunks cannot be applied in more than in 1 spot and I thus call them unambiguous at patching as well. But both are required (diffing+patching) to ensure that indeed a hunk is truly unambiguous, therefore, you'd be right to say that any static diff file even if it had the whole file as context cannot or shouldn't be called unambiguous, because that's only half of it; because applying that diff as patch(which is the second half of it) is necessary to see if it's unambiguous as it depends on the target that it's applied on. Thus even if the generated hunks(or patch) is unambiguous from the point of view of having generated it, applying it to a modified original file can cause the patching to fail if ambiguity is detected. Seems better to fail(either at diff generation, if context length is too high, or at patch application if ambiguity is detected) than to silently succeed and hope that a compilation will catch it, or worse a runtime behavior is eventually spotted to be wrong. Rust files appear to be far more prone to this ambiguous diff generation due to rustfmt which formats them like seen above with 1 brace per line, and having the possibility of consecutive such braces... Thanks for your reply. Definitely need to think more about calling hunks unambiguous, without also specifying that they're truly only so only when applied(which really makes them only as strong as _relatively_ unambiguous, since they depend on the target file to be applied to). Maybe call them semi-unambiguous, or unambiguous with respect to original file, unclear yet. Perhaps whoever decides to make a patch for this issue can call them differently, or just be aware of this. Cheers! Have a great day everyone!
On Wed, Jul 3, 2024 at 8:25 AM Emanuel Czirai <correabuscar+gitML@gmail.com> wrote: > > Subject: `git diff`/`git apply` can generate/apply ambiguous hunks (ie. in the wrong place) (just like gnu diff/patch) Yes, this is already known. In fact, it was one of the big reasons we changed the default backend in rebase from apply to merge. From the git-rebase manpage: ``` Context The apply backend works by creating a sequence of patches (by calling format-patch internally), and then applying the patches in sequence (calling am internally). Patches are composed of multiple hunks, each with line numbers, a context region, and the actual changes. The line numbers have to be taken with some fuzz, since the other side will likely have inserted or deleted lines earlier in the file. The context region is meant to help find how to adjust the line numbers in order to apply the changes to the right lines. However, if multiple areas of the code have the same surrounding lines of context, the wrong one can be picked. There are real-world cases where this has caused commits to be reapplied incorrectly with no conflicts reported. Setting diff.context to a larger value may prevent such types of problems, but increases the chance of spurious conflicts (since it will require more lines of matching context to apply). The merge backend works with a full copy of each relevant file, insulating it from these types of problems. ``` > This doesn't affect `git rebase` as it's way more robust than simply > extracting the commits as patches and re-applying them. (I haven't looked > into `git merge` though, but I doubt it's affected) This was not always true; and, in fact, rebase is actually still partially affected today -- if you pick the `apply` backend or pick arguments that imply that backend, then you can still run into this problem. The merge backend (the default) is unaffected, and this problem was one of the big reasons for us switching to make the merge backend the default instead of the apply backend. git merge is unaffected.
On Thu, Jul 4, 2024 at 10:08 PM Elijah Newren <newren@gmail.com> wrote: > > On Wed, Jul 3, 2024 at 8:25 AM Emanuel Czirai > <correabuscar+gitML@gmail.com> wrote: > > > > Subject: `git diff`/`git apply` can generate/apply ambiguous hunks (ie. in the wrong place) (just like gnu diff/patch) > > Yes, this is already known. In fact, it was one of the big reasons we > changed the default backend in rebase from apply to merge. From the > git-rebase manpage: > > ``` > Context > The apply backend works by creating a sequence of patches (by calling > format-patch internally), and then applying the patches in sequence > (calling am internally). Patches are composed of multiple hunks, each > with line numbers, a context region, and the actual changes. The line > numbers have to be taken with some fuzz, since the other side will > likely have inserted or deleted lines earlier in the file. The context > region is meant to help find how to adjust the line numbers in order to > apply the changes to the right lines. However, if multiple areas of the > code have the same surrounding lines of context, the wrong one can be > picked. There are real-world cases where this has caused commits to be > reapplied incorrectly with no conflicts reported. Setting diff.context > to a larger value may prevent such types of problems, but increases the > chance of spurious conflicts (since it will require more lines of > matching context to apply). > > The merge backend works with a full copy of each relevant file, > insulating it from these types of problems. > ``` > > > This doesn't affect `git rebase` as it's way more robust than simply > > extracting the commits as patches and re-applying them. (I haven't looked > > into `git merge` though, but I doubt it's affected) > > This was not always true; and, in fact, rebase is actually still > partially affected today -- if you pick the `apply` backend or pick > arguments that imply that backend, then you can still run into this > problem. The merge backend (the default) is unaffected, and this > problem was one of the big reasons for us switching to make the merge > backend the default instead of the apply backend. > > git merge is unaffected. Thank you very much, I really appreciate knowing this very interesting info! Cheers! This also tells me that I shouldn't expect a solution for `git diff`/`git apply` any time soon, else it would've been done at that time, I suppose. It's all good, I'll find some kind of workaround for my use case. (probably something based on `diffy` as it's simpler for me to grasp than the C code) Thanks everyone. All the best!
Elijah Newren <newren@gmail.com> writes: > Yes, this is already known. In fact, it was one of the big reasons we > changed the default backend in rebase from apply to merge. From the > git-rebase manpage: Not exactly on topic of the discussion, but I see a few things problematic in this part, and as I already invested some time reading it, I'd leave #leftoverbits comment here. > ``` > Context > The apply backend works by creating a sequence of patches (by calling > format-patch internally), and then applying the patches in sequence > (calling am internally). Patches are composed of multiple hunks, each `am` should be some marked-up to stand out. It would be even better to spell it out as `git am`. > with line numbers, a context region, and the actual changes. The line > numbers have to be taken with some fuzz, since the other side will "fuzz" -> "offset" In the context of discussing patch application, `fuzz` is a term of art. It is the number of context lines you (the patch applicator) allow the machinery to allow to be different between the patch and the preimage. Git allows *absolutely* no fuzz and there is not even an option to loosen this (this is philosophical design decision originating back in Linus's days). This part is talking about something different. `offset` is another term of art and refers to the difference between the beginning line number recorded in the hunk header, and the actual line in the preimage the patch applies to. Unless you are applying to the same preimage as where the patch was taken from, `offset` being non-zero is perfectly normal, but Git (and other patch applicators) try to minimize the offset.
diff --git a/src/cargo/core/workspace.rs b/src/cargo/core/workspace.rs index 4667c8029..8f0a74473 100644 --- a/src/cargo/core/workspace.rs +++ b/src/cargo/core/workspace.rs @@ -1115,6 +1115,10 @@ impl<'cfg> Workspace<'cfg> { } } } + if seen_any_warnings { + //comment + bail!("reasons"); + } Ok(()) } ``` Now this is an ambiguous patch/hunk because if you try to apply this patch on the same original file, cumulatively, it applies successfully in 3 different places, but we won't do that now. 5. now let's discard it(we already have it saved in /tmp/foo.patch) and pretend that something changed in the original code: git checkout . git checkout 0.80.0 6. reapply that patch on the new changes: git apply /tmp/foo.patch (this shows no errors) 7. look at the diff, for no good reason, just to see that it's the same(kind of): git diff > /tmp/foo2.patch contents: ```diff diff --git a/src/cargo/core/workspace.rs b/src/cargo/core/workspace.rs index 55bfb7a10..92709f224 100644 --- a/src/cargo/core/workspace.rs +++ b/src/cargo/core/workspace.rs @@ -1099,6 +1099,10 @@ impl<'gctx> Workspace<'gctx> { } } } + if seen_any_warnings { + //comment + bail!("reasons"); + } Ok(()) } ``` 8. now look at the file, where was the patch applied? that's right, it's at the end of the wrong function: fn validate_manifest(&mut self) -> CargoResult<()> { vim src/cargo/core/workspace.rs +1040 jump at the end of it at line 1107, you see it's our patch there, applied in the wrong spot! Hopefully, depending on the change, this kind of patch which applied in the wrong place, will be caught at (rust)compile time (in my case, it was this) or worse, at (binary)runtime. With the aforementioned `diffy` patch, the generated diff would actually be with a context of 4, to make it unambiguous, so it would've been this: ```diff --- original +++ modified @@ -1186,8 +1186,12 @@ self.gctx.shell().warn(msg)? } } } + if seen_any_warnings { + //use anyhow::bail; + bail!("Compilation failed due to cargo warnings! Manually done this(via cargo patch) so that things like the following (ie. dep key packages= and using rust pre 1.26.0 which ignores it, downloads squatted package) will be avoided in the future: https://github.com/rust-lang/rust/security/advisories/GHSA-phjm-8x66-qw4r"); + }