Message ID | cover.1596887883.git.congdanhqx@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | extend --abbrev support to diff-patch format | expand |
Đoàn Trần Công Danh <congdanhqx@gmail.com> writes: > Đoàn Trần Công Danh (2): > revision: differentiate if --no-abbrev asked explicitly > diff: extend --abbrev support to diff-patch format It was not clear, at least to me at all, what these patches are trying to achieve (i.e. what end-users appreciate) until I saw the code change X-<. The changes to fill_metainfo() make sense to me. It just needs log messages that explain the intent better. They do not even make it clear that they want to make the abbreviation length of the object names on the "index $from..$to $mode" lines configurable. Thanks.
On Sun, Aug 09, 2020 at 12:01:35PM -0700, Junio C Hamano wrote: > Đoàn Trần Công Danh <congdanhqx@gmail.com> writes: > > > Đoàn Trần Công Danh (2): > > revision: differentiate if --no-abbrev asked explicitly > > diff: extend --abbrev support to diff-patch format > > It was not clear, at least to me at all, what these patches are > trying to achieve (i.e. what end-users appreciate) until I saw the > code change X-<. > > The changes to fill_metainfo() make sense to me. It just needs log > messages that explain the intent better. They do not even make it > clear that they want to make the abbreviation length of the object > names on the "index $from..$to $mode" lines configurable. After reading the original including cover letter, I'm still confused using why --full-index is not the solution for most cases. Perhaps that would be worth touching on, as well. -Peff
On 2020-08-10 06:00:38-0400, Jeff King <peff@peff.net> wrote: > On Sun, Aug 09, 2020 at 12:01:35PM -0700, Junio C Hamano wrote: > > > Đoàn Trần Công Danh <congdanhqx@gmail.com> writes: > > > > > Đoàn Trần Công Danh (2): > > > revision: differentiate if --no-abbrev asked explicitly > > > diff: extend --abbrev support to diff-patch format > > > > It was not clear, at least to me at all, what these patches are > > trying to achieve (i.e. what end-users appreciate) until I saw the > > code change X-<. > > > > The changes to fill_metainfo() make sense to me. It just needs log > > messages that explain the intent better. They do not even make it > > clear that they want to make the abbreviation length of the object > > names on the "index $from..$to $mode" lines configurable. > > After reading the original including cover letter, I'm still confused > using why --full-index is not the solution for most cases. Perhaps that > would be worth touching on, as well. At that time, I'm not really sure what should be written there. The commit message was inspired by --abbrev documentation. Reading both of your two's emails, I think this one could be used: I'll resend this serie if this serie is deemed useful with this explaination. diff: index-line: make object name's abbrev length configurable There are some projects that want to archive and track only released version of other software projects. They also want to backport some changes into those versions unsupported by upstream. Most of git hosting services support some method to download patches without cloning the full (potentially large) repository and/or fiddling with git partial-clone or sparse-checkout. Most of those git hosting services generate those patches with git-format-patch(1) or something alike. Due to its large amount of objects, their object names' abbreviation in the index-line is usually long but not full. A lot of those patches couldn't be applied cleanly to old versions of said software, thus requires some changes from developer and they needs to be regenerated from their trimmed tree. Because the archive tree has significantly fewer objects, the abbreviation in the index line is usually shorter than the original patch. Thus, it generates some noise when said developers try to compare the new patch with the original patch if there's an exact file-hunk match. Make the object name's abbreviation length configurable to lower those noise. <Below is the note in 2/2, I don't know if it's worth put into commit message> To preserve backward compatibility with old script that specify both --full-index and --abbrev, always shows full object id if --full-index is specified.
Jeff King <peff@peff.net> writes: > On Sun, Aug 09, 2020 at 12:01:35PM -0700, Junio C Hamano wrote: > >> Đoàn Trần Công Danh <congdanhqx@gmail.com> writes: >> >> > Đoàn Trần Công Danh (2): >> > revision: differentiate if --no-abbrev asked explicitly >> > diff: extend --abbrev support to diff-patch format >> >> It was not clear, at least to me at all, what these patches are >> trying to achieve (i.e. what end-users appreciate) until I saw the >> code change X-<. >> >> The changes to fill_metainfo() make sense to me. It just needs log >> messages that explain the intent better. They do not even make it >> clear that they want to make the abbreviation length of the object >> names on the "index $from..$to $mode" lines configurable. > > After reading the original including cover letter, I'm still confused > using why --full-index is not the solution for most cases. Perhaps that > would be worth touching on, as well. True. Presumably you could force some stability without sacrificing line length limit by using --abbrev=12 instead of --full-index but I do not think it is such a big deal. But it does look odd that we use a special/single-purpose option --full-index to control the length only for those two object names on the "index" line, when all the other object names we see are controlled with the --abbrev option.
Đoàn Trần Công Danh <congdanhqx@gmail.com> writes: > Reading both of your two's emails, I think this one could be used: > I'll resend this serie if this serie is deemed useful with this > explaination. > > diff: index-line: make object name's abbrev length configurable > > There are some projects that want to archive and track only > released version of other software projects. They also want > to backport some changes into those versions unsupported by > upstream. Most of git hosting services support some method to > download patches without cloning the full (potentially large) > repository and/or fiddling with git partial-clone or > sparse-checkout. > > Most of those git hosting services generate those patches with > git-format-patch(1) or something alike. Due to its large > amount of objects, their object names' abbreviation in the > index-line is usually long but not full. > > A lot of those patches couldn't be applied cleanly to old > versions of said software, thus requires some changes from > developer and they needs to be regenerated from their trimmed > tree. Because the archive tree has significantly fewer > objects, the abbreviation in the index line is usually shorter > than the original patch. Thus, it generates some noise when > said developers try to compare the new patch with the original > patch if there's an exact file-hunk match. > > Make the object name's abbreviation length configurable to > lower those noise. I agree with Peff that with the above as the sole motivating use case, the "--full-index" option is the right approach. It is a much more robust solution than "--abbrev=16 would be long enough for all project participants to avoid length drift". IOW these four paragraphs do not argue _for_ this change, at least to me. On the other hand, I think you could argue that "--full-index" is merely a synonym for "--abbrev=40", and the patch fixes the inconsistency between the object names on the "index" line, which can choose only between the default abbrev length and the full abbrev length, and all the other places we show object names, which uniformly honor the "--abbrev" option. > <Below is the note in 2/2, I don't know if it's worth put > into commit message> > > To preserve backward compatibility with old script that specify > both --full-index and --abbrev, always shows full object id > if --full-index is specified.
On Mon, Aug 10, 2020 at 08:15:41AM -0700, Junio C Hamano wrote: > > A lot of those patches couldn't be applied cleanly to old > > versions of said software, thus requires some changes from > > developer and they needs to be regenerated from their trimmed > > tree. Because the archive tree has significantly fewer > > objects, the abbreviation in the index line is usually shorter > > than the original patch. Thus, it generates some noise when > > said developers try to compare the new patch with the original > > patch if there's an exact file-hunk match. > > > > Make the object name's abbreviation length configurable to > > lower those noise. > > I agree with Peff that with the above as the sole motivating use > case, the "--full-index" option is the right approach. It is a much > more robust solution than "--abbrev=16 would be long enough for all > project participants to avoid length drift". IOW these four > paragraphs do not argue _for_ this change, at least to me. Yeah, that's what I was getting at: if you care about robust machine-readability, then the full index is the best solution. Reading between the lines, I think the argument may be "using --full-index is too long and therefore ugly, so people like the short-ish names but with a bit of extra safety". There's an extra challenge here, which is that you have to convince the sender to use the extra --abbrev option, even though they themselves won't be the ones running into the problem when applying. But I don't think there's an elegant solution to that (we could just bump the default abbrev everywhere to 12+, which is enough in practice). Though I'm not 100% sure that "git apply" is smart enough to only look at blobs (i.e., if "1234abcd" is ambiguous between a tree and a blob, ignore the tree since patches always apply to blobs). That might be another avenue that would make things more likely to work without anybody having to configure anything. > On the other hand, I think you could argue that "--full-index" is > merely a synonym for "--abbrev=40", and the patch fixes the > inconsistency between the object names on the "index" line, which > can choose only between the default abbrev length and the full > abbrev length, and all the other places we show object names, which > uniformly honor the "--abbrev" option. Yeah, I certainly don't mind the extra flexibility between "full" and "default" for "index" lines. I do wonder if people want to configure the abbreviations for those lines separately from other parts. I don't know that I've ever particularly cared about that flexibility, but the fact that they were set up separately all those years ago makes me think somebody might. -Peff
On 2020-08-10 11:27:05-0400, Jeff King <peff@peff.net> wrote: > On Mon, Aug 10, 2020 at 08:15:41AM -0700, Junio C Hamano wrote: > > > > A lot of those patches couldn't be applied cleanly to old > > > versions of said software, thus requires some changes from > > > developer and they needs to be regenerated from their trimmed > > > tree. Because the archive tree has significantly fewer > > > objects, the abbreviation in the index line is usually shorter > > > than the original patch. Thus, it generates some noise when > > > said developers try to compare the new patch with the original > > > patch if there's an exact file-hunk match. > > > > > > Make the object name's abbreviation length configurable to > > > lower those noise. > > > > I agree with Peff that with the above as the sole motivating use > > case, the "--full-index" option is the right approach. It is a much > > more robust solution than "--abbrev=16 would be long enough for all > > project participants to avoid length drift". IOW these four > > paragraphs do not argue _for_ this change, at least to me. > > Yeah, that's what I was getting at: if you care about robust > machine-readability, then the full index is the best solution. Reading > between the lines, I think the argument may be "using --full-index is > too long and therefore ugly, so people like the short-ish names but with > a bit of extra safety". My argument was people can either easily fetch the patch via HTTP like: curl -LO https://github.com/git/git/commit/eb12adc74cf22add318f884072be2071d181abaa.patch or take it from a mailing list archive, bugzilla, instead of cloning a full repository. With those options, we can't say, "we prefer full-index, please send us the patch with full-index instead". > > There's an extra challenge here, which is that you have to convince the > sender to use the extra --abbrev option, even though they themselves > won't be the ones running into the problem when applying. Not really, since the sender tree is usually larger than the archived tree, their abbrev is usually long enough, and the receiver will use --abbrev to lengthen their abbrev to reduce the noise instead. > But I don't > think there's an elegant solution to that (we could just bump the > default abbrev everywhere to 12+, which is enough in practice). > Though I'm not 100% sure that "git apply" is smart enough to only look > at blobs (i.e., if "1234abcd" is ambiguous between a tree and a blob, > ignore the tree since patches always apply to blobs). That might be > another avenue that would make things more likely to work without > anybody having to configure anything. > > > On the other hand, I think you could argue that "--full-index" is > > merely a synonym for "--abbrev=40", and the patch fixes the > > inconsistency between the object names on the "index" line, which > > can choose only between the default abbrev length and the full > > abbrev length, and all the other places we show object names, which > > uniformly honor the "--abbrev" option. I think this argument could be a way to go. In fact, I always try to use --abbrev with diff family because I know it works with a handful with other tools, (describe, blame), then I surprise that it doesn't work, and the documentation tells me `--abbrev` only works with diff-raw and diff-tree header line. Then, I keep forgetting that documentation, and I tried again. For now, I filtered out the index line before comparing 2 patches. > Yeah, I certainly don't mind the extra flexibility between "full" and > "default" for "index" lines. I do wonder if people want to configure the > abbreviations for those lines separately from other parts. I don't know > that I've ever particularly cared about that flexibility, but the fact > that they were set up separately all those years ago makes me think > somebody might. I don't think people particularly care about the index line (and to the extent, its length) that much, since the default is number is actually a minimum number, if Git can't differentiate object with that number of characters, Git will show a longer object names anyway. I think most people scripts will put a regex for: /index [a-z0-9]{7,}\.\.[a-z0-9]{7,} [0-7]{6}/ Or even: /index [a-z0-9]+\.\.[a-z0-9]+ [0-7]+/ For the former case, we could change the code in 2/2 to set the minimum default to DEFAULT_ABBREV instead of MINIMUM_ABBREV? For the historical case that users put both --full-index and --abbrev into there scripts, we still keep our promise to not break their script by always respect --full-index, regardless of --abbrev.
On Tue, Aug 11, 2020 at 07:33:59AM +0700, Đoàn Trần Công Danh wrote: > > Yeah, that's what I was getting at: if you care about robust > > machine-readability, then the full index is the best solution. Reading > > between the lines, I think the argument may be "using --full-index is > > too long and therefore ugly, so people like the short-ish names but with > > a bit of extra safety". > > My argument was people can either easily fetch the patch via HTTP like: > > curl -LO https://github.com/git/git/commit/eb12adc74cf22add318f884072be2071d181abaa.patch > > or take it from a mailing list archive, bugzilla, instead of > cloning a full repository. With those options, we can't say, > "we prefer full-index, please send us the patch with full-index > instead". OK. But then how would they use "--abbrev" in that case? I.e., isn't it too late at that point (especially in the mailing list archive case) to do change anything in the formatting of the patch? Maybe I'm confused... > > There's an extra challenge here, which is that you have to convince the > > sender to use the extra --abbrev option, even though they themselves > > won't be the ones running into the problem when applying. > > Not really, since the sender tree is usually larger than the archived > tree, their abbrev is usually long enough, and the receiver will use > --abbrev to lengthen their abbrev to reduce the noise instead. Now I'm doubly confused. If the sender has the larger tree then they'll have the larger abbrev. So what's the problem? Going back to re-read your earlier responses...So...this _isn't_ a problem within Git itself? It's only about people trying to compare textual patches byte-for-byte and seeing different index lines? If that's the case, then it seems to me that the byte comparison is the problem here. If I have: index 1234abcd..5678bcde and index 1234abcd87..5678bcde65 those should be considered equivalent to see if two patches are plausibly the same. And I think tools like git-cherry, etc, would do that (and we provide git-patch-id for that purpose, too). > > Yeah, I certainly don't mind the extra flexibility between "full" and > > "default" for "index" lines. I do wonder if people want to configure the > > abbreviations for those lines separately from other parts. I don't know > > that I've ever particularly cared about that flexibility, but the fact > > that they were set up separately all those years ago makes me think > > somebody might. > > I don't think people particularly care about the index line (and to > the extent, its length) that much, since the default is number is > actually a minimum number, if Git can't differentiate object with that > number of characters, Git will show a longer object names anyway. > > I think most people scripts will put a regex for: > > /index [a-z0-9]{7,}\.\.[a-z0-9]{7,} [0-7]{6}/ > > Or even: > > /index [a-z0-9]+\.\.[a-z0-9]+ [0-7]+/ > > For the former case, we could change the code in 2/2 to set the minimum > default to DEFAULT_ABBREV instead of MINIMUM_ABBREV? > > For the historical case that users put both --full-index and --abbrev > into there scripts, we still keep our promise to not break their > script by always respect --full-index, regardless of --abbrev. I care less about scripting (as you note, anything consuming abbreviated objects has to handle longer-than-minimum names anyway), and was more wondering whether anybody really cared that: git log --abbrev=30 -p kept the short index lines (e.g., because they're easier to read). But I'm having trouble coming up with a plausible reason somebody would want long object names in earlier lines like "Merge:" but not in the patch index lines. And already we respect --abbrev for --raw, so it's not like the diff code isn't already affected. Making "-p" consistent with all the rest of it is probably worth doing regardless. -Peff
On 2020-08-11 01:22:26-0400, Jeff King <peff@peff.net> wrote: > On Tue, Aug 11, 2020 at 07:33:59AM +0700, Đoàn Trần Công Danh wrote: > > > > Yeah, that's what I was getting at: if you care about robust > > > machine-readability, then the full index is the best solution. Reading > > > between the lines, I think the argument may be "using --full-index is > > > too long and therefore ugly, so people like the short-ish names but with > > > a bit of extra safety". > > > > My argument was people can either easily fetch the patch via HTTP like: > > > > curl -LO https://github.com/git/git/commit/eb12adc74cf22add318f884072be2071d181abaa.patch > > > > or take it from a mailing list archive, bugzilla, instead of > > cloning a full repository. With those options, we can't say, > > "we prefer full-index, please send us the patch with full-index > > instead". > > OK. But then how would they use "--abbrev" in that case? I.e., isn't it > too late at that point (especially in the mailing list archive case) to > do change anything in the formatting of the patch? > > Maybe I'm confused... > > > > There's an extra challenge here, which is that you have to convince the > > > sender to use the extra --abbrev option, even though they themselves > > > won't be the ones running into the problem when applying. > > > > Not really, since the sender tree is usually larger than the archived > > tree, their abbrev is usually long enough, and the receiver will use > > --abbrev to lengthen their abbrev to reduce the noise instead. > > Now I'm doubly confused. If the sender has the larger tree then they'll > have the larger abbrev. So what's the problem? > > Going back to re-read your earlier responses...So...this _isn't_ a > problem within Git itself? Correct. It's NOT Git's problems by any mean. > It's only about people trying to compare > textual patches byte-for-byte and seeing different index lines? Yeah, it's about people trying to backport patch to old tree. Fixing conflicts, and try to compare to old patch to see if they have made any errors. Because conflicts resolving is complicated. > > If that's the case, then it seems to me that the byte comparison is the > problem here. If I have: > > index 1234abcd..5678bcde > > and > > index 1234abcd87..5678bcde65 > > those should be considered equivalent to see if two patches are > plausibly the same. And I think tools like git-cherry, etc, would do > that (and we provide git-patch-id for that purpose, too). Yes, git-patch-id is very useful tool. There're time that half of the patch can be applied cleanly with the exact object names. Another half needs to be fixed heavily, (maybe removed). git-cherry and git-patch-id couldn't cope well in those situation. That condition is true if there's a major change to source tree. > > > Yeah, I certainly don't mind the extra flexibility between "full" and > > > "default" for "index" lines. I do wonder if people want to configure the > > > abbreviations for those lines separately from other parts. I don't know > > > that I've ever particularly cared about that flexibility, but the fact > > > that they were set up separately all those years ago makes me think > > > somebody might. > > > > I don't think people particularly care about the index line (and to > > the extent, its length) that much, since the default is number is > > actually a minimum number, if Git can't differentiate object with that > > number of characters, Git will show a longer object names anyway. > > > > I think most people scripts will put a regex for: > > > > /index [a-z0-9]{7,}\.\.[a-z0-9]{7,} [0-7]{6}/ > > > > Or even: > > > > /index [a-z0-9]+\.\.[a-z0-9]+ [0-7]+/ > > > > For the former case, we could change the code in 2/2 to set the minimum > > default to DEFAULT_ABBREV instead of MINIMUM_ABBREV? > > > > For the historical case that users put both --full-index and --abbrev > > into there scripts, we still keep our promise to not break their > > script by always respect --full-index, regardless of --abbrev. > > I care less about scripting (as you note, anything consuming abbreviated > objects has to handle longer-than-minimum names anyway), and was more > wondering whether anybody really cared that: > > git log --abbrev=30 -p > > kept the short index lines (e.g., because they're easier to read). But > I'm having trouble coming up with a plausible reason somebody would want > long object names in earlier lines like "Merge:" but not in the patch > index lines. And already we respect --abbrev for --raw, so it's not like > the diff code isn't already affected. Making "-p" consistent with all > the rest of it is probably worth doing regardless. Yes, I think this is the easier to accept argument. I've gone with that in the resend.