mbox series

[RFC,0/2] extend --abbrev support to diff-patch format

Message ID cover.1596887883.git.congdanhqx@gmail.com (mailing list archive)
Headers show
Series extend --abbrev support to diff-patch format | expand

Message

Đoàn Trần Công Danh Aug. 9, 2020, 2:19 a.m. UTC
For some reason, projects tracks local copy of released version of other
projects, and backports change to earlier version.

Due to lower number of object in simplified history,
abbreviated object id has fewer characters, thus generate some noise
when projects try to compare the backported patch with original patch
if the file hunk is an exact match.

This series try to lower that noise.
Since this is very localised use case,
and the noise is not completely eliminated,
other experienced developers may have better opinions.

Đoàn Trần Công Danh (2):
  revision: differentiate if --no-abbrev asked explicitly
  diff: extend --abbrev support to diff-patch format

 Documentation/diff-options.txt                |  9 +++---
 diff.c                                        |  5 +++-
 revision.c                                    |  2 +-
 t/t4013-diff-various.sh                       |  3 ++
 ...ff.diff-tree_--root_-p_--abbrev=10_initial | 29 +++++++++++++++++++
 ...--root_-p_--full-index_--abbrev=10_initial | 29 +++++++++++++++++++
 ...f.diff-tree_--root_-p_--full-index_initial | 29 +++++++++++++++++++
 7 files changed, 100 insertions(+), 6 deletions(-)
 create mode 100644 t/t4013/diff.diff-tree_--root_-p_--abbrev=10_initial
 create mode 100644 t/t4013/diff.diff-tree_--root_-p_--full-index_--abbrev=10_initial
 create mode 100644 t/t4013/diff.diff-tree_--root_-p_--full-index_initial

Comments

Junio C Hamano Aug. 9, 2020, 7:01 p.m. UTC | #1
Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:

> Đoàn Trần Công Danh (2):
>   revision: differentiate if --no-abbrev asked explicitly
>   diff: extend --abbrev support to diff-patch format

It was not clear, at least to me at all, what these patches are
trying to achieve (i.e. what end-users appreciate) until I saw the
code change X-<.

The changes to fill_metainfo() make sense to me.  It just needs log
messages that explain the intent better.  They do not even make it
clear that they want to make the abbreviation length of the object
names on the "index $from..$to $mode" lines configurable.

Thanks.
Jeff King Aug. 10, 2020, 10 a.m. UTC | #2
On Sun, Aug 09, 2020 at 12:01:35PM -0700, Junio C Hamano wrote:

> Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:
> 
> > Đoàn Trần Công Danh (2):
> >   revision: differentiate if --no-abbrev asked explicitly
> >   diff: extend --abbrev support to diff-patch format
> 
> It was not clear, at least to me at all, what these patches are
> trying to achieve (i.e. what end-users appreciate) until I saw the
> code change X-<.
> 
> The changes to fill_metainfo() make sense to me.  It just needs log
> messages that explain the intent better.  They do not even make it
> clear that they want to make the abbreviation length of the object
> names on the "index $from..$to $mode" lines configurable.

After reading the original including cover letter, I'm still confused
using why --full-index is not the solution for most cases. Perhaps that
would be worth touching on, as well.

-Peff
Đoàn Trần Công Danh Aug. 10, 2020, 12:31 p.m. UTC | #3
On 2020-08-10 06:00:38-0400, Jeff King <peff@peff.net> wrote:
> On Sun, Aug 09, 2020 at 12:01:35PM -0700, Junio C Hamano wrote:
> 
> > Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:
> > 
> > > Đoàn Trần Công Danh (2):
> > >   revision: differentiate if --no-abbrev asked explicitly
> > >   diff: extend --abbrev support to diff-patch format
> > 
> > It was not clear, at least to me at all, what these patches are
> > trying to achieve (i.e. what end-users appreciate) until I saw the
> > code change X-<.
> > 
> > The changes to fill_metainfo() make sense to me.  It just needs log
> > messages that explain the intent better.  They do not even make it
> > clear that they want to make the abbreviation length of the object
> > names on the "index $from..$to $mode" lines configurable.
> 
> After reading the original including cover letter, I'm still confused
> using why --full-index is not the solution for most cases. Perhaps that
> would be worth touching on, as well.

At that time, I'm not really sure what should be written there.
The commit message was inspired by --abbrev documentation.

Reading both of your two's emails, I think this one could be used:
I'll resend this serie if this serie is deemed useful with this
explaination.

	diff: index-line: make object name's abbrev length configurable

	There are some projects that want to archive and track only
	released version of other software projects. They also want
	to backport some changes into those versions unsupported by
	upstream. Most of git hosting services support some method to
	download patches without cloning the full (potentially large)
	repository and/or fiddling with git partial-clone or
	sparse-checkout.

	Most of those git hosting services generate those patches with
	git-format-patch(1) or something alike. Due to its large
	amount of objects, their object names' abbreviation in the
	index-line is usually long but not full.

	A lot of those patches couldn't be applied cleanly to old
	versions of said software, thus requires some changes from
	developer and they needs to be regenerated from their trimmed
	tree. Because the archive tree has significantly fewer
	objects, the abbreviation in the index line is usually shorter
	than the original patch. Thus, it generates some noise when
	said developers try to compare the new patch with the original
	patch if there's an exact file-hunk match.

	Make the object name's abbreviation length configurable to
	lower those noise.

	<Below is the  note in 2/2, I don't know if it's worth put
	into commit message>

	To preserve backward compatibility with old script that specify
	both --full-index and --abbrev, always shows full object id
	if --full-index is specified.
Junio C Hamano Aug. 10, 2020, 3:06 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> On Sun, Aug 09, 2020 at 12:01:35PM -0700, Junio C Hamano wrote:
>
>> Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:
>> 
>> > Đoàn Trần Công Danh (2):
>> >   revision: differentiate if --no-abbrev asked explicitly
>> >   diff: extend --abbrev support to diff-patch format
>> 
>> It was not clear, at least to me at all, what these patches are
>> trying to achieve (i.e. what end-users appreciate) until I saw the
>> code change X-<.
>> 
>> The changes to fill_metainfo() make sense to me.  It just needs log
>> messages that explain the intent better.  They do not even make it
>> clear that they want to make the abbreviation length of the object
>> names on the "index $from..$to $mode" lines configurable.
>
> After reading the original including cover letter, I'm still confused
> using why --full-index is not the solution for most cases. Perhaps that
> would be worth touching on, as well.

True.  

Presumably you could force some stability without sacrificing line
length limit by using --abbrev=12 instead of --full-index but I do
not think it is such a big deal.  But it does look odd that we use
a special/single-purpose option --full-index to control the length
only for those two object names on the "index" line, when all the
other object names we see are controlled with the --abbrev option.
Junio C Hamano Aug. 10, 2020, 3:15 p.m. UTC | #5
Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:

> Reading both of your two's emails, I think this one could be used:
> I'll resend this serie if this serie is deemed useful with this
> explaination.
>
> 	diff: index-line: make object name's abbrev length configurable
>
> 	There are some projects that want to archive and track only
> 	released version of other software projects. They also want
> 	to backport some changes into those versions unsupported by
> 	upstream. Most of git hosting services support some method to
> 	download patches without cloning the full (potentially large)
> 	repository and/or fiddling with git partial-clone or
> 	sparse-checkout.
>
> 	Most of those git hosting services generate those patches with
> 	git-format-patch(1) or something alike. Due to its large
> 	amount of objects, their object names' abbreviation in the
> 	index-line is usually long but not full.
>
> 	A lot of those patches couldn't be applied cleanly to old
> 	versions of said software, thus requires some changes from
> 	developer and they needs to be regenerated from their trimmed
> 	tree. Because the archive tree has significantly fewer
> 	objects, the abbreviation in the index line is usually shorter
> 	than the original patch. Thus, it generates some noise when
> 	said developers try to compare the new patch with the original
> 	patch if there's an exact file-hunk match.
>
> 	Make the object name's abbreviation length configurable to
> 	lower those noise.

I agree with Peff that with the above as the sole motivating use
case, the "--full-index" option is the right approach.  It is a much
more robust solution than "--abbrev=16 would be long enough for all
project participants to avoid length drift".  IOW these four
paragraphs do not argue _for_ this change, at least to me.

On the other hand, I think you could argue that "--full-index" is
merely a synonym for "--abbrev=40", and the patch fixes the
inconsistency between the object names on the "index" line, which
can choose only between the default abbrev length and the full
abbrev length, and all the other places we show object names, which
uniformly honor the "--abbrev" option.


> 	<Below is the  note in 2/2, I don't know if it's worth put
> 	into commit message>
>
> 	To preserve backward compatibility with old script that specify
> 	both --full-index and --abbrev, always shows full object id
> 	if --full-index is specified.
Jeff King Aug. 10, 2020, 3:27 p.m. UTC | #6
On Mon, Aug 10, 2020 at 08:15:41AM -0700, Junio C Hamano wrote:

> > 	A lot of those patches couldn't be applied cleanly to old
> > 	versions of said software, thus requires some changes from
> > 	developer and they needs to be regenerated from their trimmed
> > 	tree. Because the archive tree has significantly fewer
> > 	objects, the abbreviation in the index line is usually shorter
> > 	than the original patch. Thus, it generates some noise when
> > 	said developers try to compare the new patch with the original
> > 	patch if there's an exact file-hunk match.
> >
> > 	Make the object name's abbreviation length configurable to
> > 	lower those noise.
> 
> I agree with Peff that with the above as the sole motivating use
> case, the "--full-index" option is the right approach.  It is a much
> more robust solution than "--abbrev=16 would be long enough for all
> project participants to avoid length drift".  IOW these four
> paragraphs do not argue _for_ this change, at least to me.

Yeah, that's what I was getting at: if you care about robust
machine-readability, then the full index is the best solution. Reading
between the lines, I think the argument may be "using --full-index is
too long and therefore ugly, so people like the short-ish names but with
a bit of extra safety".

There's an extra challenge here, which is that you have to convince the
sender to use the extra --abbrev option, even though they themselves
won't be the ones running into the problem when applying. But I don't
think there's an elegant solution to that (we could just bump the
default abbrev everywhere to 12+, which is enough in practice).

Though I'm not 100% sure that "git apply" is smart enough to only look
at blobs (i.e., if "1234abcd" is ambiguous between a tree and a blob,
ignore the tree since patches always apply to blobs). That might be
another avenue that would make things more likely to work without
anybody having to configure anything.

> On the other hand, I think you could argue that "--full-index" is
> merely a synonym for "--abbrev=40", and the patch fixes the
> inconsistency between the object names on the "index" line, which
> can choose only between the default abbrev length and the full
> abbrev length, and all the other places we show object names, which
> uniformly honor the "--abbrev" option.

Yeah, I certainly don't mind the extra flexibility between "full" and
"default" for "index" lines. I do wonder if people want to configure the
abbreviations for those lines separately from other parts. I don't know
that I've ever particularly cared about that flexibility, but the fact
that they were set up separately all those years ago makes me think
somebody might.

-Peff
Đoàn Trần Công Danh Aug. 11, 2020, 12:33 a.m. UTC | #7
On 2020-08-10 11:27:05-0400, Jeff King <peff@peff.net> wrote:
> On Mon, Aug 10, 2020 at 08:15:41AM -0700, Junio C Hamano wrote:
> 
> > > 	A lot of those patches couldn't be applied cleanly to old
> > > 	versions of said software, thus requires some changes from
> > > 	developer and they needs to be regenerated from their trimmed
> > > 	tree. Because the archive tree has significantly fewer
> > > 	objects, the abbreviation in the index line is usually shorter
> > > 	than the original patch. Thus, it generates some noise when
> > > 	said developers try to compare the new patch with the original
> > > 	patch if there's an exact file-hunk match.
> > >
> > > 	Make the object name's abbreviation length configurable to
> > > 	lower those noise.
> > 
> > I agree with Peff that with the above as the sole motivating use
> > case, the "--full-index" option is the right approach.  It is a much
> > more robust solution than "--abbrev=16 would be long enough for all
> > project participants to avoid length drift".  IOW these four
> > paragraphs do not argue _for_ this change, at least to me.
> 
> Yeah, that's what I was getting at: if you care about robust
> machine-readability, then the full index is the best solution. Reading
> between the lines, I think the argument may be "using --full-index is
> too long and therefore ugly, so people like the short-ish names but with
> a bit of extra safety".

My argument was people can either easily fetch the patch via HTTP like:

	curl -LO https://github.com/git/git/commit/eb12adc74cf22add318f884072be2071d181abaa.patch

or take it from a mailing list archive, bugzilla, instead of
cloning a full repository. With those options, we can't say,
"we prefer full-index, please send us the patch with full-index
instead".

> 
> There's an extra challenge here, which is that you have to convince the
> sender to use the extra --abbrev option, even though they themselves
> won't be the ones running into the problem when applying.

Not really, since the sender tree is usually larger than the archived
tree, their abbrev is usually long enough, and the receiver will use
--abbrev to lengthen their abbrev to reduce the noise instead.

> But I don't
> think there's an elegant solution to that (we could just bump the
> default abbrev everywhere to 12+, which is enough in practice).
> Though I'm not 100% sure that "git apply" is smart enough to only look
> at blobs (i.e., if "1234abcd" is ambiguous between a tree and a blob,
> ignore the tree since patches always apply to blobs). That might be
> another avenue that would make things more likely to work without
> anybody having to configure anything.
> 
> > On the other hand, I think you could argue that "--full-index" is
> > merely a synonym for "--abbrev=40", and the patch fixes the
> > inconsistency between the object names on the "index" line, which
> > can choose only between the default abbrev length and the full
> > abbrev length, and all the other places we show object names, which
> > uniformly honor the "--abbrev" option.

I think this argument could be a way to go.
In fact, I always try to use --abbrev with diff family because I know
it works with a handful with other tools, (describe, blame), then
I surprise that it doesn't work, and the documentation tells me
`--abbrev` only works with diff-raw and diff-tree header line.

Then, I keep forgetting that documentation, and I tried again.

For now, I filtered out the index line before comparing 2 patches.

> Yeah, I certainly don't mind the extra flexibility between "full" and
> "default" for "index" lines. I do wonder if people want to configure the
> abbreviations for those lines separately from other parts. I don't know
> that I've ever particularly cared about that flexibility, but the fact
> that they were set up separately all those years ago makes me think
> somebody might.

I don't think people particularly care about the index line (and to
the extent, its length) that much, since the default is number is
actually a minimum number, if Git can't differentiate object with that
number of characters, Git will show a longer object names anyway.

I think most people scripts will put a regex for:

	/index [a-z0-9]{7,}\.\.[a-z0-9]{7,} [0-7]{6}/

Or even:

	/index [a-z0-9]+\.\.[a-z0-9]+ [0-7]+/

For the former case, we could change the code in 2/2 to set the minimum
default to DEFAULT_ABBREV instead of MINIMUM_ABBREV?

For the historical case that users put both --full-index and --abbrev
into there scripts, we still keep our promise to not break their
script by always respect --full-index, regardless of --abbrev.
Jeff King Aug. 11, 2020, 5:22 a.m. UTC | #8
On Tue, Aug 11, 2020 at 07:33:59AM +0700, Đoàn Trần Công Danh wrote:

> > Yeah, that's what I was getting at: if you care about robust
> > machine-readability, then the full index is the best solution. Reading
> > between the lines, I think the argument may be "using --full-index is
> > too long and therefore ugly, so people like the short-ish names but with
> > a bit of extra safety".
> 
> My argument was people can either easily fetch the patch via HTTP like:
> 
> 	curl -LO https://github.com/git/git/commit/eb12adc74cf22add318f884072be2071d181abaa.patch
> 
> or take it from a mailing list archive, bugzilla, instead of
> cloning a full repository. With those options, we can't say,
> "we prefer full-index, please send us the patch with full-index
> instead".

OK. But then how would they use "--abbrev" in that case? I.e., isn't it
too late at that point (especially in the mailing list archive case) to
do change anything in the formatting of the patch?

Maybe I'm confused...

> > There's an extra challenge here, which is that you have to convince the
> > sender to use the extra --abbrev option, even though they themselves
> > won't be the ones running into the problem when applying.
> 
> Not really, since the sender tree is usually larger than the archived
> tree, their abbrev is usually long enough, and the receiver will use
> --abbrev to lengthen their abbrev to reduce the noise instead.

Now I'm doubly confused. If the sender has the larger tree then they'll
have the larger abbrev. So what's the problem?

Going back to re-read your earlier responses...So...this _isn't_ a
problem within Git itself? It's only about people trying to compare
textual patches byte-for-byte and seeing different index lines?

If that's the case, then it seems to me that the byte comparison is the
problem here. If I have:

  index 1234abcd..5678bcde

and

  index 1234abcd87..5678bcde65

those should be considered equivalent to see if two patches are
plausibly the same. And I think tools like git-cherry, etc, would do
that (and we provide git-patch-id for that purpose, too).

> > Yeah, I certainly don't mind the extra flexibility between "full" and
> > "default" for "index" lines. I do wonder if people want to configure the
> > abbreviations for those lines separately from other parts. I don't know
> > that I've ever particularly cared about that flexibility, but the fact
> > that they were set up separately all those years ago makes me think
> > somebody might.
> 
> I don't think people particularly care about the index line (and to
> the extent, its length) that much, since the default is number is
> actually a minimum number, if Git can't differentiate object with that
> number of characters, Git will show a longer object names anyway.
> 
> I think most people scripts will put a regex for:
> 
> 	/index [a-z0-9]{7,}\.\.[a-z0-9]{7,} [0-7]{6}/
> 
> Or even:
> 
> 	/index [a-z0-9]+\.\.[a-z0-9]+ [0-7]+/
> 
> For the former case, we could change the code in 2/2 to set the minimum
> default to DEFAULT_ABBREV instead of MINIMUM_ABBREV?
> 
> For the historical case that users put both --full-index and --abbrev
> into there scripts, we still keep our promise to not break their
> script by always respect --full-index, regardless of --abbrev.

I care less about scripting (as you note, anything consuming abbreviated
objects has to handle longer-than-minimum names anyway), and was more
wondering whether anybody really cared that:

 git log --abbrev=30 -p

kept the short index lines (e.g., because they're easier to read). But
I'm having trouble coming up with a plausible reason somebody would want
long object names in earlier lines like "Merge:" but not in the patch
index lines. And already we respect --abbrev for --raw, so it's not like
the diff code isn't already affected. Making "-p" consistent with all
the rest of it is probably worth doing regardless.

-Peff
Đoàn Trần Công Danh Aug. 11, 2020, 12:07 p.m. UTC | #9
On 2020-08-11 01:22:26-0400, Jeff King <peff@peff.net> wrote:
> On Tue, Aug 11, 2020 at 07:33:59AM +0700, Đoàn Trần Công Danh wrote:
> 
> > > Yeah, that's what I was getting at: if you care about robust
> > > machine-readability, then the full index is the best solution. Reading
> > > between the lines, I think the argument may be "using --full-index is
> > > too long and therefore ugly, so people like the short-ish names but with
> > > a bit of extra safety".
> > 
> > My argument was people can either easily fetch the patch via HTTP like:
> > 
> > 	curl -LO https://github.com/git/git/commit/eb12adc74cf22add318f884072be2071d181abaa.patch
> > 
> > or take it from a mailing list archive, bugzilla, instead of
> > cloning a full repository. With those options, we can't say,
> > "we prefer full-index, please send us the patch with full-index
> > instead".
> 
> OK. But then how would they use "--abbrev" in that case? I.e., isn't it
> too late at that point (especially in the mailing list archive case) to
> do change anything in the formatting of the patch?
> 
> Maybe I'm confused...
> 
> > > There's an extra challenge here, which is that you have to convince the
> > > sender to use the extra --abbrev option, even though they themselves
> > > won't be the ones running into the problem when applying.
> > 
> > Not really, since the sender tree is usually larger than the archived
> > tree, their abbrev is usually long enough, and the receiver will use
> > --abbrev to lengthen their abbrev to reduce the noise instead.
> 
> Now I'm doubly confused. If the sender has the larger tree then they'll
> have the larger abbrev. So what's the problem?
> 
> Going back to re-read your earlier responses...So...this _isn't_ a
> problem within Git itself?

Correct. It's NOT Git's problems by any mean.

> It's only about people trying to compare
> textual patches byte-for-byte and seeing different index lines?

Yeah, it's about people trying to backport patch to old tree.
Fixing conflicts, and try to compare to old patch to see if they have
made any errors.

Because conflicts resolving is complicated.

> 
> If that's the case, then it seems to me that the byte comparison is the
> problem here. If I have:
> 
>   index 1234abcd..5678bcde
> 
> and
> 
>   index 1234abcd87..5678bcde65
> 
> those should be considered equivalent to see if two patches are
> plausibly the same. And I think tools like git-cherry, etc, would do
> that (and we provide git-patch-id for that purpose, too).

Yes, git-patch-id is very useful tool.

There're time that half of the patch can be applied cleanly with the
exact object names.
Another half needs to be fixed heavily, (maybe removed). git-cherry
and git-patch-id couldn't cope well in those situation.
That condition is true if there's a major change to source tree.

> > > Yeah, I certainly don't mind the extra flexibility between "full" and
> > > "default" for "index" lines. I do wonder if people want to configure the
> > > abbreviations for those lines separately from other parts. I don't know
> > > that I've ever particularly cared about that flexibility, but the fact
> > > that they were set up separately all those years ago makes me think
> > > somebody might.
> > 
> > I don't think people particularly care about the index line (and to
> > the extent, its length) that much, since the default is number is
> > actually a minimum number, if Git can't differentiate object with that
> > number of characters, Git will show a longer object names anyway.
> > 
> > I think most people scripts will put a regex for:
> > 
> > 	/index [a-z0-9]{7,}\.\.[a-z0-9]{7,} [0-7]{6}/
> > 
> > Or even:
> > 
> > 	/index [a-z0-9]+\.\.[a-z0-9]+ [0-7]+/
> > 
> > For the former case, we could change the code in 2/2 to set the minimum
> > default to DEFAULT_ABBREV instead of MINIMUM_ABBREV?
> > 
> > For the historical case that users put both --full-index and --abbrev
> > into there scripts, we still keep our promise to not break their
> > script by always respect --full-index, regardless of --abbrev.
> 
> I care less about scripting (as you note, anything consuming abbreviated
> objects has to handle longer-than-minimum names anyway), and was more
> wondering whether anybody really cared that:
> 
>  git log --abbrev=30 -p
> 
> kept the short index lines (e.g., because they're easier to read). But
> I'm having trouble coming up with a plausible reason somebody would want
> long object names in earlier lines like "Merge:" but not in the patch
> index lines. And already we respect --abbrev for --raw, so it's not like
> the diff code isn't already affected. Making "-p" consistent with all
> the rest of it is probably worth doing regardless.

Yes, I think this is the easier to accept argument.
I've gone with that in the resend.