[RFC] diff: only prefetch for certain output formats

Message ID	20200128213508.31661-1-jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=aR3Y=3R=vger.kernel.org=git-owner@kernel.org> Date: Tue, 28 Jan 2020 13:35:08 -0800 Message-Id: <20200128213508.31661-1-jonathantanmy@google.com> Mime-Version: 1.0 Subject: [RFC PATCH] diff: only prefetch for certain output formats From: Jonathan Tan <jonathantanmy@google.com> To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	[RFC] diff: only prefetch for certain output formats \| expand [RFC] diff: only prefetch for certain output formats

Jonathan Tan Jan. 28, 2020, 9:35 p.m. UTC

Running "git status" on a partial clone that was cloned with
"--no-checkout", like this:

  git clone --no-checkout --filter=blob:none <url> foo
  git -C foo status

results in an unnecessary lazy fetch. This is because of commit
7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08), which
optimized "diff" by prefetching, but inadvertently affected at least one
command ("status") that does not need prefetching. These 2 cases can be
distinguished by looking at output_format; the former uses
DIFF_FORMAT_PATCH and the latter uses DIFF_FORMAT_CALLBACK.

Therefore, restrict prefetching only to output_format values like the
one used by "diff".

(Note that other callers that use DIFF_FORMAT_CALLBACK will also lose
prefetching. I haven't investigated enough to see if this is a net
benefit or drawback, but I think this will need to be done on a
caller-by-caller basis and can be done in the future.)

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Points of discussion I can think of:

 - Is the whitelist of output_format constants the best?
 - Should we just have the callers pass a prefetch boolean arg instead
   of trying to guess it ourselves? I'm leaning towards no since I think
   we should avoid options unless they are necessary.

Perhaps there are others.
---
 diff.c                   | 10 +++++++++-
 t/t0410-partial-clone.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

Jeff King Jan. 29, 2020, 5:09 a.m. UTC | #1

On Tue, Jan 28, 2020 at 01:35:08PM -0800, Jonathan Tan wrote:

> Running "git status" on a partial clone that was cloned with
> "--no-checkout", like this:
> 
>   git clone --no-checkout --filter=blob:none <url> foo
>   git -C foo status
> 
> results in an unnecessary lazy fetch. This is because of commit
> 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08), which
> optimized "diff" by prefetching, but inadvertently affected at least one
> command ("status") that does not need prefetching. These 2 cases can be
> distinguished by looking at output_format; the former uses
> DIFF_FORMAT_PATCH and the latter uses DIFF_FORMAT_CALLBACK.

Sometimes "status" does need the actual blobs, if it's going to do
inexact rename detection. Likewise if break-detection is turned on,
though I don't think anything does it by default, and there's (I don't
think) any config option to enable it.

So something like "git log --name-status -B -M" would probably regress
from this (though only in speed, of course; I do think we can play a
little loose with heuristics since we'd generate the correct answer
either way).

You could get pretty specific by putting logic inside diffcore_rename(),
which would know if anything is left over after exact rename detection,
but I suspect just checking:

  (options->break_opt != -1 || options->detect_rename)

in diffcore_std() would be OK in practice.

> (Note that other callers that use DIFF_FORMAT_CALLBACK will also lose
> prefetching. I haven't investigated enough to see if this is a net
> benefit or drawback, but I think this will need to be done on a
> caller-by-caller basis and can be done in the future.)

We can't know what the caller is going to do with a
DIFF_FORMAT_CALLBACK, so I think it makes sense to avoid pre-fetching in
that case. It might be nice, though, if that pre-fetch loop in
diffcore_std() were pulled into a helper function, so they could just
call:

  diffcore_prefetch(&options);

or something. I'm OK to wait on that until one of the FORMAT_CALLBACK
users needs it, though.

> Points of discussion I can think of:
> 
>  - Is the whitelist of output_format constants the best?

I think it could be pared down a bit. For example, --raw doesn't
need the blobs (aside from renames, etc, above). I think the same is
true of --summary. You've already omitted --name-status and --name-only,
which makes sense.

I think --dirstat, even though it is only showing per-file info, still
relies on the line-level stat info. So it should stay.

>  - Should we just have the callers pass a prefetch boolean arg instead
>    of trying to guess it ourselves? I'm leaning towards no since I think
>    we should avoid options unless they are necessary.

If we can avoid it, I think we should. It makes sense to me to try to
tweak the heuristics as much as possible in this central code. If
there's a case that does the wrong thing, then we can decide whether the
heuristics can be improved, or if we need more information from the
caller.

-Peff

Jonathan Tan Jan. 30, 2020, 1:39 a.m. UTC | #2

> Sometimes "status" does need the actual blobs, if it's going to do
> inexact rename detection. Likewise if break-detection is turned on,
> though I don't think anything does it by default, and there's (I don't
> think) any config option to enable it.
> 
> So something like "git log --name-status -B -M" would probably regress
> from this (though only in speed, of course; I do think we can play a
> little loose with heuristics since we'd generate the correct answer
> either way).
> 
> You could get pretty specific by putting logic inside diffcore_rename(),
> which would know if anything is left over after exact rename detection,
> but I suspect just checking:
> 
>   (options->break_opt != -1 || options->detect_rename)
> 
> in diffcore_std() would be OK in practice.

Thanks for taking a look at this patch and for the pointers, especially
to rename detection. I investigated and found that in practice, with
respect to rename detection, options->detect_rename is insufficient to
determine exactly when we need to fetch; we need to fetch when
(for example) a file is deleted and another added, but not when a file
is merely changed, and these rules are not reflected in
options->detect_rename. These rules indeed are in diffcore_rename(), as
you mentioned, but putting logic inside diffcore_rename() (or copying
the same logic over to diffcore_std()) complicates things for too little
benefit, I think.

To add to this, rename detection is turned on by default, so it wouldn't
even fix the original issue with "status".

So I'll abandon this patch, at least until someone finds a use case for
diffing with no rename detection on a partial clone and would rather not
have a prefetch.

> >  - Is the whitelist of output_format constants the best?
> 
> I think it could be pared down a bit. For example, --raw doesn't
> need the blobs (aside from renames, etc, above). I think the same is
> true of --summary. You've already omitted --name-status and --name-only,
> which makes sense.
> 
> I think --dirstat, even though it is only showing per-file info, still
> relies on the line-level stat info. So it should stay.

Thanks for taking a look at this.

Jeff King Jan. 30, 2020, 5:51 a.m. UTC | #3

On Wed, Jan 29, 2020 at 05:39:00PM -0800, Jonathan Tan wrote:

> > You could get pretty specific by putting logic inside diffcore_rename(),
> > which would know if anything is left over after exact rename detection,
> > but I suspect just checking:
> > 
> >   (options->break_opt != -1 || options->detect_rename)
> > 
> > in diffcore_std() would be OK in practice.
> 
> Thanks for taking a look at this patch and for the pointers, especially
> to rename detection. I investigated and found that in practice, with
> respect to rename detection, options->detect_rename is insufficient to
> determine exactly when we need to fetch; we need to fetch when
> (for example) a file is deleted and another added, but not when a file
> is merely changed, and these rules are not reflected in
> options->detect_rename. These rules indeed are in diffcore_rename(), as
> you mentioned, but putting logic inside diffcore_rename() (or copying
> the same logic over to diffcore_std()) complicates things for too little
> benefit, I think.
> 
> To add to this, rename detection is turned on by default, so it wouldn't
> even fix the original issue with "status".
> 
> So I'll abandon this patch, at least until someone finds a use case for
> diffing with no rename detection on a partial clone and would rather not
> have a prefetch.

Ah, true, "options->detect_rename" would be overly broad.

I actually don't think it would be that bad to put the logic in
diffcore_rename(). If we wait until the right moment (after inexact
renames have been resolved, and when we see if there are any candidates
left), it should just be a matter of walking over the candidate lists.

Something like this (it would need the add_if_missing() helper from
diffcore_std()):

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 531d7adeaf..d519ffcc45 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -458,6 +458,7 @@ void diffcore_rename(struct diff_options *options)
 	int i, j, rename_count, skip_unmodified = 0;
 	int num_create, dst_cnt;
 	struct progress *progress = NULL;
+	struct oid_array to_fetch = OID_ARRAY_INIT;
 
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -538,6 +539,25 @@ void diffcore_rename(struct diff_options *options)
 		break;
 	}
 
+	/*
+	 * At this point we know there's actual work to do: we have rename
+	 * destinations that didn't find an exact match, and we have potential
+	 * sources. So we'll have to do inexact rename detection, which
+	 * requires looking at the blobs. It's worth pre-fetching them as a
+	 * group now.
+	 */
+	for (i = 0; i < rename_dst_nr; i++) {
+		if (rename_dst[i].pair)
+			continue; /* already found exact match */
+		add_if_missing(options->repo, &to_fetch, rename_dst[i].two);
+	}
+	for (i = 0; i < rename_src_nr; i++) {
+		add_if_missing(options->repo, &to_fetch, rename_src[i].p->one);
+	}
+	if (to_fetch.nr)
+		promisor_remote_get_direct(options->repo,
+					   to_fetch.oid, to_fetch.nr);
+
 	if (options->show_rename_progress) {
 		progress = start_delayed_progress(
 				_("Performing inexact rename detection"),

Jonathan Tan Jan. 30, 2020, 11:20 p.m. UTC | #4

> Ah, true, "options->detect_rename" would be overly broad.
> 
> I actually don't think it would be that bad to put the logic in
> diffcore_rename(). If we wait until the right moment (after inexact
> renames have been resolved, and when we see if there are any candidates
> left), it should just be a matter of walking over the candidate lists.
> 
> Something like this (it would need the add_if_missing() helper from
> diffcore_std()):

[snip]

> @@ -538,6 +539,25 @@ void diffcore_rename(struct diff_options *options)
>  		break;
>  	}
>  
> +	/*
> +	 * At this point we know there's actual work to do: we have rename
> +	 * destinations that didn't find an exact match, and we have potential
> +	 * sources. So we'll have to do inexact rename detection, which
> +	 * requires looking at the blobs. It's worth pre-fetching them as a
> +	 * group now.
> +	 */
> +	for (i = 0; i < rename_dst_nr; i++) {
> +		if (rename_dst[i].pair)
> +			continue; /* already found exact match */
> +		add_if_missing(options->repo, &to_fetch, rename_dst[i].two);
> +	}
> +	for (i = 0; i < rename_src_nr; i++) {
> +		add_if_missing(options->repo, &to_fetch, rename_src[i].p->one);
> +	}
> +	if (to_fetch.nr)
> +		promisor_remote_get_direct(options->repo,
> +					   to_fetch.oid, to_fetch.nr);
> +
>  	if (options->show_rename_progress) {
>  		progress = start_delayed_progress(
>  				_("Performing inexact rename detection"),

And also the equivalent code in diffcore_break() and in diffcore_std()
after both these functions are invoked (in case nothing got prefetched,
but the diff still requires blobs).

Jeff King Jan. 31, 2020, 12:14 a.m. UTC | #5

On Thu, Jan 30, 2020 at 03:20:02PM -0800, Jonathan Tan wrote:

> > +	/*
> > +	 * At this point we know there's actual work to do: we have rename
> > +	 * destinations that didn't find an exact match, and we have potential
> > +	 * sources. So we'll have to do inexact rename detection, which
> > +	 * requires looking at the blobs. It's worth pre-fetching them as a
> > +	 * group now.
> > +	 */
> > +	for (i = 0; i < rename_dst_nr; i++) {
> [...]
> 
> And also the equivalent code in diffcore_break() and in diffcore_std()
> after both these functions are invoked (in case nothing got prefetched,
> but the diff still requires blobs).

I think diffcore_break() would probably be OK to just pre-fetch
everything if it's enabled, since it has to look at the content of all
modifications. Though I suppose _technically_ added/deleted entries do
not get looked at, I doubt anybody would care in practice since the
primary use is to then feed all of the pairs into the rename code.

The diffcore_std() logic would be similar to what you wrote earlier
based on theformats. I think you'd want it to come first, before
diffcore_rename(), because it fetches a superset of refname (if it
fetches anything at all). I.e., for "diff -M -p", you'd want:

  1. diffcore_std() sees "-p" and fetches everything

  2. diffcore_rename() sees there's nothing we don't already have

rather than:

  1. diffcore_rename() fetches a few blobs to do rename detection

  2. diffcore_std() fetches a few more blobs that weren't rename
     candidates, but we need for "-p"

-Peff

Junio C Hamano Jan. 31, 2020, 6:08 p.m. UTC | #6

Jeff King <peff@peff.net> writes:

> fetches anything at all). I.e., for "diff -M -p", you'd want:
>
>   1. diffcore_std() sees "-p" and fetches everything
>
>   2. diffcore_rename() sees there's nothing we don't already have
>
> rather than:
>
>   1. diffcore_rename() fetches a few blobs to do rename detection
>
>   2. diffcore_std() fetches a few more blobs that weren't rename
>      candidates, but we need for "-p"

Hmph, a pure rename only change will cause no blobs transferred with
the latter (because there is no content change for "-p" to report,
and the rename detection for R100 paths would be done at the object
name level), but all blobs in filepairs (before rename matches A/D
up) with the former, no?

Jeff King Feb. 1, 2020, 11:29 a.m. UTC | #7

On Fri, Jan 31, 2020 at 10:08:36AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > fetches anything at all). I.e., for "diff -M -p", you'd want:
> >
> >   1. diffcore_std() sees "-p" and fetches everything
> >
> >   2. diffcore_rename() sees there's nothing we don't already have
> >
> > rather than:
> >
> >   1. diffcore_rename() fetches a few blobs to do rename detection
> >
> >   2. diffcore_std() fetches a few more blobs that weren't rename
> >      candidates, but we need for "-p"
> 
> Hmph, a pure rename only change will cause no blobs transferred with
> the latter (because there is no content change for "-p" to report,
> and the rename detection for R100 paths would be done at the object
> name level), but all blobs in filepairs (before rename matches A/D
> up) with the former, no?

Hmm, good point that an exact rename won't show the blobs. I don't think
there's a way that covers all cases in a single fetch at the granularity
of the functions I listed above. But we could if we break it down a bit:

  1. look for exact renames

  2. queue for fetching any inexact rename candidates leftover

  3. if we're going to show a diff that requires contents, then:

     3a. if marked as an exact rename/copy, don't queue (I guess this is
         really checking p->one->oid versus p->two->oid)

     3b. likewise, deletions with --irreversible-delete don't need queued

     3c. otherwise, queue for fetch

  4. fetch whatever was queued

  5. proceed with inexact rename detection, then showing the diffs

So that logic has to go in the middle of diffcore_rename(). And if we're
not doing renames, then the logic from (3) needs to get triggered by
diffcore_std(). So it probably needs to be hoisted out into a helper,
and made idempotent (it already should be, since after the first
prefetch we'd have all of the objects, but we might want a flag to avoid
unnecessarily checking again).

-Peff

[RFC] diff: only prefetch for certain output formats

Commit Message

Comments

Patch