Message ID | 20200128213508.31661-1-jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] diff: only prefetch for certain output formats | expand |
On Tue, Jan 28, 2020 at 01:35:08PM -0800, Jonathan Tan wrote: > Running "git status" on a partial clone that was cloned with > "--no-checkout", like this: > > git clone --no-checkout --filter=blob:none <url> foo > git -C foo status > > results in an unnecessary lazy fetch. This is because of commit > 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08), which > optimized "diff" by prefetching, but inadvertently affected at least one > command ("status") that does not need prefetching. These 2 cases can be > distinguished by looking at output_format; the former uses > DIFF_FORMAT_PATCH and the latter uses DIFF_FORMAT_CALLBACK. Sometimes "status" does need the actual blobs, if it's going to do inexact rename detection. Likewise if break-detection is turned on, though I don't think anything does it by default, and there's (I don't think) any config option to enable it. So something like "git log --name-status -B -M" would probably regress from this (though only in speed, of course; I do think we can play a little loose with heuristics since we'd generate the correct answer either way). You could get pretty specific by putting logic inside diffcore_rename(), which would know if anything is left over after exact rename detection, but I suspect just checking: (options->break_opt != -1 || options->detect_rename) in diffcore_std() would be OK in practice. > (Note that other callers that use DIFF_FORMAT_CALLBACK will also lose > prefetching. I haven't investigated enough to see if this is a net > benefit or drawback, but I think this will need to be done on a > caller-by-caller basis and can be done in the future.) We can't know what the caller is going to do with a DIFF_FORMAT_CALLBACK, so I think it makes sense to avoid pre-fetching in that case. It might be nice, though, if that pre-fetch loop in diffcore_std() were pulled into a helper function, so they could just call: diffcore_prefetch(&options); or something. I'm OK to wait on that until one of the FORMAT_CALLBACK users needs it, though. > Points of discussion I can think of: > > - Is the whitelist of output_format constants the best? I think it could be pared down a bit. For example, --raw doesn't need the blobs (aside from renames, etc, above). I think the same is true of --summary. You've already omitted --name-status and --name-only, which makes sense. I think --dirstat, even though it is only showing per-file info, still relies on the line-level stat info. So it should stay. > - Should we just have the callers pass a prefetch boolean arg instead > of trying to guess it ourselves? I'm leaning towards no since I think > we should avoid options unless they are necessary. If we can avoid it, I think we should. It makes sense to me to try to tweak the heuristics as much as possible in this central code. If there's a case that does the wrong thing, then we can decide whether the heuristics can be improved, or if we need more information from the caller. -Peff
> Sometimes "status" does need the actual blobs, if it's going to do > inexact rename detection. Likewise if break-detection is turned on, > though I don't think anything does it by default, and there's (I don't > think) any config option to enable it. > > So something like "git log --name-status -B -M" would probably regress > from this (though only in speed, of course; I do think we can play a > little loose with heuristics since we'd generate the correct answer > either way). > > You could get pretty specific by putting logic inside diffcore_rename(), > which would know if anything is left over after exact rename detection, > but I suspect just checking: > > (options->break_opt != -1 || options->detect_rename) > > in diffcore_std() would be OK in practice. Thanks for taking a look at this patch and for the pointers, especially to rename detection. I investigated and found that in practice, with respect to rename detection, options->detect_rename is insufficient to determine exactly when we need to fetch; we need to fetch when (for example) a file is deleted and another added, but not when a file is merely changed, and these rules are not reflected in options->detect_rename. These rules indeed are in diffcore_rename(), as you mentioned, but putting logic inside diffcore_rename() (or copying the same logic over to diffcore_std()) complicates things for too little benefit, I think. To add to this, rename detection is turned on by default, so it wouldn't even fix the original issue with "status". So I'll abandon this patch, at least until someone finds a use case for diffing with no rename detection on a partial clone and would rather not have a prefetch. > > - Is the whitelist of output_format constants the best? > > I think it could be pared down a bit. For example, --raw doesn't > need the blobs (aside from renames, etc, above). I think the same is > true of --summary. You've already omitted --name-status and --name-only, > which makes sense. > > I think --dirstat, even though it is only showing per-file info, still > relies on the line-level stat info. So it should stay. Thanks for taking a look at this.
On Wed, Jan 29, 2020 at 05:39:00PM -0800, Jonathan Tan wrote: > > You could get pretty specific by putting logic inside diffcore_rename(), > > which would know if anything is left over after exact rename detection, > > but I suspect just checking: > > > > (options->break_opt != -1 || options->detect_rename) > > > > in diffcore_std() would be OK in practice. > > Thanks for taking a look at this patch and for the pointers, especially > to rename detection. I investigated and found that in practice, with > respect to rename detection, options->detect_rename is insufficient to > determine exactly when we need to fetch; we need to fetch when > (for example) a file is deleted and another added, but not when a file > is merely changed, and these rules are not reflected in > options->detect_rename. These rules indeed are in diffcore_rename(), as > you mentioned, but putting logic inside diffcore_rename() (or copying > the same logic over to diffcore_std()) complicates things for too little > benefit, I think. > > To add to this, rename detection is turned on by default, so it wouldn't > even fix the original issue with "status". > > So I'll abandon this patch, at least until someone finds a use case for > diffing with no rename detection on a partial clone and would rather not > have a prefetch. Ah, true, "options->detect_rename" would be overly broad. I actually don't think it would be that bad to put the logic in diffcore_rename(). If we wait until the right moment (after inexact renames have been resolved, and when we see if there are any candidates left), it should just be a matter of walking over the candidate lists. Something like this (it would need the add_if_missing() helper from diffcore_std()): diff --git a/diffcore-rename.c b/diffcore-rename.c index 531d7adeaf..d519ffcc45 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -458,6 +458,7 @@ void diffcore_rename(struct diff_options *options) int i, j, rename_count, skip_unmodified = 0; int num_create, dst_cnt; struct progress *progress = NULL; + struct oid_array to_fetch = OID_ARRAY_INIT; if (!minimum_score) minimum_score = DEFAULT_RENAME_SCORE; @@ -538,6 +539,25 @@ void diffcore_rename(struct diff_options *options) break; } + /* + * At this point we know there's actual work to do: we have rename + * destinations that didn't find an exact match, and we have potential + * sources. So we'll have to do inexact rename detection, which + * requires looking at the blobs. It's worth pre-fetching them as a + * group now. + */ + for (i = 0; i < rename_dst_nr; i++) { + if (rename_dst[i].pair) + continue; /* already found exact match */ + add_if_missing(options->repo, &to_fetch, rename_dst[i].two); + } + for (i = 0; i < rename_src_nr; i++) { + add_if_missing(options->repo, &to_fetch, rename_src[i].p->one); + } + if (to_fetch.nr) + promisor_remote_get_direct(options->repo, + to_fetch.oid, to_fetch.nr); + if (options->show_rename_progress) { progress = start_delayed_progress( _("Performing inexact rename detection"),
> Ah, true, "options->detect_rename" would be overly broad. > > I actually don't think it would be that bad to put the logic in > diffcore_rename(). If we wait until the right moment (after inexact > renames have been resolved, and when we see if there are any candidates > left), it should just be a matter of walking over the candidate lists. > > Something like this (it would need the add_if_missing() helper from > diffcore_std()): [snip] > @@ -538,6 +539,25 @@ void diffcore_rename(struct diff_options *options) > break; > } > > + /* > + * At this point we know there's actual work to do: we have rename > + * destinations that didn't find an exact match, and we have potential > + * sources. So we'll have to do inexact rename detection, which > + * requires looking at the blobs. It's worth pre-fetching them as a > + * group now. > + */ > + for (i = 0; i < rename_dst_nr; i++) { > + if (rename_dst[i].pair) > + continue; /* already found exact match */ > + add_if_missing(options->repo, &to_fetch, rename_dst[i].two); > + } > + for (i = 0; i < rename_src_nr; i++) { > + add_if_missing(options->repo, &to_fetch, rename_src[i].p->one); > + } > + if (to_fetch.nr) > + promisor_remote_get_direct(options->repo, > + to_fetch.oid, to_fetch.nr); > + > if (options->show_rename_progress) { > progress = start_delayed_progress( > _("Performing inexact rename detection"), And also the equivalent code in diffcore_break() and in diffcore_std() after both these functions are invoked (in case nothing got prefetched, but the diff still requires blobs).
On Thu, Jan 30, 2020 at 03:20:02PM -0800, Jonathan Tan wrote: > > + /* > > + * At this point we know there's actual work to do: we have rename > > + * destinations that didn't find an exact match, and we have potential > > + * sources. So we'll have to do inexact rename detection, which > > + * requires looking at the blobs. It's worth pre-fetching them as a > > + * group now. > > + */ > > + for (i = 0; i < rename_dst_nr; i++) { > [...] > > And also the equivalent code in diffcore_break() and in diffcore_std() > after both these functions are invoked (in case nothing got prefetched, > but the diff still requires blobs). I think diffcore_break() would probably be OK to just pre-fetch everything if it's enabled, since it has to look at the content of all modifications. Though I suppose _technically_ added/deleted entries do not get looked at, I doubt anybody would care in practice since the primary use is to then feed all of the pairs into the rename code. The diffcore_std() logic would be similar to what you wrote earlier based on theformats. I think you'd want it to come first, before diffcore_rename(), because it fetches a superset of refname (if it fetches anything at all). I.e., for "diff -M -p", you'd want: 1. diffcore_std() sees "-p" and fetches everything 2. diffcore_rename() sees there's nothing we don't already have rather than: 1. diffcore_rename() fetches a few blobs to do rename detection 2. diffcore_std() fetches a few more blobs that weren't rename candidates, but we need for "-p" -Peff
Jeff King <peff@peff.net> writes: > fetches anything at all). I.e., for "diff -M -p", you'd want: > > 1. diffcore_std() sees "-p" and fetches everything > > 2. diffcore_rename() sees there's nothing we don't already have > > rather than: > > 1. diffcore_rename() fetches a few blobs to do rename detection > > 2. diffcore_std() fetches a few more blobs that weren't rename > candidates, but we need for "-p" Hmph, a pure rename only change will cause no blobs transferred with the latter (because there is no content change for "-p" to report, and the rename detection for R100 paths would be done at the object name level), but all blobs in filepairs (before rename matches A/D up) with the former, no?
On Fri, Jan 31, 2020 at 10:08:36AM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > fetches anything at all). I.e., for "diff -M -p", you'd want: > > > > 1. diffcore_std() sees "-p" and fetches everything > > > > 2. diffcore_rename() sees there's nothing we don't already have > > > > rather than: > > > > 1. diffcore_rename() fetches a few blobs to do rename detection > > > > 2. diffcore_std() fetches a few more blobs that weren't rename > > candidates, but we need for "-p" > > Hmph, a pure rename only change will cause no blobs transferred with > the latter (because there is no content change for "-p" to report, > and the rename detection for R100 paths would be done at the object > name level), but all blobs in filepairs (before rename matches A/D > up) with the former, no? Hmm, good point that an exact rename won't show the blobs. I don't think there's a way that covers all cases in a single fetch at the granularity of the functions I listed above. But we could if we break it down a bit: 1. look for exact renames 2. queue for fetching any inexact rename candidates leftover 3. if we're going to show a diff that requires contents, then: 3a. if marked as an exact rename/copy, don't queue (I guess this is really checking p->one->oid versus p->two->oid) 3b. likewise, deletions with --irreversible-delete don't need queued 3c. otherwise, queue for fetch 4. fetch whatever was queued 5. proceed with inexact rename detection, then showing the diffs So that logic has to go in the middle of diffcore_rename(). And if we're not doing renames, then the logic from (3) needs to get triggered by diffcore_std(). So it probably needs to be hoisted out into a helper, and made idempotent (it already should be, since after the first prefetch we'd have all of the objects, but we might want a flag to avoid unnecessarily checking again). -Peff
diff --git a/diff.c b/diff.c index 8e2914c031..8263491783 100644 --- a/diff.c +++ b/diff.c @@ -6504,7 +6504,15 @@ static void add_if_missing(struct repository *r, void diffcore_std(struct diff_options *options) { - if (options->repo == the_repository && has_promisor_remote()) { + int output_formats_to_prefetch = DIFF_FORMAT_RAW | + DIFF_FORMAT_DIFFSTAT | + DIFF_FORMAT_NUMSTAT | + DIFF_FORMAT_SUMMARY | + DIFF_FORMAT_PATCH | + DIFF_FORMAT_SHORTSTAT | + DIFF_FORMAT_DIRSTAT; + if ((options->output_format & output_formats_to_prefetch) && + options->repo == the_repository && has_promisor_remote()) { /* * Prefetch the diff pairs that are about to be flushed. */ diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh index a3988bd4b8..d72631a3a0 100755 --- a/t/t0410-partial-clone.sh +++ b/t/t0410-partial-clone.sh @@ -567,6 +567,19 @@ test_expect_success 'do not fetch when checking existence of tree we construct o git -C repo cherry-pick side1 ' +test_expect_success 'do not fetch when running status against --no-checkout partial clone' ' + rm -rf server client trace && + test_create_repo server && + test_commit -C server one && + + test_config -C server uploadpack.allowfilter 1 && + git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client && + + test_config -C server uploadpack.allowanysha1inwant 1 && + GIT_TRACE_PACKET="$(pwd)/trace" git -C client status && + ! test_path_exists trace +' + . "$TEST_DIRECTORY"/lib-httpd.sh start_httpd
Running "git status" on a partial clone that was cloned with "--no-checkout", like this: git clone --no-checkout --filter=blob:none <url> foo git -C foo status results in an unnecessary lazy fetch. This is because of commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08), which optimized "diff" by prefetching, but inadvertently affected at least one command ("status") that does not need prefetching. These 2 cases can be distinguished by looking at output_format; the former uses DIFF_FORMAT_PATCH and the latter uses DIFF_FORMAT_CALLBACK. Therefore, restrict prefetching only to output_format values like the one used by "diff". (Note that other callers that use DIFF_FORMAT_CALLBACK will also lose prefetching. I haven't investigated enough to see if this is a net benefit or drawback, but I think this will need to be done on a caller-by-caller basis and can be done in the future.) Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- Points of discussion I can think of: - Is the whitelist of output_format constants the best? - Should we just have the callers pass a prefetch boolean arg instead of trying to guess it ourselves? I'm leaning towards no since I think we should avoid options unless they are necessary. Perhaps there are others. --- diff.c | 10 +++++++++- t/t0410-partial-clone.sh | 13 +++++++++++++ 2 files changed, 22 insertions(+), 1 deletion(-)