Message ID | 20200601202218.GA2763518@coredump.intra.peff.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | diff: discard blob data from stat-unmatched pairs | expand |
On Mon, Jun 1, 2020 at 4:22 PM Jeff King <peff@peff.net> wrote: > Subject: [PATCH] diff: discard blob data from stat-unmatched pairs > > When performing a tree-level diff against the working tree, we may find > that our index stat information is dirty, so we queue a filepair to be > examined later. If the actual content hasn't changed, we call this a > stat-unmatch; the stat information was out of date, but there's no > actual diff. Normally diffcore_std() would detect and remove these > identical filepairs via diffcore_skip_stat_unmatch(). However, when > "--quiet" is used, we want to stop the diff as soon as we see any > changes, so we check for stat-unmatches immediately in diff_change(). > > That check may require us to actually load the file contents into the > pair of diff_filespecs. If we find that the pair isn't a stat-unmatch, > then no big deal; we'd likely load the contents later anyway to generate > a patch, do rename detection, etc, so we want to hold on to it. But if > it a stat-unmatch, then we have no more use for that data; the whole s/it/& is/ > point is that we're going discard the pair. However, we never free the > allocated diff_filespec data. > > In most cases, keeping that data isn't a problem. We don't expect a lot > of stat-unmatch entries, and since we're using --quiet, we'd quit as > soon as we saw such a real change anyway. However, there are extreme > cases where it makes a big difference: > > 1. We'd generally mmap() the working tree half of the pair. And since > the OS may limit the total number of maps, we can run afoul of this > in large repositories. E.g.: > > $ cd linux > $ git ls-files | wc -l > 67959 > $ sysctl vm.max_map_count > vm.max_map_count = 65530 > $ git ls-files | xargs touch ;# everything is stat-dirty! > $ git diff --quiet > fatal: mmap failed: Cannot allocate memory > > It should be unusual to have so many files stat-dirty, but it's > possible if you've just run a script like "sed -i" or similar. > > After this patch, the above correctly exits with code 0. > > 2. Even if you don't hit mmap limits, the index half of the pair will > have been pulled from the object database into heap memory. Again > in a clone of linux.git, running: > > $ git ls-files | head -n 10000 | xargs touch > $ git diff --quiet > > peaks at 145MB heap before this patch, and 94MB after. > > This patch solves the problem by freeing any diff_filespec data we > picked up during the "--quiet" stat-unmatch check in diff_changes. > Nobody is going to need that data later, so there's no point holding on > to it. There are a few things to note: > > - we could skip queueing the pair entirely, which could in theory save > a little work. But there's not much to save, as we need a > diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway. > And since we cache the result of the stat-unmatch checks, a later > call to diffcore_skip_stat_unmatch() call will quickly skip over > them. The diffcore code also counts up the number of stat-unmatched > pairs as it removes them. It's doubtful any callers would care about > that in combination with --quiet, but we'd have to reimplement the > logic here to be on the safe side. So it's not really worth the > trouble. > > - I didn't write a test, because we always produce the correct output > unless we run up against system mmap limits, which are both > unportable and expensive to test against. Measuring peak heap > would be interesting, but our perf suite isn't yet capable of that. > > - note that diff without "--quiet" does not suffer from the same > problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch > entries and drop them immediately, so we're not carrying their data > around. > > - you _can_ still trigger the mmap limit problem if you truly have > that many files with actual changes. But it's rather unlikely. The > stat-unmatch check avoids loading the file contents if the sizes > don't match, so you'd need a pretty trivial change in every single > file. Likewise, inexact rename detection might load the data for > many files all at once. But you'd need not just 64k changes, but > that many deletions and additions. The most likely candidate is > perhaps break-detection, which would load the data for all pairs and > keep it around for the content-level diff. But again, you'd need 64k > actually changed files in the first place. > > So it's still possible to trigger this case, but it seems like "I > accidentally made all my files stat-dirty" is the most likely case > in the real world. > > Reported-by: Jan Christoph Uhde <Jan@UhdeJc.com> > Signed-off-by: Jeff King <peff@peff.net>
Jeff King <peff@peff.net> writes: > Subject: [PATCH] diff: discard blob data from stat-unmatched pairs > > When performing a tree-level diff against the working tree, we may find > that our index stat information is dirty, so we queue a filepair to be > examined later. If the actual content hasn't changed, we call this a > stat-unmatch; the stat information was out of date, but there's no > actual diff. Normally diffcore_std() would detect and remove these > identical filepairs via diffcore_skip_stat_unmatch(). However, when > "--quiet" is used, we want to stop the diff as soon as we see any > changes, so we check for stat-unmatches immediately in diff_change(). > > That check may require us to actually load the file contents into the > pair of diff_filespecs. If we find that the pair isn't a stat-unmatch, > then no big deal; we'd likely load the contents later anyway to generate > a patch, do rename detection, etc, so we want to hold on to it. But if > it a stat-unmatch, then we have no more use for that data; the whole > point is that we're going discard the pair. However, we never free the > allocated diff_filespec data. Nicely spotted. So we can discard when quiet is in effect after this check, which makes sense. After reading the initial analysis, I wondered if the fix we did in f34b205f (diff: do not quit early on stat-dirty files, 2014-01-25) was suboptimal and we should have instead done the "if QUICK, check if the pair is merely stat-unmatch" in the loop(s) that call diff_change(), hoping that it may have given us a better control over the lifetime of the filespecs in each diff_filepair, but I do not think it would made much difference. > if (options->flags.quick && options->skip_stat_unmatch && > - !diff_filespec_check_stat_unmatch(options->repo, p)) > + !diff_filespec_check_stat_unmatch(options->repo, p)) { > + diff_free_filespec_data(p->one); > + diff_free_filespec_data(p->two); > return; > + } Thanks. Will queue (with that s/it/& is/ typofix mentioned elsewhere).
diff --git a/diff.c b/diff.c index d1ad6a3c4a..4d46fab5d3 100644 --- a/diff.c +++ b/diff.c @@ -6758,8 +6758,11 @@ void diff_change(struct diff_options *options, return; if (options->flags.quick && options->skip_stat_unmatch && - !diff_filespec_check_stat_unmatch(options->repo, p)) + !diff_filespec_check_stat_unmatch(options->repo, p)) { + diff_free_filespec_data(p->one); + diff_free_filespec_data(p->two); return; + } options->flags.has_changes = 1; }