Message ID | 20181017164021.15204-2-peartben@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v1,1/2] reset: don't compute unstaged changes after reset when --quiet | expand |
On Wed, Oct 17, 2018 at 12:40 PM Ben Peart <peartben@gmail.com> wrote: > When git reset is run with the --quiet flag, don't bother finding any > additional unstaged changes as they won't be output anyway. This speeds up > the git reset command by avoiding having to lstat() every file looking for > changes that aren't going to be reported anyway. > > Signed-off-by: Ben Peart <benpeart@microsoft.com> > --- > diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt > @@ -95,7 +95,9 @@ OPTIONS > --quiet:: > - Be quiet, only report errors. > + Be quiet, only report errors. Can optimize the performance of reset > + by avoiding scaning all files in the repo looking for additional > + unstaged changes. s/scaning/scanning/ However, I'm not convinced that this should be documented here or at least in this fashion. When I read this new documentation before reading the commit message, I was baffled by what it was trying to say since --quiet'ness is a superficial quality, not an optimizer. My knee-jerk reaction is that it doesn't belong in end-user documentation at all since it's an implementation detail, however, I can see that such knowledge could be handy for people in situations which would be helped by this. That said, if you do document it, this doesn't feel like the correct place to do so; it should be in a "Discussion" section or something. (Who would expect to find --quiet documentation talking about optimizations? Likely, nobody.)
On Wed, Oct 17, 2018 at 02:14:32PM -0400, Eric Sunshine wrote: > > diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt > > @@ -95,7 +95,9 @@ OPTIONS > > --quiet:: > > - Be quiet, only report errors. > > + Be quiet, only report errors. Can optimize the performance of reset > > + by avoiding scaning all files in the repo looking for additional > > + unstaged changes. > > s/scaning/scanning/ > > However, I'm not convinced that this should be documented here or at > least in this fashion. When I read this new documentation before > reading the commit message, I was baffled by what it was trying to say > since --quiet'ness is a superficial quality, not an optimizer. My > knee-jerk reaction is that it doesn't belong in end-user documentation > at all since it's an implementation detail, however, I can see that > such knowledge could be handy for people in situations which would be > helped by this. That said, if you do document it, this doesn't feel > like the correct place to do so; it should be in a "Discussion" > section or something. (Who would expect to find --quiet documentation > talking about optimizations? Likely, nobody.) Yeah, I had the same thought. You'd probably choose --quiet because you want it, you know, quiet. Whereas for the new config variable, you'd probably set it not because you want it quiet all the time, but because you want to get some time savings. So there it does make sense to me to explain. Other than that, this seems like an obvious and easy win. It does feel a little hacky (you're really losing something in the output, and ideally we'd just be able to give that answer quickly), but this may be OK as a hack in the interim. The sad thing is just that it's user-facing, so we have to respect it forever. I almost wonder if there should be a global core.optimizeMessages or something that tries to tradeoff less information for speed in all commands, but makes no promises about which. Then a user with a big repo who sets it once will get the benefit as more areas are identified (I think "status" already has a similar case with ahead/behind)? And vice versa, as some messages get faster to produce, they can be dropped from that option. I dunno. Maybe that is a stupid idea, and people really do want to control it on a per-message basis. -Peff
Jeff King <peff@peff.net> writes: > Whereas for the new config variable, you'd probably set it not because > you want it quiet all the time, but because you want to get some time > savings. So there it does make sense to me to explain. > > Other than that, this seems like an obvious and easy win. It does feel a > little hacky (you're really losing something in the output, and ideally > we'd just be able to give that answer quickly), but this may be OK as a > hack in the interim. After "git reset --quiet -- this/area/" with this change, any operation you'd do next that needs to learn if working tree files are different from what is recorded in the index outside that area will have to spend more cycles, because the refresh done by "reset" is now limited to the area. So if your final goal is "make 'reset' as fast as possible", this is an obvious and easy win. For other goals, i.e. "make the overall experience of using Git feel faster", it is not so obvious to me, though. If we somehow know that it is much less important in your setup that the cached stat bits in the index is kept up to date (e.g. perhaps you are more heavily relying on fsmonitor and are happy with it), then I suspect that we could even skip the refreshing altogether and gain more performance, without sacrificing the "overall experience of using Git" at all, which would be even better. > The sad thing is just that it's user-facing, so we have to respect it > forever. I almost wonder if there should be a global core.optimizeMessages > or something that tries to tradeoff less information for speed in all > commands, but makes no promises about which. Then a user with a big repo > who sets it once will get the benefit as more areas are identified (I > think "status" already has a similar case with ahead/behind)? And vice > versa, as some messages get faster to produce, they can be dropped from > that option. > > I dunno. Maybe that is a stupid idea, and people really do want to > control it on a per-message basis. > > -Peff
On Thu, Oct 18, 2018 at 12:40:48PM +0900, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Whereas for the new config variable, you'd probably set it not because > > you want it quiet all the time, but because you want to get some time > > savings. So there it does make sense to me to explain. > > > > Other than that, this seems like an obvious and easy win. It does feel a > > little hacky (you're really losing something in the output, and ideally > > we'd just be able to give that answer quickly), but this may be OK as a > > hack in the interim. > > After "git reset --quiet -- this/area/" with this change, any > operation you'd do next that needs to learn if working tree files > are different from what is recorded in the index outside that area > will have to spend more cycles, because the refresh done by "reset" > is now limited to the area. So if your final goal is "make 'reset' > as fast as possible", this is an obvious and easy win. For other > goals, i.e. "make the overall experience of using Git feel faster", > it is not so obvious to me, though. > > If we somehow know that it is much less important in your setup that > the cached stat bits in the index is kept up to date (e.g. perhaps > you are more heavily relying on fsmonitor and are happy with it), > then I suspect that we could even skip the refreshing altogether and > gain more performance, without sacrificing the "overall experience > of using Git" at all, which would be even better. Yeah, I assumed that Ben was using fsmonitor. I agree if we can just use that to make this output faster, that would be the ideal. This is the "later the message would get faster to produce" I hinted at in my earlier message. So I think we are in agreement. It just isn't clear to me how much work it would take to get to the "ideal". If it's long enough, then this kind of hackery may be useful in the meantime. -Peff
On 10/18/2018 2:36 AM, Jeff King wrote: > On Thu, Oct 18, 2018 at 12:40:48PM +0900, Junio C Hamano wrote: > >> Jeff King <peff@peff.net> writes: >> >>> Whereas for the new config variable, you'd probably set it not because >>> you want it quiet all the time, but because you want to get some time >>> savings. So there it does make sense to me to explain. >>> >>> Other than that, this seems like an obvious and easy win. It does feel a >>> little hacky (you're really losing something in the output, and ideally >>> we'd just be able to give that answer quickly), but this may be OK as a >>> hack in the interim. >> >> After "git reset --quiet -- this/area/" with this change, any >> operation you'd do next that needs to learn if working tree files >> are different from what is recorded in the index outside that area >> will have to spend more cycles, because the refresh done by "reset" >> is now limited to the area. So if your final goal is "make 'reset' >> as fast as possible", this is an obvious and easy win. For other >> goals, i.e. "make the overall experience of using Git feel faster", >> it is not so obvious to me, though. The final goal is to make git faster (especially on larger repos) and this proposal accomplishes that. Let's look at why that is. By scoping down (or eliminating) what refresh_index() has to lstat() at the end of the reset command, clearly the reset command is faster. Yes, the index isn't as "fresh" because not everything was updated but that doesn't typically impact the performance of subsequent commands. On the next command, git still has to lstat() every file because it isn't sure what changes could have happened in the file system. As a result, the overall impact is that we have had to lstat() every file one fewer times between the two commands. A net win overall. In addition, the preload_index() code that does the lstat() command is highly optimized across multiple threads (and on Windows takes advantage of the fscache). This means that it can lstat() every file _much_ faster than the single threaded loop in refresh_index(). This also makes the overall performance of the pair of git commands faster as we got rid of the slow lstat() loop and kept the fast one. Here are some numbers to demonstrate that. These are hot cache numbers as they are easier to generate. Cold cache numbers make the net perf win significantly better as the cost for the reset jumps from 2.43 seconds to 7.16 seconds. 0.32 git add asdf 0.31 git -c reset.quiet=true reset asdf 1.34 git status 1.97 Total 0.32 git add asdf 2.43 git -c reset.quiet=false reset asdf 1.32 git status 4.07 Total Note the status command after the reset doesn't really change as it still must lstat() every file (the 0.02 difference is well within the variability of run to run differences). FWIW, none of these numbers are using fsmonitor. There was additional discussion about whether this should be tied to the "--quiet" option and how it should be documented. One option would be to change the default behavior of reset so that it doesn't do the refresh_index() call at all. This speeds up reset by default so there are no user discoverability issues but changes the default behavior which is an issue. Another option that was suggested was to add a separate flag that could be passed to reset so that the "quiet" and "fast" options don't get conflated. I don't care for that option because the two options (at this point and for the foreseeable future) would be identical in behavior from the end users perspective. It was also suggested that there be a single "fast and quiet" option for all of git instead of separate options for each command. I worry about that because now we're forcing users to choose between the "fast and quiet" version of git and the "slow and noisy" version. How do we help them decide which they want? That seems difficult to explain so that they can make a rational choice and also hard to discover. I also have to wonder who would say "give me the slow and noisy version please." :) I'd prefer we systematically move towards a model where the default values that are chosen for various settings throughout the code are all configurable via settings. All defaults by necessity make certain assumptions about user preference, data shape, machine performance, etc and if those assumptions don't match the user's environment then the hard coded defaults aren't appropriate. We do our best but its going to be hit or miss. A consistent way to be able to change those defaults would be very useful in those circumstances. To be clear, I'm not proposing we do a wholesale update of our preferences model at this point in time - that seems like a significant undertaking and I don't want to tie this specific optimization to a potential change in how default settings work. To move this forward, here is what I propose: 1) If the '--quiet' flag is passed, we silently take advantage of the fact we can avoid having to do an "extra" lstat() of every file and scope the refresh_index() call to those paths that we know have changed. 2) I can remove the note in the documentation of --quiet which I only added to facilitate discoverability. 3) I can also edit the documentation for reset.quietDefault (maybe I should rename that to "reset.quiet"?) so that it does not discuss the potential performance impact. 4) To improve the discoverability of the enhanced performance, I could add logic similar to what exists for "status --uno" and if refresh_index() takes > x seconds, prompt the user with something like: "It took %.2f seconds to enumerate unstaged changes after reset. 'reset --quiet' may speed it up. Set the config setting reset.quiet to true to make this the default." >> >> If we somehow know that it is much less important in your setup that >> the cached stat bits in the index is kept up to date (e.g. perhaps >> you are more heavily relying on fsmonitor and are happy with it), >> then I suspect that we could even skip the refreshing altogether and >> gain more performance, without sacrificing the "overall experience >> of using Git" at all, which would be even better. > > Yeah, I assumed that Ben was using fsmonitor. I agree if we can just use > that to make this output faster, that would be the ideal. This is the > "later the message would get faster to produce" I hinted at in my > earlier message. > > So I think we are in agreement. It just isn't clear to me how much work > it would take to get to the "ideal". If it's long enough, then this kind > of hackery may be useful in the meantime. > I actually started my effort to speed up reset by attempting to multi-thread refresh_index(). You can see a work in progress at: https://github.com/benpeart/git/pull/new/refresh-index-multithread-gvfs The patch doesn't always work as it is still not thread safe. When it works, it's great but I ran into to many difficulties trying to debug the remaining threading issues (even adding print statements would change the timing and the repro would disappear). It will take a lot of code review to discover and fix the remaining non-thread safe code paths. In addition, the optimized code path that takes advantage of fsmonitor, uses multiple threads, fscache, etc _already exists_ in preload_index(). Trying to recreate all those optimizations in refresh_index() is (as I discovered) a daunting task. This patch was tiny/trivial in comparison and provided all the performance benefits so seems like a much better option at this point in time. For now, I suggest we just use that existing path as it provides the benefits without the significant additional work and complexity. > -Peff >
On Thu, Oct 18, 2018 at 8:18 PM Ben Peart <peartben@gmail.com> wrote: > I actually started my effort to speed up reset by attempting to > multi-thread refresh_index(). You can see a work in progress at: > > https://github.com/benpeart/git/pull/new/refresh-index-multithread-gvfs > > The patch doesn't always work as it is still not thread safe. When it > works, it's great but I ran into to many difficulties trying to debug > the remaining threading issues (even adding print statements would > change the timing and the repro would disappear). It will take a lot of > code review to discover and fix the remaining non-thread safe code paths. > > In addition, the optimized code path that takes advantage of fsmonitor, > uses multiple threads, fscache, etc _already exists_ in preload_index(). > Trying to recreate all those optimizations in refresh_index() is (as I > discovered) a daunting task. Why not make refresh_index() run preload_index() first (or the parallel lstat part to be precise), and only do the heavy content-based refresh in single thread mode?
On 10/18/2018 2:26 PM, Duy Nguyen wrote: > On Thu, Oct 18, 2018 at 8:18 PM Ben Peart <peartben@gmail.com> wrote: >> I actually started my effort to speed up reset by attempting to >> multi-thread refresh_index(). You can see a work in progress at: >> >> https://github.com/benpeart/git/pull/new/refresh-index-multithread-gvfs >> >> The patch doesn't always work as it is still not thread safe. When it >> works, it's great but I ran into to many difficulties trying to debug >> the remaining threading issues (even adding print statements would >> change the timing and the repro would disappear). It will take a lot of >> code review to discover and fix the remaining non-thread safe code paths. >> >> In addition, the optimized code path that takes advantage of fsmonitor, >> uses multiple threads, fscache, etc _already exists_ in preload_index(). >> Trying to recreate all those optimizations in refresh_index() is (as I >> discovered) a daunting task. > > Why not make refresh_index() run preload_index() first (or the > parallel lstat part to be precise), and only do the heavy > content-based refresh in single thread mode? > Head smack! Why didn't I think of that? That is a terrific suggestion. Calling preload_index() right before the big for loop in refresh_index() is a trivial and effective way to do the bulk of the updating with the optimized code. After doing that, most of the cache entries can bail out quickly down in refresh_cache_ent() when it tests ce_uptodate(ce). Here are the numbers using that optimization (hot cache, averaged across 3 runs): 0.32 git add asdf 1.67 git reset asdf 1.68 git status 3.67 Total vs without it: 0.32 git add asdf 2.48 git reset asdf 1.50 git status 4.30 Total For a savings in the reset command of 32% and 15% overall. Clearly doing the refresh_index() faster is not as much savings as not doing it at all. Given how simple this patch is, I think it makes sense to do both so that we have optimized each path to is fullest.
Ben Peart <peartben@gmail.com> writes: > Note the status command after the reset doesn't really change as it > still must lstat() every file (the 0.02 difference is well within the > variability of run to run differences). Of course, it would not make an iota of difference, whether reset refreshes the cached stat index fully, to the cost of later lstat(). What the refreshing saves is having to scan the contents to find that the file is unchanged at runtime. If your lstat() is not significantly faster than opening and scanning the file, the optimization based on the cached-stat information becomes moot. In a working tree full of unmodified files, stale cached-stat info in the index will cause us to compare the contents and waste a lot of time, and that is what refreshing avoids. If the "status" in your test sequence do not have to do that (e.g. the cached-stat information is already up-to-date and there is no point running refresh in reset), then I'd expect no difference between these two tests. > To move this forward, here is what I propose: > > 1) If the '--quiet' flag is passed, we silently take advantage of the > fact we can avoid having to do an "extra" lstat() of every file and > scope the refresh_index() call to those paths that we know have > changed. That's pretty much what the patch under discussion does. > 2) I can remove the note in the documentation of --quiet which I only > added to facilitate discoverability. Quite honestly, I am not sure if this (meaning #1 above) alone need to be even discoverable. Those who want --quiet output would use it, those who want to be told which paths are modified would not, and those who want to quickly be told which paths are modified would not be helped by the limited refresh anyway, so "with --quiet you can make it go faster" would not help anybody. > 3) I can also edit the documentation for reset.quietDefault (maybe I > should rename that to "reset.quiet"?) so that it does not discuss the > potential performance impact. I think reset.quiet (or reset.verbosity) is a good thing to have regardless.
diff --git a/Documentation/git-reset.txt b/Documentation/git-reset.txt index 1d697d9962..8610309b55 100644 --- a/Documentation/git-reset.txt +++ b/Documentation/git-reset.txt @@ -95,7 +95,9 @@ OPTIONS -q:: --quiet:: - Be quiet, only report errors. + Be quiet, only report errors. Can optimize the performance of reset + by avoiding scaning all files in the repo looking for additional + unstaged changes. EXAMPLES diff --git a/builtin/reset.c b/builtin/reset.c index 11cd0dcb8c..0822798616 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -376,7 +376,7 @@ int cmd_reset(int argc, const char **argv, const char *prefix) if (read_from_tree(&pathspec, &oid, intent_to_add)) return 1; if (get_git_work_tree()) - refresh_index(&the_index, flags, NULL, NULL, + refresh_index(&the_index, flags, quiet ? &pathspec : NULL, NULL, _("Unstaged changes after reset:")); } else { int err = reset_index(&oid, reset_type, quiet);