Message ID | 20181102133050.10756-1-peartben@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v1] add: speed up cmd_add() by utilizing read_cache_preload() | expand |
Ben Peart <peartben@gmail.com> writes: > From: Ben Peart <benpeart@microsoft.com> > > During an "add", a call is made to run_diff_files() which calls > check_remove() for each index-entry. The preload_index() code > distributes some of the costs across multiple threads. Nice. I peeked around and noticed that we already do this in builtin_diff_index() before running run_diff_index() when !cached, and builtin_diff_files(), of course. > Because the files checked are restricted to pathspec, adding individual > files makes no measurable impact but on a Windows repo with ~200K files, > 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings. ;-) > diff --git a/builtin/add.c b/builtin/add.c > index ad49806ebf..f65c172299 100644 > --- a/builtin/add.c > +++ b/builtin/add.c > @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix) > return 0; > } > > - if (read_cache() < 0) > - die(_("index file corrupt")); > - > - die_in_unpopulated_submodule(&the_index, prefix); > - > /* > * Check the "pathspec '%s' did not match any files" block > * below before enabling new magic. It is not explained why this is not a mere s/read_cache/&_preload/ in the log message. I can see it is because you wanted to make the pathspec available to preload to further cut down the preloaded paths, and I do not think it has any unintended (negatie) side effect to parse the pathspec before populating the in-core index, but that would have been a good thing to mention in the proposed log message. > @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix) > PATHSPEC_SYMLINK_LEADING_PATH, > prefix, argv); > > + if (read_cache_preload(&pathspec) < 0) > + die(_("index file corrupt")); > + > + die_in_unpopulated_submodule(&the_index, prefix); > die_path_inside_submodule(&the_index, &pathspec); > > if (add_new_files) { > > base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2
On Fri, Nov 2, 2018 at 2:32 PM Ben Peart <peartben@gmail.com> wrote: > > From: Ben Peart <benpeart@microsoft.com> > > During an "add", a call is made to run_diff_files() which calls > check_remove() for each index-entry. The preload_index() code distributes > some of the costs across multiple threads. Instead of doing this site by site. How about we make read_cache() always do multithread preload? The only downside I see is preload may actually harm when there are too few cache entries (but more than 500), but this needs to be verified. If the penalty is small enough, I think we could live with it since everything is fast when you have few entries anyway. But if that's not true, we could add a threshold to activate preload. Something like "if you have 50k files or more, then activate preload" would do. I notice THREAD_COST in preload code, but I don't think it's the same thing. > > Because the files checked are restricted to pathspec, adding individual > files makes no measurable impact but on a Windows repo with ~200K files, > 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings. > > Signed-off-by: Ben Peart <benpeart@microsoft.com> > --- > > Notes: > Base Ref: master > Web-Diff: https://github.com/benpeart/git/commit/fc4830b545 > Checkout: git fetch https://github.com/benpeart/git add-preload-index-v1 && git checkout fc4830b545 > > builtin/add.c | 9 ++++----- > 1 file changed, 4 insertions(+), 5 deletions(-) > > diff --git a/builtin/add.c b/builtin/add.c > index ad49806ebf..f65c172299 100644 > --- a/builtin/add.c > +++ b/builtin/add.c > @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix) > return 0; > } > > - if (read_cache() < 0) > - die(_("index file corrupt")); > - > - die_in_unpopulated_submodule(&the_index, prefix); > - > /* > * Check the "pathspec '%s' did not match any files" block > * below before enabling new magic. > @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix) > PATHSPEC_SYMLINK_LEADING_PATH, > prefix, argv); > > + if (read_cache_preload(&pathspec) < 0) > + die(_("index file corrupt")); > + > + die_in_unpopulated_submodule(&the_index, prefix); > die_path_inside_submodule(&the_index, &pathspec); > > if (add_new_files) { > > base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2 > -- > 2.18.0.windows.1 >
On 11/2/2018 11:23 AM, Junio C Hamano wrote: > Ben Peart <peartben@gmail.com> writes: > >> From: Ben Peart <benpeart@microsoft.com> >> >> During an "add", a call is made to run_diff_files() which calls >> check_remove() for each index-entry. The preload_index() code >> distributes some of the costs across multiple threads. > > Nice. I peeked around and noticed that we already do this in > builtin_diff_index() before running run_diff_index() when !cached, > and builtin_diff_files(), of course. > Thanks for double checking! >> Because the files checked are restricted to pathspec, adding individual >> files makes no measurable impact but on a Windows repo with ~200K files, >> 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings. > > ;-) > >> diff --git a/builtin/add.c b/builtin/add.c >> index ad49806ebf..f65c172299 100644 >> --- a/builtin/add.c >> +++ b/builtin/add.c >> @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix) >> return 0; >> } >> >> - if (read_cache() < 0) >> - die(_("index file corrupt")); >> - >> - die_in_unpopulated_submodule(&the_index, prefix); >> - >> /* >> * Check the "pathspec '%s' did not match any files" block >> * below before enabling new magic. > > It is not explained why this is not a mere s/read_cache/&_preload/ > in the log message. I can see it is because you wanted to make the > pathspec available to preload to further cut down the preloaded > paths, and I do not think it has any unintended (negatie) side > effect to parse the pathspec before populating the in-core index, > but that would have been a good thing to mention in the proposed log > message. > That is correct. parse_pathspec() was after read_cache() because it _used_ to use the index to determine whether a pathspec is in a submodule. That was fixed by Brandon in Aug 2017 when he cleaned up all submodule config code so it is safe to move read_cache_preload() after the call to parse_pathspec(). How about this for a revised commit message? During an "add", a call is made to run_diff_files() which calls check_remove() for each index-entry. The preload_index() code distributes some of the costs across multiple threads. Limit read_cache_preload() to pathspec, so that it doesn't process more entries than are needed and end up slowing things down instead of speeding them up. Because the files checked are restricted to pathspec, adding individual files makes no measurable impact but on a Windows repo with ~200K files, 'git add .' drops from 6.3 seconds to 3.3 seconds for a 47% savings. >> @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix) >> PATHSPEC_SYMLINK_LEADING_PATH, >> prefix, argv); >> >> + if (read_cache_preload(&pathspec) < 0) >> + die(_("index file corrupt")); >> + >> + die_in_unpopulated_submodule(&the_index, prefix); >> die_path_inside_submodule(&the_index, &pathspec); >> >> if (add_new_files) { >> >> base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2
Duy Nguyen <pclouds@gmail.com> writes: > On Fri, Nov 2, 2018 at 2:32 PM Ben Peart <peartben@gmail.com> wrote: >> >> From: Ben Peart <benpeart@microsoft.com> >> >> During an "add", a call is made to run_diff_files() which calls >> check_remove() for each index-entry. The preload_index() code distributes >> some of the costs across multiple threads. > > Instead of doing this site by site. How about we make read_cache() > always do multithread preload? I suspect that it would be a huge performance killer. Many codepaths do not even want to know if the working tree files have been modified, even though they need to know what's in the index. Think "git commit-tree", "git diff --cached", etc.
On Sat, Nov 3, 2018 at 1:38 AM Junio C Hamano <gitster@pobox.com> wrote: > > Duy Nguyen <pclouds@gmail.com> writes: > > > On Fri, Nov 2, 2018 at 2:32 PM Ben Peart <peartben@gmail.com> wrote: > >> > >> From: Ben Peart <benpeart@microsoft.com> > >> > >> During an "add", a call is made to run_diff_files() which calls > >> check_remove() for each index-entry. The preload_index() code distributes > >> some of the costs across multiple threads. > > > > Instead of doing this site by site. How about we make read_cache() > > always do multithread preload? > > I suspect that it would be a huge performance killer. > > Many codepaths do not even want to know if the working tree files > have been modified, even though they need to know what's in the > index. Think "git commit-tree", "git diff --cached", etc. Ah. I keep forgetting read_cache_preload is loading the index _and_ refreshing. I thought the two had some different semantics but failed to see it last time.
diff --git a/builtin/add.c b/builtin/add.c index ad49806ebf..f65c172299 100644 --- a/builtin/add.c +++ b/builtin/add.c @@ -445,11 +445,6 @@ int cmd_add(int argc, const char **argv, const char *prefix) return 0; } - if (read_cache() < 0) - die(_("index file corrupt")); - - die_in_unpopulated_submodule(&the_index, prefix); - /* * Check the "pathspec '%s' did not match any files" block * below before enabling new magic. @@ -459,6 +454,10 @@ int cmd_add(int argc, const char **argv, const char *prefix) PATHSPEC_SYMLINK_LEADING_PATH, prefix, argv); + if (read_cache_preload(&pathspec) < 0) + die(_("index file corrupt")); + + die_in_unpopulated_submodule(&the_index, prefix); die_path_inside_submodule(&the_index, &pathspec); if (add_new_files) {