Message ID: 87in2hgzin.fsf@evledraar.gmail.com (mailing list archive)
State: New, archived
Series: [RFC] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
> I don't have time to polish this up for submission now, but here's a WIP
> patch that implements this, highlights:
>
> * There's a gc.clone.autoDetach=false default setting which overrides
>   gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>   --cloning option to indicate this).

I'll repeat that it could make sense to do the same thing on clone
_and_ fetch. Perhaps a "--post-fetch" flag would be good here to
communicate that we just downloaded a pack from a remote.

> * A clone of say git.git with gc.writeCommitGraph=true looks like:
>
>       [...]
>       Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
>       Resolving deltas: 100% (188947/188947), done.
>       Computing commit graph generation numbers: 100% (55210/55210), done.

This looks like good UX. Thanks for the progress here!

> * The 'git gc --auto' command also knows to (only) run the commit-graph
>   (and space is left for future optimization steps) if general GC isn't
>   needed, but we need "optimization":
>
>       $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
>       Annotating commits in commit graph: 341229, done.
>       Computing commit graph generation numbers: 100% (165969/165969), done.
>       $

Will this also trigger a full commit-graph rewrite on every 'git
commit' command? Or is there some way we can compute the staleness of
the commit-graph in order to only update if we get too far ahead?
Previously, this was solved by relying on the auto-GC threshold.

> * The patch to gc.c looks less scary with -w, most of it is indenting
>   the existing pack-refs etc. with a "!auto_gc || should_gc" condition.
>
> * I added a commit_graph_exists() function and only care if I
>   get ENOENT for the purposes of this gc mode. This would need to be
>   tweaked for the incremental mode Derrick talks about, but if we just
>   set "should_optimize" that'll also work as far as gc --auto is
>   concerned (e.g. on fetch, am etc.)

The incremental mode would operate the same as split-index, which
means we will still look for .git/objects/info/commit-graph. That file
may point us to more files.

> +int commit_graph_exists(const char *graph_file)
> +{
> +	struct stat st;
> +	if (stat(graph_file, &st)) {
> +		if (errno == ENOENT)
> +			return 0;
> +		else
> +			return -1;
> +	}
> +	return 1;
> +}
> +

This method serves a very similar purpose to
generation_numbers_enabled(), except your method only cares about the
file existing. It ignores information like `core.commitGraph`, which
should keep us from doing anything with the commit-graph file if false.

Nothing about your method is specific to the commit-graph file, since
you provide a filename as a parameter. It could easily be "int
file_exists(const char *filename)".

Thanks,
-Stolee
On Fri, Oct 05 2018, Derrick Stolee wrote:

> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>> [...]
>> * There's a gc.clone.autoDetach=false default setting which overrides
>>   gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>>   --cloning option to indicate this).
>
> I'll repeat that it could make sense to do the same thing on clone
> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
> communicate that we just downloaded a pack from a remote.

I don't think that makes sense, but let's talk about why, because maybe
I've missed something; you're certainly more familiar with the
commit-graph than I am.

The reason to do it on clone as a special case, or when the file is
missing, is because we know the file is desired (via the GC config), and
presumably is expected to help performance, and we have 0% of it. So by
going from 0% to 100% on clone we'll get fast --contains and other
goodies the graph helps with.

But when we're doing a fetch, or really anything else that runs "git gc
--auto", we can safely assume that we have a recent enough graph,
because it will have been run whenever auto-gc kicked in. I.e.:

    # Slow, if we assume background forked commit-graph generation
    # (which I'm avoiding)
    git clone x && cd x && git tag --contains
    # Fast enough, since we have an existing commit-graph
    cd x && git fetch && git tag --contains

I *do* think it might make sense to in general split off parts of "gc
--auto" that we'd like to be more aggressive about, simply because the
ratio of how long it takes to do and how much it helps with performance
makes more sense than a full repack, which is what the current heuristic
is based on.

And maybe when we run in that mode we should run in the foreground, but
I don't see why git-fetch should be a special case there, and in this
regard the gc.clone.autoDetach=false setting I've made doesn't make much
sense. I.e. maybe we should also skip forking to the background in such
a mode when we trigger such a "mini gc" via git-commit or whatever.

>> * A clone of say git.git with gc.writeCommitGraph=true looks like:
>> [...]
>
> This looks like good UX. Thanks for the progress here!
>
>> * The 'git gc --auto' command also knows to (only) run the commit-graph
>>   (and space is left for future optimization steps) if general GC isn't
>>   needed, but we need "optimization":
>> [...]
>
> Will this also trigger a full commit-graph rewrite on every 'git
> commit' command?

Nope, because "git commit" can safely be assumed to have some
commit-graph anyway, and I'm just special-casing the case where it
doesn't exist. But if it doesn't exist and you do a "git commit", then
"gc --auto" will be run, and we'll fork to the background and generate
it...

> Or is there some way we can compute the staleness of
> the commit-graph in order to only update if we get too far ahead?
> Previously, this was solved by relying on the auto-GC threshold.

So re the "I don't think that makes sense..." at the start of my
E-Mail, isn't it fine to rely on the default thresholds here, or should
we be more aggressive?

>> * The patch to gc.c looks less scary with -w, most of it is indenting
>>   the existing pack-refs etc. with a "!auto_gc || should_gc" condition.
>>
>> * I added a commit_graph_exists() function and only care if I
>>   get ENOENT for the purposes of this gc mode. [...]
>
> The incremental mode would operate the same as split-index, which
> means we will still look for .git/objects/info/commit-graph. That file
> may point us to more files.

Ah!

>> +int commit_graph_exists(const char *graph_file)
>> +{
>> +	struct stat st;
>> +	if (stat(graph_file, &st)) {
>> +		if (errno == ENOENT)
>> +			return 0;
>> +		else
>> +			return -1;
>> +	}
>> +	return 1;
>> +}
>> +
>
> This method serves a very similar purpose to
> generation_numbers_enabled(), except your method only cares about the
> file existing. It ignores information like `core.commitGraph`, which
> should keep us from doing anything with the commit-graph file if
> false.
>
> Nothing about your method is specific to the commit-graph file, since
> you provide a filename as a parameter. It could easily be "int
> file_exists(const char *filename)".

I was being paranoid about not doing this if it didn't exist but it was
something else than ENOENT (e.g. a permission error?), but in retrospect
that's silly. I'll drop this helper and just use file_exists().
On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Oct 05 2018, Derrick Stolee wrote:
>> I'll repeat that it could make sense to do the same thing on clone
>> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
>> communicate that we just downloaded a pack from a remote.
>
> I don't think that makes sense, but let's talk about why, because maybe
> I've missed something; you're certainly more familiar with the
> commit-graph than I am.
>
> The reason to do it on clone as a special case, or when the file is
> missing, is because we know the file is desired (via the GC config), and
> presumably is expected to help performance, and we have 0% of it. So by
> going from 0% to 100% on clone we'll get fast --contains and other
> goodies the graph helps with.
>
> But when we're doing a fetch, or really anything else that runs "git gc
> --auto", we can safely assume that we have a recent enough graph,
> because it will have been run whenever auto-gc kicked in.
>
> [...]
>
> And maybe when we run in that mode we should run in the foreground, but
> I don't see why git-fetch should be a special case there, and in this
> regard the gc.clone.autoDetach=false setting I've made doesn't make much
> sense. I.e. maybe we should also skip forking to the background in such
> a mode when we trigger such a "mini gc" via git-commit or whatever.

My misunderstanding was that your proposed change to gc computes the
commit-graph in either of these two cases:

 (1) The auto-GC threshold is met.

 (2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

 (3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that
we hint that (3) holds by saying "--post-fetch" (i.e. "We just
downloaded a pack, and it probably contains a lot of new commits"), or
we could create some more complicated condition based on counting
reachable commits with infinite generation number (the number of
commits not in the commit-graph file).

I like that you are moving forward to make the commit-graph be written
more frequently, but I'm trying to push us in a direction of writing it
even more often than your proposed strategy. We should avoid creating
too many orthogonal conditions that trigger the commit-graph write,
which is why I'm pushing on your design here.

Anyone else have thoughts on this direction?

Thanks,
-Stolee
On Fri, Oct 05 2018, Derrick Stolee wrote:

> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
>
> (1) The auto-GC threshold is met.
>
> (2) There is no commit-graph file.
>
> And what I hope to have instead of (2) is (3):
>
> (3) The commit-graph file is "sufficiently behind" the tip refs.
>
> This condition is intentionally vague at the moment. It could be that
> we hint that (3) holds by saying "--post-fetch" (i.e. "We just
> downloaded a pack, and it probably contains a lot of new commits"), or
> we could create some more complicated condition based on counting
> reachable commits with infinite generation number (the number of
> commits not in the commit-graph file).
>
> I like that you are moving forward to make the commit-graph be written
> more frequently, but I'm trying to push us in a direction of writing it
> even more often than your proposed strategy. We should avoid creating
> too many orthogonal conditions that trigger the commit-graph write,
> which is why I'm pushing on your design here.
>
> Anyone else have thoughts on this direction?

Ah, I see. I think #3 makes perfect sense, but it probably makes sense
to do it as a follow-up, or maybe you'd like to stick a patch on top of
the series I have when I send it. I don't know how to write the "I'm
not quite happy about the commit graph" code :)

What I will do is refactor gc.c a bit and leave it in a state where
it's going to be really easy to change the existing "we have no commit
graph, and thus should do the optimization step" logic to some more
complex condition, i.e. your "we just grabbed a lot of data".

Also, I'll drop the gc.clone.autoDetach=false setting and name it
something more general, maybe gc.AutoDetachOnBigOptimization=false?
Anyway, something more generic, so that "clone" will always pass in
some option saying "expect a large % commit graph update" (100% in its
case), and then in "fetch" we could have some detection of how big what
we just got from the server is, and do the same.

This seems to me to be the most general thing that would make sense,
and could also be extended e.g. to "git commit" and other users of gc
--auto. Say I started with a README file in an empty repo, and then
made a commit where I added 1 million files all in one commit; in that
case we'd (depending on that setting) also block in the foreground and
generate the commit-graph.
On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:

> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
>
> (1) The auto-GC threshold is met.
>
> (2) There is no commit-graph file.
>
> And what I hope to have instead of (2) is (3):
>
> (3) The commit-graph file is "sufficiently behind" the tip refs.
>
> [...]
>
> Anyone else have thoughts on this direction?

Yes, I think measuring "sufficiently behind" is the right thing.
Everything else is a proxy or heuristic, and will run into corner
cases. E.g., I have some small number of objects and then do a huge
fetch, and now my commit-graph only covers 5% of what's available.

We know how many objects are in the graph already. And it's not too
expensive to get the number of objects in the repository. We can do the
same sampling for loose objects that "gc --auto" does, and counting
packed objects just involves opening up the .idx files (that can be
slow if you have a ton of packs, but you'd want to either repack or use
a .midx in that case anyway, either of which would help here).

So can we really just take (total_objects - commit_graph_objects) and
compare it to some threshold?

-Peff
On 10/5/2018 3:21 PM, Jeff King wrote:
> On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:
>
>> [...]
>>
>> Anyone else have thoughts on this direction?
>
> Yes, I think measuring "sufficiently behind" is the right thing.
> Everything else is a proxy or heuristic, and will run into corner
> cases. E.g., I have some small number of objects and then do a huge
> fetch, and now my commit-graph only covers 5% of what's available.
>
> We know how many objects are in the graph already. And it's not too
> expensive to get the number of objects in the repository. We can do the
> same sampling for loose objects that "gc --auto" does, and counting
> packed objects just involves opening up the .idx files (that can be
> slow if you have a ton of packs, but you'd want to either repack or use
> a .midx in that case anyway, either of which would help here).
>
> So can we really just take (total_objects - commit_graph_objects) and
> compare it to some threshold?

The commit-graph only stores the number of _commits_, not total
objects.

Azure Repos' commit-graph does store the total number of objects, and
that is how we trigger updating the graph, so it is not unreasonable to
use that as a heuristic.

Thanks,
-Stolee
On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:

>> So can we really just take (total_objects - commit_graph_objects) and
>> compare it to some threshold?
>
> The commit-graph only stores the number of _commits_, not total
> objects.

Oh, right, of course. That does throw a monkey wrench in that line of
thought. ;)

There's unfortunately not a fast way of doing that. One option would be
to keep a counter of "ungraphed commit objects", and have callers
update it. Anybody admitting a pack via index-pack or unpack-objects
can easily get this information. Commands like fast-import can do
likewise, and "git commit" obviously increments it by one.

I'm not excited about adding a new global on-disk data structure (and
the accompanying lock).

-Peff
On 10/5/2018 3:47 PM, Jeff King wrote:
> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
>>> So can we really just take (total_objects - commit_graph_objects) and
>>> compare it to some threshold?
>>
>> The commit-graph only stores the number of _commits_, not total
>> objects.
>
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers
> update it. Anybody admitting a pack via index-pack or unpack-objects
> can easily get this information. Commands like fast-import can do
> likewise, and "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

If we want, then we can add an optional chunk to the commit-graph file
that stores the object count.
On Fri, Oct 05 2018, Jeff King wrote:

> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
>>> So can we really just take (total_objects - commit_graph_objects) and
>>> compare it to some threshold?
>>
>> The commit-graph only stores the number of _commits_, not total
>> objects.
>
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers
> update it. Anybody admitting a pack via index-pack or unpack-objects
> can easily get this information. Commands like fast-import can do
> likewise, and "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

You don't really need a new global data structure to solve this
problem. It would be sufficient to have git-gc itself write out a
4-line text file after it runs, saying how many tags, commits, trees
and blobs it found on its last run.

You can then fuzzily compare object counts vs. commit counts for the
purposes of deciding whether something like the commit-graph needs to
be updated, while assuming that whatever new data you have has similar
enough ratios of those as your existing data.

That's an assumption that'll hold well enough for big repos, where this
matters the most, and which tend to grow in fairly uniform ways as far
as their object type ratios go.

Databases like MySQL, PostgreSQL etc. pull similar tricks with their
fuzzy table statistics.
On Fri, Oct 05, 2018 at 04:00:12PM -0400, Derrick Stolee wrote:

> On 10/5/2018 3:47 PM, Jeff King wrote:
>> [...]
>>
>> I'm not excited about adding a new global on-disk data structure (and
>> the accompanying lock).
>
> If we want, then we can add an optional chunk to the commit-graph file
> that stores the object count.

Yeah, that's probably a saner route, since we have to do the write then
anyway.

-Peff
On Fri, Oct 05, 2018 at 10:01:31PM +0200, Ævar Arnfjörð Bjarmason wrote:

>> There's unfortunately not a fast way of doing that. One option would be
>> to keep a counter of "ungraphed commit objects", and have callers
>> update it. [...]
>>
>> I'm not excited about adding a new global on-disk data structure (and
>> the accompanying lock).
>
> You don't really need a new global data structure to solve this
> problem. It would be sufficient to have git-gc itself write out a
> 4-line text file after it runs, saying how many tags, commits, trees
> and blobs it found on its last run.
>
> You can then fuzzily compare object counts vs. commit counts for the
> purposes of deciding whether something like the commit-graph needs to
> be updated, while assuming that whatever new data you have has similar
> enough ratios of those as your existing data.

I think this is basically the same thing as Stolee's suggestion to keep
the total object count in the commit-graph file. The only difference
here is that we know the actual ratio of commits to blobs for this
particular repository. But I don't think we need to know that.

As you said, this is fuzzy anyway, so a single number for "update the
graph when there are N new objects" is likely enough.

If you had a repository with an unusually large tree, you'd end up
rebuilding the graph more often. But I think it would probably be OK,
as we're primarily trying not to waste time doing a graph rebuild when
we've only done a small amount of other work. But if we just shoved a
ton of objects through index-pack, then we did a lot of work, whether
those were commit objects or not.

-Peff
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1546833213..5759fbb067 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1621,7 +1621,19 @@ gc.autoPackLimit::
 
 gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
-	if the system supports it. Default is true.
+	if the system supports it. Default is true. Overridden by
+	`gc.clone.autoDetach` when running linkgit:git-clone[1].
+
+gc.clone.autoDetach::
+	Make `git gc --auto` return immediately and run in background
+	if the system supports it when run via
+	linkgit:git-clone[1]. Default is false.
++
+The reason this defaults to false is because the only time we'll have
+work to do after a 'git clone' is if something like
+`gc.writeCommitGraph` is true, in that case we'd like to compute the
+optimized file before returning, so that say commands that benefit
+from commit graph aren't slow until it's generated in the background.
 
 gc.bigPackThreshold::
 	If non-zero, all packs larger than this limit are kept when
diff --git a/builtin/clone.c b/builtin/clone.c
index 15b142d646..824c130ba5 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -897,6 +897,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	struct remote *remote;
 	int err = 0, complete_refs_before_fetch = 1;
 	int submodule_progress;
+	const char *argv_gc_auto[] = {"gc", "--auto", "--cloning", NULL};
+	const char *argv_gc_auto_quiet[] = {"gc", "--auto", "--cloning", "--quiet", NULL};
 
 	struct refspec rs = REFSPEC_INIT_FETCH;
 	struct argv_array ref_prefixes = ARGV_ARRAY_INIT;
@@ -1245,5 +1247,11 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	refspec_clear(&rs);
 	argv_array_clear(&ref_prefixes);
+
+	if (0 <= option_verbosity)
+		run_command_v_opt_cd_env(argv_gc_auto, RUN_GIT_CMD, git_dir, NULL);
+	else
+		run_command_v_opt_cd_env(argv_gc_auto_quiet, RUN_GIT_CMD, git_dir, NULL);
+
 	return err;
 }
diff --git a/builtin/gc.c b/builtin/gc.c
index 6591ddbe83..27be03890a 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -43,6 +43,7 @@ static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int gc_write_commit_graph;
 static int detach_auto = 1;
+static int detach_clone_auto = 0;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
@@ -133,6 +134,7 @@ static void gc_config(void)
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.writecommitgraph", &gc_write_commit_graph);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.clone.autodetach", &detach_clone_auto);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -157,9 +159,6 @@ static int too_many_loose_objects(void)
 	int num_loose = 0;
 	int needed = 0;
 
-	if (gc_auto_threshold <= 0)
-		return 0;
-
 	dir = opendir(git_path("objects/17"));
 	if (!dir)
 		return 0;
@@ -369,10 +368,21 @@ static int need_to_gc(void)
 		return 0;
 
 	if (run_hook_le(NULL, "pre-auto-gc", NULL))
-		return 0;
+		return -1;
 	return 1;
 }
 
+static int need_to_optimize(void)
+{
+	if (gc_write_commit_graph) {
+		char *obj_dir = get_object_directory();
+		char *graph_name = get_commit_graph_filename(obj_dir);
+
+		if (commit_graph_exists(graph_name) == 0) /* ENOENT */
+			return 1;
+	}
+	return 0;
+}
+
 /* return NULL on success, else hostname running the gc */
 static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 {
@@ -491,6 +501,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int aggressive = 0;
 	int auto_gc = 0;
+	int cloning = 0;
 	int quiet = 0;
 	int force = 0;
 	const char *name;
@@ -498,6 +509,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	int daemonized = 0;
 	int keep_base_pack = -1;
 	timestamp_t dummy;
+	int should_gc;
+	int should_optimize;
 
 	struct option builtin_gc_options[] = {
 		OPT__QUIET(&quiet, N_("suppress progress reporting")),
@@ -507,6 +520,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
+		OPT_BOOL_F(0, "cloning", &cloning, N_("enable cloning mode"),
+			   PARSE_OPT_NOCOMPLETE),
 		OPT_BOOL_F(0, "force", &force,
 			   N_("force running gc even if there may be another gc running"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -555,22 +570,27 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		/*
 		 * Auto-gc should be least intrusive as possible.
 		 */
-		if (!need_to_gc())
+		should_gc = need_to_gc();
+		if (should_gc == -1)
+			return 0;
+		should_optimize = need_to_optimize();
+		if (!should_gc && !should_optimize)
 			return 0;
 
-		if (!quiet) {
+		if (!quiet && should_gc) {
 			if (detach_auto)
 				fprintf(stderr, _("Auto packing the repository in background for optimum performance.\n"));
 			else
 				fprintf(stderr, _("Auto packing the repository for optimum performance.\n"));
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}
-		if (detach_auto) {
+		if (detach_auto &&
+		    (!cloning || (cloning && detach_clone_auto))) {
 			if (report_last_gc_error())
 				return -1;
 
 			if (lock_repo_for_gc(force, &pid))
 				return 0;
-			if (gc_before_repack())
+			if (should_gc && gc_before_repack())
 				return -1;
 			delete_tempfile(&pidfile);
@@ -611,45 +631,48 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		atexit(process_log_file_at_exit);
 	}
 
-	if (gc_before_repack())
-		return -1;
-
-	if (!repository_format_precious_objects) {
-		close_all_packs(the_repository->objects);
-		if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, repack.argv[0]);
-
-		if (prune_expire) {
-			argv_array_push(&prune, prune_expire);
-			if (quiet)
-				argv_array_push(&prune, "--no-progress");
-			if (repository_format_partial_clone)
-				argv_array_push(&prune,
-						"--exclude-promisor-objects");
-			if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
-				return error(FAILED_RUN, prune.argv[0]);
+	if (!auto_gc || should_gc) {
+		if (gc_before_repack())
+			return -1;
+
+		if (!repository_format_precious_objects) {
+			close_all_packs(the_repository->objects);
+			if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
+				return error(FAILED_RUN, repack.argv[0]);
+
+			if (prune_expire) {
+				argv_array_push(&prune, prune_expire);
+				if (quiet)
+					argv_array_push(&prune, "--no-progress");
+				if (repository_format_partial_clone)
+					argv_array_push(&prune,
+							"--exclude-promisor-objects");
+				if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
+					return error(FAILED_RUN, prune.argv[0]);
+			}
 		}
-	}
 
-	if (prune_worktrees_expire) {
-		argv_array_push(&prune_worktrees, prune_worktrees_expire);
-		if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, prune_worktrees.argv[0]);
-	}
 
-	if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, rerere.argv[0]);
+		if (prune_worktrees_expire) {
+			argv_array_push(&prune_worktrees, prune_worktrees_expire);
+			if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
+				return error(FAILED_RUN, prune_worktrees.argv[0]);
+		}
 
-	report_garbage = report_pack_garbage;
-	reprepare_packed_git(the_repository);
-	if (pack_garbage.nr > 0)
-		clean_pack_garbage();
+		if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
+			return error(FAILED_RUN, rerere.argv[0]);
+
+		report_garbage = report_pack_garbage;
+		reprepare_packed_git(the_repository);
+		if (pack_garbage.nr > 0)
+			clean_pack_garbage();
+	}
 
 	if (gc_write_commit_graph)
 		write_commit_graph_reachable(get_object_directory(), 0,
 					     !quiet && !daemonized);
 
-	if (auto_gc && too_many_loose_objects())
+	if (auto_gc && should_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
			"run 'git prune' to remove them."));
diff --git a/commit-graph.c b/commit-graph.c
index 5908bd4e34..a4a7c94cec 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -57,6 +57,18 @@ static struct commit_graph *alloc_commit_graph(void)
 	return g;
 }
 
+int commit_graph_exists(const char *graph_file)
+{
+	struct stat st;
+	if (stat(graph_file, &st)) {
+		if (errno == ENOENT)
+			return 0;
+		else
+			return -1;
+	}
+	return 1;
+}
+
 struct commit_graph *load_commit_graph_one(const char *graph_file)
 {
 	void *graph_map;
diff --git a/commit-graph.h b/commit-graph.h
index 5678a8f4ca..a251f1bc32 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,7 @@ struct commit;
 
 char *get_commit_graph_filename(const char *obj_dir);
+int commit_graph_exists(const char *graph_file);
 
 /*
  * Given a commit struct, try to fill the commit struct info, including: