Message ID | 66165917a4660f63ce60b820d178d52a51304d20.1638224692.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | cruft packs | expand |
On 11/29/2021 5:25 PM, Taylor Blau wrote: > Generating a non-expiring cruft packs works as follows: I had trouble parsing the documentation changes below, so I came back to this commit message to see if that helps. > - Callers provide a list of every pack they know about, and indicate > which packs are about to be removed. This corresponds to the list over stdin. > - All packs which are going to be removed (we'll call these the > redundant ones) are marked as kept in-core, as well as any packs > that `pack-objects` found but the caller did not specify. Ok, so as an implementation detail we mark these as keep packs. > These packs are presumed to have entered the repository between > the caller collecting packs and invoking `pack-objects`. Since we > do not want to include objects in these packs (because we don't know > which of their objects are or aren't reachable), these are also > marked as kept in-core. Here, "are presumed" is doing a lot of work. Theoretically, there could be three categories: 1. This pack was just repacked and will be removed because all of its objects were placed into new objects. 2. Either this pack was repacked and contains important reachable objects OR we did a repack of reachable objects and this pack contained some extra, unreachable objects. 3. This pack was added to the repository while creating those repacked packs from category 2, so we don't know if things are reachable or not. So, the packs that we discover on-disk but are not specified over stdin are in this third category, but these are grouped with category 1 as we will treat them the same. > - Then, we enumerate all objects in the repository, and add them to > our packing list if they do not appear in an in-core kept pack. Here, we are looking at all of the objects in category 2 as well as loose objects. > This results in a new cruft pack which contains all known objects that > aren't included in the kept packs. When the kept pack is the result of > `git repack -A`, the resulting pack contains all unreachable objects. This now describes how 'git repack' will interface with this new change to pack-objects. I'll keep an eye out for that. > +--cruft:: Now getting to this description. > + Packs unreachable objects into a separate "cruft" pack, denoted > + by the existence of a `.mtimes` file. Pack names provided over > + stdin indicate which packs will remain after a `git repack`. > + Pack names prefixed with a `-` indicate those which will be > + removed. (...) This description is too tied to 'git repack'. Can we describe the input using terms independent of the 'git repack' operation? I need to keep reading. > (...) The contents of the cruft pack are all objects not > + contained in the surviving packs specified by `--keep-pack`) Now you use --keep-pack, which is a way of specifying a pack as "in-core keep" which was not in your commit message. Here, we also don't link the packs over stdin to the concept of keep packs. > + which have not exceeded the grace period (see > + `--cruft-expiration` below), or which have exceeded the grace > + period, but are reachable from an other object which hasn't. And now we think about the grace period! There is so much going on that I need to break it down to understand. An object is _excluded_ from the new cruft pack if 1. It is reachable from at least one reference. 2. It is in a pack from stdin prefixed with "-" 3. It is in a pack specified by `--keep-pack` 4. It is in an existing cruft pack and the .mtimes file states that its mtime is at least as recent as the time specified by the --cruft-expiration option. Breaking it down into a list like this helps me, at least. I'm not sure what the best way would look like. (Needing to pause here and look at the implementation later.) Thanks, -Stolee
On 11/29/2021 5:25 PM, Taylor Blau wrote: > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > +static int add_cruft_object_entry(const struct object_id *oid, enum object_type type, > + struct packed_git *pack, off_t offset, > + const char *name, uint32_t mtime) > +{ > + struct object_entry *entry; > + > + display_progress(progress_state, ++nr_seen); I don't love the global nr_seen here, but it is pervasive through the file. OK. > + entry = packlist_find(&to_pack, oid); > + if (entry) { > + if (name) { > + entry->hash = pack_name_hash(name); > + entry->no_try_delta = name && no_try_delta(name); This is already in an "if (name)" block, so "name &&" isn't needed. > + } > + } else { > + if (!want_object_in_pack(oid, 0, &pack, &offset)) > + return 0; > + if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) { > + /* > + * If a traversed tree has a missing blob then we want > + * to avoid adding that missing object to our pack. > + * > + * This only applies to missing blobs, not trees, > + * because the traversal needs to parse sub-trees but > + * not blobs. > + * > + * Note we only perform this check when we couldn't > + * already find the object in a pack, so we're really > + * limited to "ensure non-tip blobs which don't exist in > + * packs do exist via loose objects". Confused? > + */ > + return 0; > + } > + > + entry = create_object_entry(oid, type, pack_name_hash(name), > + 0, name && no_try_delta(name), > + pack, offset); > + } > + > + if (mtime > oe_cruft_mtime(&to_pack, entry)) > + oe_set_cruft_mtime(&to_pack, entry, mtime); > + return 1; I was confused at this "return 1" here, while other cases return 0. It turns out that there are multiple methods in this file that have different semantics: add_loose_object() and add_object_entry_from_pack() are both called from iterators where "return 1" means "stop iterating" so they return 0 always. add_object_entry_from_bitmap() is used to iterate over a bitmap and "return 1" means "include this object". However, the return code for add_cruft_object_entry() is never used, so it should probably return void or swap the meanings to have nonzero mean an error occurred. > +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep) > +{ > + struct string_list_item *item = NULL; > + for_each_string_list_item(item, packs) { > + struct packed_git *p = item->util; > + if (!p) > + die(_("could not find pack '%s'"), item->string); Interesting that this is a potential issue. We are expecting the pack to be loaded before we get here. Is this more because some packs might not actually load, but it's fine as long as we don't mark them as kept? > + p->pack_keep_in_core = keep; > + } > +} ... > +static void read_cruft_objects(void) > +{ > + struct strbuf buf = STRBUF_INIT; > + struct string_list discard_packs = STRING_LIST_INIT_DUP; > + struct string_list fresh_packs = STRING_LIST_INIT_DUP; > + struct packed_git *p; > + > + ignore_packed_keep_in_core = 1; Here is a global that we are suddenly changing. Should we not be returning it to its initial state when this method is complete? > +static int option_parse_cruft_expiration(const struct option *opt, > + const char *arg, int unset) > +{ > + if (unset) { > + cruft = 0; This unassignment of 'cruft' when cruft-expiration is unset with --no-cruft-expiration seems odd. I would expect git pack-objects --cruft --no-cruft-expiration to still make a cruft pack, but not expire anything. It seems that your code here makes --no-cruft-expiration disable the --cruft option. > + cruft_expiration = 0; > + } else { > + cruft = 1; > + if (arg) > + cruft_expiration = approxidate(arg); > + } > + return 0; > +} .. > + OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")), > + OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"), > + N_("expire cruft objects older than <time>"), > + PARSE_OPT_OPTARG, option_parse_cruft_expiration), > -static int has_loose_object(const struct object_id *oid) > +int has_loose_object(const struct object_id *oid) > { > return check_and_freshen(oid, 0); > } I'm surprised this hasn't been modified to use a repository pointer. Adding another caller here isn't too much debt, though. > diff --git a/object-store.h b/object-store.h > index d87481f101..a79c1c91ab 100644 > --- a/object-store.h > +++ b/object-store.h > @@ -308,6 +308,8 @@ int repo_has_object_file_with_flags(struct repository *r, > */ > int has_loose_object_nonlocal(const struct object_id *); Of course, here is another example that is already more widely used. > +int has_loose_object(const struct object_id *); > + > void assert_oid_type(const struct object_id *oid, enum object_type expect); ... > + test_expect_success "unreachable packed objects are packed (expire $expire)" ' > + git init repo && > + test_when_finished "rm -fr repo" && > + ( > + cd repo && > + > + test_commit packed && > + git repack -Ad && > + test_commit other && > + > + git rev-list --objects --no-object-names packed.. >objects && > + keep="$(basename "$(ls $packdir/pack-*.pack)")" && > + other="$(git pack-objects --delta-base-offset \ > + $packdir/pack <objects)" && > + git prune-packed && > + > + test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect && I am missing how this test creates _unreachable_ objects. I would expect removal of some refs or a 'git reset --hard' somewhere. What am I missing? > + cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF > + $keep > + -pack-$other.pack > + EOF > + )" && > + test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw && > + > + cut -d" " -f2 <actual.raw | sort -u >actual && > + > + test_cmp expect actual > + ) > + ' > + > + test_expect_success "unreachable cruft objects are repacked (expire $expire)" ' I have the same question for all of the tests, really. > + # remove the unreachable tree, but leave the commit > + # which has it as its root tree in-tact nit: "intact" is one word. > + rm -fr "$objdir/$(test_oid_to_path "$tree")" && > + > + git repack -Ad && > + basename $(ls $packdir/pack-*.pack) >in && > + git pack-objects --cruft --cruft-expiration="$expire" \ > + $packdir/pack <in > + ) > + ' ... > +basic_cruft_pack_tests never I look forward to seeing how this changes with additional expiration values. Thanks, -Stolee
On Tue, Dec 07, 2021 at 10:17:28AM -0500, Derrick Stolee wrote: > On 11/29/2021 5:25 PM, Taylor Blau wrote: > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > > +static int add_cruft_object_entry(const struct object_id *oid, enum object_type type, > > + struct packed_git *pack, off_t offset, > > + const char *name, uint32_t mtime) > > +{ > > + struct object_entry *entry; > > + > > + display_progress(progress_state, ++nr_seen); > > I don't love the global nr_seen here, but it is pervasive through the > file. OK. Yeah; this is how all of the existing progress code works in pack-objects. > > + entry = packlist_find(&to_pack, oid); > > + if (entry) { > > + if (name) { > > + entry->hash = pack_name_hash(name); > > + entry->no_try_delta = name && no_try_delta(name); > > This is already in an "if (name)" block, so "name &&" isn't needed. Thanks; this is a copy-and-paste from add_object_entry(), where we aren't in a conditional on "name". We could also fold the conditional on whether or not name is NULL into no_try_delta itself, since all existing calls look like "name && no_try_delta(name)". So adding something like: if (!name) return 0; to the beginning of no_try_delta()'s implementation would allow us to get rid of the handful of "name &&"s. But I'm trying to avoid touching other parts of pack-objects as much as I can, so I'll hold off for now. > > + } > > + } else { > > + if (!want_object_in_pack(oid, 0, &pack, &offset)) > > + return 0; > > + if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) { > > + /* > > + * If a traversed tree has a missing blob then we want > > + * to avoid adding that missing object to our pack. > > + * > > + * This only applies to missing blobs, not trees, > > + * because the traversal needs to parse sub-trees but > > + * not blobs. > > + * > > + * Note we only perform this check when we couldn't > > + * already find the object in a pack, so we're really > > + * limited to "ensure non-tip blobs which don't exist in > > + * packs do exist via loose objects". Confused? > > + */ > > + return 0; > > + } > > + > > + entry = create_object_entry(oid, type, pack_name_hash(name), > > + 0, name && no_try_delta(name), > > + pack, offset); > > + } > > + > > + if (mtime > oe_cruft_mtime(&to_pack, entry)) > > + oe_set_cruft_mtime(&to_pack, entry, mtime); > > + return 1; > > I was confused at this "return 1" here, while other cases return 0. > > It turns out that there are multiple methods in this file that have > different semantics: add_loose_object() and add_object_entry_from_pack() > are both called from iterators where "return 1" means "stop iterating" > so they return 0 always. add_object_entry_from_bitmap() is used to > iterate over a bitmap and "return 1" means "include this object". > > However, the return code for add_cruft_object_entry() is never used, > so it should probably return void or swap the meanings to have nonzero > mean an error occurred. Yes, exactly. And thanks for tracing out both of the different meanings/interpretations of these add_xyz_entry() functions. As you can imagine, this implementation is copy-and-pasted from add_object_entry(), which was specialized for this use here. At the time, I gave some effort towards trying to share more code with add_object_entry() for this special case, but it ended up being pretty awkward, hence the separate implementation. Ironically, add_object_entry()'s return code is also unused, so we could probably clean that up, too. But like the above, I'll avoid it for now in an effort to touch as little of pack-objects in this patch as I can. > > +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep) > > +{ > > + struct string_list_item *item = NULL; > > + for_each_string_list_item(item, packs) { > > + struct packed_git *p = item->util; > > + if (!p) > > + die(_("could not find pack '%s'"), item->string); > > Interesting that this is a potential issue. We are expecting the pack > to be loaded before we get here. Is this more because some packs might > not actually load, but it's fine as long as we don't mark them as kept? Not quite "loaded" (though any pack structures that we look at by this point will be fully "loaded"). Instead, we're making sure that all of the packs names we read from stdin could be matched to packs that we found in the repository (i.e., that we produce an appropriate error message if we found "pack-does-not-exist.pack" on stdin). This is all because we process input from stdin in two phases: - First, read all of the input into two string_lists, one for the packs we're about to discard (anything that start with '-'), and another for all of the "fresh" packs (i.e., anything that we're not going to discard). - Then, loop through all of the packed_git structs we have, querying both of the aforementioned string lists for input that matches each pack's `pack_name` field, and setting the `->util` pointer of the matching string_list_entry appropriately. Following those two steps, any list entries that have a NULL util pointer correspond with bogus input, so we want to call die() there. > > + p->pack_keep_in_core = keep; > > + } > > +} > ... > > +static void read_cruft_objects(void) > > +{ > > + struct strbuf buf = STRBUF_INIT; > > + struct string_list discard_packs = STRING_LIST_INIT_DUP; > > + struct string_list fresh_packs = STRING_LIST_INIT_DUP; > > + struct packed_git *p; > > + > > + ignore_packed_keep_in_core = 1; > > Here is a global that we are suddenly changing. Should we not be > returning it to its initial state when this method is complete? We could, although it won't matter in practice, because we'll want to keep that setting around for our traversal, after which point pack-objects will exit. > > +static int option_parse_cruft_expiration(const struct option *opt, > > + const char *arg, int unset) > > +{ > > + if (unset) { > > + cruft = 0; > > This unassignment of 'cruft' when cruft-expiration is unset with > --no-cruft-expiration seems odd. I would expect > > git pack-objects --cruft --no-cruft-expiration > > to still make a cruft pack, but not expire anything. It seems that > your code here makes --no-cruft-expiration disable the --cruft option. Hmm. I could see compelling reasoning that goes both ways. On the one hand, `--no-cruft-expiration` (to me, at least) seems to imply "set `--cruft-expiration` to "never"). On the other hand, it also matches our convention of `--no`-prefixed options to unset some value. This implementation takes the latter approach, though we could easily change it to set the cruft expiration to "never". I don't have a strong opinion about which is better, so I'm happy to do either if you have a better sense about which has more expected behavior. > > + cruft_expiration = 0; > > + } else { > > + cruft = 1; > > + if (arg) > > + cruft_expiration = approxidate(arg); > > + } > > + return 0; > > +} > .. > > + OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")), > > + OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"), > > + N_("expire cruft objects older than <time>"), > > + PARSE_OPT_OPTARG, option_parse_cruft_expiration), > > > -static int has_loose_object(const struct object_id *oid) > > +int has_loose_object(const struct object_id *oid) > > { > > return check_and_freshen(oid, 0); > > } > > I'm surprised this hasn't been modified to use a repository pointer. > Adding another caller here isn't too much debt, though. Yeah, check_and_freshen() doesn't have a variant that takes a repository pointer. Good #leftoverbits, I guess! > > +int has_loose_object(const struct object_id *); > > + > > void assert_oid_type(const struct object_id *oid, enum object_type expect); > > ... > > > + test_expect_success "unreachable packed objects are packed (expire $expire)" ' > > + git init repo && > > + test_when_finished "rm -fr repo" && > > + ( > > + cd repo && > > + > > + test_commit packed && > > + git repack -Ad && > > + test_commit other && > > + > > + git rev-list --objects --no-object-names packed.. >objects && > > + keep="$(basename "$(ls $packdir/pack-*.pack)")" && > > + other="$(git pack-objects --delta-base-offset \ > > + $packdir/pack <objects)" && > > + git prune-packed && > > + > > + test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect && > > I am missing how this test creates _unreachable_ objects. I would expect removal of > some refs or a 'git reset --hard' somewhere. What am I missing? For this and the other tests the so-called "unreachable" objects are technically reachable, but we can treat them as unreachable by putting them in the "discard" packs list (or by not mentioning them at all to `git pack-objects --cruft`). > > + # remove the unreachable tree, but leave the commit > > + # which has it as its root tree in-tact > > nit: "intact" is one word. Thanks; fixed here and in the other test which was added by this commit. Thanks, Taylor
On Mon, Dec 06, 2021 at 04:44:31PM -0500, Derrick Stolee wrote: > On 11/29/2021 5:25 PM, Taylor Blau wrote: > > Generating a non-expiring cruft packs works as follows: > > I had trouble parsing the documentation changes below, so I came back > to this commit message to see if that helps. > > > - Callers provide a list of every pack they know about, and indicate > > which packs are about to be removed. > > This corresponds to the list over stdin. > > > - All packs which are going to be removed (we'll call these the > > redundant ones) are marked as kept in-core, as well as any packs > > that `pack-objects` found but the caller did not specify. > > Ok, so as an implementation detail we mark these as keep packs. > > These packs are presumed to have entered the repository between > > the caller collecting packs and invoking `pack-objects`. Since we > > do not want to include objects in these packs (because we don't know > > which of their objects are or aren't reachable), these are also > > marked as kept in-core. > > Here, "are presumed" is doing a lot of work. Theoretically, there could > be three categories: > > 1. This pack was just repacked and will be removed because all of its > objects were placed into new objects. > > 2. Either this pack was repacked and contains important reachable objects > OR we did a repack of reachable objects and this pack contained some > extra, unreachable objects. > > 3. This pack was added to the repository while creating those repacked > packs from category 2, so we don't know if things are reachable or > not. > > So, the packs that we discover on-disk but are not specified over stdin > are in this third category, but these are grouped with category 1 as we > will treat them the same. Ah, I think I caused some unintentional confusion by attaching "are presumed" to "these packs", when it wasn't clear that "these packs" meant "ones that aren't listed over stdin". Since the caller is supposed to provide a complete picture of the repository as they see it, any packs known to the pack-objects process that aren't mentioned over stdin are assumed to have entered the repository after the caller was spun up. I'll clarify this section of the commit message, since I agree it is unnecessarily confusing. > > - Then, we enumerate all objects in the repository, and add them to > > our packing list if they do not appear in an in-core kept pack. > > Here, we are looking at all of the objects in category 2 as well as > loose objects. We're enumerating any objects that aren't in packs which are marked as kept in-core (along with loose objects which don't appear in packs that are marked as kept in-core). The in-core kept packs are ones that the caller (and I find it's helpful to read "the caller" as "git repack") has marked as "will delete". So the non in-core pack(s) that we're looking at here contain all reachable objects (e.g., like you would get with `git repack -A`). > > + Packs unreachable objects into a separate "cruft" pack, denoted > > + by the existence of a `.mtimes` file. Pack names provided over > > + stdin indicate which packs will remain after a `git repack`. > > + Pack names prefixed with a `-` indicate those which will be > > + removed. (...) > > This description is too tied to 'git repack'. Can we describe the > input using terms independent of the 'git repack' operation? I need > to keep reading. > > > (...) The contents of the cruft pack are all objects not > > + contained in the surviving packs specified by `--keep-pack`) > > Now you use --keep-pack, which is a way of specifying a pack as > "in-core keep" which was not in your commit message. Here, we also > don't link the packs over stdin to the concept of keep packs. The mention of `--keep-pack` is a mistake left over from a previous version; thanks for spotting. Here's a version of the first paragraph from this piece of documentation which is less tied to `git repack` and hopefully a little clearer: --cruft:: Packs unreachable objects into a separate "cruft" pack, denoted by the existence of a `.mtimes` file. Typically used by `git repack --cruft`. Callers provide a list of pack names and indicate which packs will remain in the repository, along with which packs will be deleted (indicated by the `-` prefix). The contents of the cruft pack are all objects not contained in the surviving packs which have not exceeded the grace period (see `--cruft-expiration` below), or which have exceeded the grace period, but are reachable from an other object which hasn't. > > + which have not exceeded the grace period (see > > + `--cruft-expiration` below), or which have exceeded the grace > > + period, but are reachable from an other object which hasn't. > > And now we think about the grace period! There is so much going on > that I need to break it down to understand. > > An object is _excluded_ from the new cruft pack if > > 1. It is reachable from at least one reference. > 2. It is in a pack from stdin prefixed with "-" > 3. It is in a pack specified by `--keep-pack` > 4. It is in an existing cruft pack and the .mtimes file states > that its mtime is at least as recent as the time specified by > the --cruft-expiration option. > > Breaking it down into a list like this helps me, at least. I'm not > sure what the best way would look like. Given some expiration T, cruft packs contain all unreachable objects which are newer than T, along with any cruft objects (i.e., those not directly reachable from any ref) which are older than T, but reachable from another cruft object newer than T. Thanks, Taylor
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index dbfd1f9017..573c18afcd 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -13,6 +13,7 @@ SYNOPSIS [--no-reuse-delta] [--delta-base-offset] [--non-empty] [--local] [--incremental] [--window=<n>] [--depth=<n>] [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] + [--cruft] [--cruft-expiration=<time>] [--stdout [--filter=<filter-spec>] | base-name] [--shallow] [--keep-true-parents] [--[no-]sparse] < object-list @@ -95,6 +96,28 @@ base-name:: Incompatible with `--revs`, or options that imply `--revs` (such as `--all`), with the exception of `--unpacked`, which is compatible. +--cruft:: + Packs unreachable objects into a separate "cruft" pack, denoted + by the existence of a `.mtimes` file. Pack names provided over + stdin indicate which packs will remain after a `git repack`. + Pack names prefixed with a `-` indicate those which will be + removed. The contents of the cruft pack are all objects not + contained in the surviving packs specified by `--keep-pack`) + which have not exceeded the grace period (see + `--cruft-expiration` below), or which have exceeded the grace + period, but are reachable from an other object which hasn't. ++ +Incompatible with `--unpack-unreachable`, `--keep-unreachable`, +`--pack-loose-unreachable`, `--stdin-packs`, as well as any other +options which imply `--revs`. Also incompatible with `--max-pack-size`; +when this option is set, the maximum pack size is not inferred from +`pack.packSizeLimit`. + +--cruft-expiration=<approxidate>:: + If specified, objects are eliminated from the cruft pack if they + have an mtime older than `<approxidate>`. If unspecified (and + given `--cruft`), then no objects are eliminated. + --window=<n>:: --depth=<n>:: These two options affect how the objects contained in diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 3fb10529ba..b12e79e4b1 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -36,6 +36,7 @@ #include "trace2.h" #include "shallow.h" #include "promisor-remote.h" +#include "pack-mtimes.h" /* * Objects we are going to pack are collected in the `to_pack` structure. @@ -194,6 +195,8 @@ static int reuse_delta = 1, reuse_object = 1; static int keep_unreachable, unpack_unreachable, include_tag; static timestamp_t unpack_unreachable_expiration; static int pack_loose_unreachable; +static int cruft; +static timestamp_t cruft_expiration; static int local; static int have_non_local_packs; static int incremental; @@ -1252,6 +1255,9 @@ static void write_pack_file(void) &to_pack, written_list, nr_written); } + if (cruft) + pack_idx_opts.flags |= WRITE_MTIMES; + stage_tmp_packfiles(&tmpname, pack_tmp_name, written_list, nr_written, &to_pack, &pack_idx_opts, hash, @@ -3389,6 +3395,135 @@ static void read_packs_list_from_stdin(void) string_list_clear(&exclude_packs, 0); } +static int add_cruft_object_entry(const struct object_id *oid, enum object_type type, + struct packed_git *pack, off_t offset, + const char *name, uint32_t mtime) +{ + struct object_entry *entry; + + display_progress(progress_state, ++nr_seen); + + entry = packlist_find(&to_pack, oid); + if (entry) { + if (name) { + entry->hash = pack_name_hash(name); + entry->no_try_delta = name && no_try_delta(name); + } + } else { + if (!want_object_in_pack(oid, 0, &pack, &offset)) + return 0; + if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) { + /* + * If a traversed tree has a missing blob then we want + * to avoid adding that missing object to our pack. + * + * This only applies to missing blobs, not trees, + * because the traversal needs to parse sub-trees but + * not blobs. + * + * Note we only perform this check when we couldn't + * already find the object in a pack, so we're really + * limited to "ensure non-tip blobs which don't exist in + * packs do exist via loose objects". Confused? + */ + return 0; + } + + entry = create_object_entry(oid, type, pack_name_hash(name), + 0, name && no_try_delta(name), + pack, offset); + } + + if (mtime > oe_cruft_mtime(&to_pack, entry)) + oe_set_cruft_mtime(&to_pack, entry, mtime); + return 1; +} + +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep) +{ + struct string_list_item *item = NULL; + for_each_string_list_item(item, packs) { + struct packed_git *p = item->util; + if (!p) + die(_("could not find pack '%s'"), item->string); + p->pack_keep_in_core = keep; + } +} + +static void add_unreachable_loose_objects(void); +static void add_objects_in_unpacked_packs(void); + +static void enumerate_cruft_objects(void) +{ + if (progress) + progress_state = start_progress(_("Enumerating cruft objects"), 0); + + add_objects_in_unpacked_packs(); + add_unreachable_loose_objects(); + + stop_progress(&progress_state); +} + +static void read_cruft_objects(void) +{ + struct strbuf buf = STRBUF_INIT; + struct string_list discard_packs = STRING_LIST_INIT_DUP; + struct string_list fresh_packs = STRING_LIST_INIT_DUP; + struct packed_git *p; + + ignore_packed_keep_in_core = 1; + + while (strbuf_getline(&buf, stdin) != EOF) { + if (!buf.len) + continue; + + if (*buf.buf == '-') + string_list_append(&discard_packs, buf.buf + 1); + else + string_list_append(&fresh_packs, buf.buf); + strbuf_reset(&buf); + } + + string_list_sort(&discard_packs); + string_list_sort(&fresh_packs); + + for (p = get_all_packs(the_repository); p; p = p->next) { + const char *pack_name = pack_basename(p); + struct string_list_item *item; + + item = string_list_lookup(&fresh_packs, pack_name); + if (!item) + item = string_list_lookup(&discard_packs, pack_name); + + if (item) { + item->util = p; + } else { + /* + * This pack wasn't mentioned in either the "fresh" or + * "discard" list, so the caller didn't know about it. + * + * Mark it as kept so that its objects are ignored by + * add_unseen_recent_objects_to_traversal(). We'll + * unmark it before starting the traversal so it doesn't + * halt the traversal early. + */ + p->pack_keep_in_core = 1; + } + } + + mark_pack_kept_in_core(&fresh_packs, 1); + mark_pack_kept_in_core(&discard_packs, 0); + + if (cruft_expiration) + die("--cruft-expiration not yet implemented"); + else + enumerate_cruft_objects(); + + strbuf_release(&buf); + string_list_clear(&discard_packs, 0); + string_list_clear(&fresh_packs, 0); +} + static void read_object_list_from_stdin(void) { char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2]; @@ -3521,7 +3656,24 @@ static int add_object_in_unpacked_pack(const struct object_id *oid, uint32_t pos, void *_data) { - add_object_entry(oid, OBJ_NONE, "", 0); + if (cruft) { + off_t offset; + time_t mtime; + + if (pack->is_cruft) { + if (load_pack_mtimes(pack) < 0) + die(_("could not load cruft pack .mtimes")); + mtime = nth_packed_mtime(pack, pos); + } else { + mtime = pack->mtime; + } + offset = nth_packed_object_offset(pack, pos); + + add_cruft_object_entry(oid, OBJ_NONE, pack, offset, + NULL, mtime); + } else { + add_object_entry(oid, OBJ_NONE, "", 0); + } return 0; } @@ -3545,7 +3697,19 @@ static int add_loose_object(const struct object_id *oid, const char *path, return 0; } - add_object_entry(oid, type, "", 0); + if (cruft) { + struct stat st; + if (stat(path, &st) < 0) { + if (errno == ENOENT) + return 0; + return error_errno("unable to stat %s", oid_to_hex(oid)); + } + + add_cruft_object_entry(oid, type, NULL, 0, NULL, + st.st_mtime); + } else { + add_object_entry(oid, type, "", 0); + } return 0; } @@ -3864,6 +4028,20 @@ static int option_parse_unpack_unreachable(const struct option *opt, return 0; } +static int option_parse_cruft_expiration(const struct option *opt, + const char *arg, int unset) +{ + if (unset) { + cruft = 0; + cruft_expiration = 0; + } else { + cruft = 1; + if (arg) + cruft_expiration = approxidate(arg); + } + return 0; +} + int cmd_pack_objects(int argc, const char **argv, const char *prefix) { int use_internal_rev_list = 0; @@ -3936,6 +4114,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) OPT_CALLBACK_F(0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than <time>"), PARSE_OPT_OPTARG, option_parse_unpack_unreachable), + OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")), + OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"), + N_("expire cruft objects older than <time>"), + PARSE_OPT_OPTARG, option_parse_cruft_expiration), OPT_BOOL(0, "sparse", &sparse, N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", &thin, @@ -4060,7 +4242,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (!HAVE_THREADS && delta_search_threads != 1) warning(_("no threads support, ignoring --threads")); - if (!pack_to_stdout && !pack_size_limit) + if (!pack_to_stdout && !pack_size_limit && !cruft) pack_size_limit = pack_size_limit_cfg; if (pack_to_stdout && pack_size_limit) die(_("--max-pack-size cannot be used to build a pack for transfer")); @@ -4087,6 +4269,15 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (stdin_packs && use_internal_rev_list) die(_("cannot use internal rev list with --stdin-packs")); + if (cruft) { + if (use_internal_rev_list) + die(_("cannot use internal rev list with --cruft")); + if (stdin_packs) + die(_("cannot use --stdin-packs with --cruft")); + if (pack_size_limit) + die(_("cannot use --max-pack-size with --cruft")); + } + /* * "soft" reasons not to use bitmaps - for on-disk repack by default we want * @@ -4143,7 +4334,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) the_repository); prepare_packing_data(the_repository, &to_pack); - if (progress) + if (progress && !cruft) progress_state = start_progress(_("Enumerating objects"), 0); if (stdin_packs) { /* avoids adding objects in excluded packs */ @@ -4151,7 +4342,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_packs_list_from_stdin(); if (rev_list_unpacked) add_unreachable_loose_objects(); - } else if (!use_internal_rev_list) + } else if (cruft) + read_cruft_objects(); + else if (!use_internal_rev_list) read_object_list_from_stdin(); else { get_object_list(rp.nr, rp.v); diff --git a/object-file.c b/object-file.c index c3d866a287..7ddb38b64a 100644 --- a/object-file.c +++ b/object-file.c @@ -956,7 +956,7 @@ int has_loose_object_nonlocal(const struct object_id *oid) return check_and_freshen_nonlocal(oid, 0); } -static int has_loose_object(const struct object_id *oid) +int has_loose_object(const struct object_id *oid) { return check_and_freshen(oid, 0); } diff --git a/object-store.h b/object-store.h index d87481f101..a79c1c91ab 100644 --- a/object-store.h +++ b/object-store.h @@ -308,6 +308,8 @@ int repo_has_object_file_with_flags(struct repository *r, */ int has_loose_object_nonlocal(const struct object_id *); +int has_loose_object(const struct object_id *); + void assert_oid_type(const struct object_id *oid, enum object_type expect); /* diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh new file mode 100755 index 0000000000..543a80e9bf --- /dev/null +++ b/t/t5327-pack-objects-cruft.sh @@ -0,0 +1,218 @@ +#!/bin/sh + +test_description='cruft pack related pack-objects tests' +. ./test-lib.sh + +objdir=.git/objects +packdir=$objdir/pack + +basic_cruft_pack_tests () { + expire="$1" + + test_expect_success "unreachable loose objects are packed (expire $expire)" ' + git init repo && + test_when_finished "rm -fr repo" && + ( + cd repo && + + test_commit base && + git repack -Ad && + test_commit loose && + + test-tool chmtime +2000 "$objdir/$(test_oid_to_path \ + $(git rev-parse loose:loose.t))" && + test-tool chmtime +1000 "$objdir/$(test_oid_to_path \ + $(git rev-parse loose^{tree}))" && + + ( + git rev-list --objects --no-object-names base..loose | + while read oid + do + path="$objdir/$(test_oid_to_path "$oid")" && + printf "%s %d\n" "$oid" "$(test-tool chmtime --get "$path")" + done | + sort -k1 + ) >expect && + + keep="$(basename "$(ls $packdir/pack-*.pack)")" && + cruft="$(echo $keep | git pack-objects --cruft \ + --cruft-expiration="$expire" $packdir/pack)" && + test-tool pack-mtimes "pack-$cruft.mtimes" >actual && + + test_cmp expect actual + ) + ' + + test_expect_success "unreachable packed objects are packed (expire $expire)" ' + git init repo && + test_when_finished "rm -fr repo" && + ( + cd repo && + + test_commit packed && + git repack -Ad && + test_commit other && + + git rev-list --objects --no-object-names packed.. >objects && + keep="$(basename "$(ls $packdir/pack-*.pack)")" && + other="$(git pack-objects --delta-base-offset \ + $packdir/pack <objects)" && + git prune-packed && + + test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect && + + cruft="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF + $keep + -pack-$other.pack + EOF + )" && + test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw && + + cut -d" " -f2 <actual.raw | sort -u >actual && + + test_cmp expect actual + ) + ' + + test_expect_success "unreachable cruft objects are repacked (expire $expire)" ' + git init repo && + test_when_finished "rm -fr repo" && + ( + cd repo && + + test_commit packed && + git repack -Ad && + test_commit other && + + git rev-list --objects --no-object-names packed.. >objects && + keep="$(basename "$(ls $packdir/pack-*.pack)")" && + + cruft_a="$(echo $keep | git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack)" && + git prune-packed && + cruft_b="$(git pack-objects --cruft --cruft-expiration="$expire" $packdir/pack <<-EOF + $keep + -pack-$cruft_a.pack + EOF + )" && + + test-tool pack-mtimes "pack-$cruft_a.mtimes" >expect.raw && + test-tool pack-mtimes "pack-$cruft_b.mtimes" >actual.raw && + + sort <expect.raw >expect && + sort <actual.raw >actual && + + test_cmp expect actual + ) + ' + + test_expect_success "multiple cruft packs (expire $expire)" ' + git init repo && + test_when_finished "rm -fr repo" && + ( + cd repo && + + test_commit reachable && + git repack -Ad && + keep="$(basename "$(ls $packdir/pack-*.pack)")" && + + test_commit cruft && + loose="$objdir/$(test_oid_to_path $(git rev-parse cruft))" && + + # generate three copies of the cruft object in different + # cruft packs, each with a unique mtime: + # - one expired (1000 seconds ago) + # - two non-expired (one 1000 seconds in the future, + # one 1500 seconds in the future) + test-tool chmtime =-1000 "$loose" && + git pack-objects --cruft $packdir/pack-A <<-EOF && + $keep + EOF + test-tool chmtime =+1000 "$loose" && + git pack-objects --cruft $packdir/pack-B <<-EOF && + $keep + -$(basename $(ls $packdir/pack-A-*.pack)) + EOF + test-tool chmtime =+1500 "$loose" && + git pack-objects --cruft $packdir/pack-C <<-EOF && + $keep + -$(basename $(ls $packdir/pack-A-*.pack)) + -$(basename $(ls $packdir/pack-B-*.pack)) + EOF + + # ensure the resulting cruft pack takes the most recent + # mtime among all copies + cruft="$(git pack-objects --cruft \ + --cruft-expiration="$expire" \ + $packdir/pack <<-EOF + $keep + -$(basename $(ls $packdir/pack-A-*.pack)) + -$(basename $(ls $packdir/pack-B-*.pack)) + -$(basename $(ls $packdir/pack-C-*.pack)) + EOF + )" && + + test-tool pack-mtimes "$(basename $(ls $packdir/pack-C-*.mtimes))" >expect.raw && + test-tool pack-mtimes "pack-$cruft.mtimes" >actual.raw && + + sort expect.raw >expect && + sort actual.raw >actual && + test_cmp expect actual + ) + ' + + test_expect_success "cruft packs tolerate missing trees (expire $expire)" ' + git init repo && + test_when_finished "rm -fr repo" && + ( + cd repo && + + test_commit reachable && + test_commit cruft && + + tree="$(git rev-parse cruft^{tree})" && + + git reset --hard reachable && + git tag -d cruft && + rm -fr .git/logs && + + # remove the unreachable tree, but leave the commit + # which has it as its root tree in-tact + rm -fr "$objdir/$(test_oid_to_path "$tree")" && + + git repack -Ad && + basename $(ls $packdir/pack-*.pack) >in && + git pack-objects --cruft --cruft-expiration="$expire" \ + $packdir/pack <in + ) + ' + + test_expect_success "cruft packs tolerate missing blobs (expire $expire)" ' + git init repo && + test_when_finished "rm -fr repo" && + ( + cd repo && + + test_commit reachable && + test_commit cruft && + + blob="$(git rev-parse cruft:cruft.t)" && + + git reset --hard reachable && + git tag -d cruft && + rm -fr .git/logs && + + # remove the unreachable blob, but leave the commit (and + # the root tree of that commit) in-tact + rm -fr "$objdir/$(test_oid_to_path "$blob")" && + + git repack -Ad && + basename $(ls $packdir/pack-*.pack) >in && + git pack-objects --cruft --cruft-expiration="$expire" \ + $packdir/pack <in + ) + ' +} + +basic_cruft_pack_tests never + +test_done
Teach `pack-objects` how to generate a cruft pack when no objects are dropped (i.e., `--cruft-expiration=never`). Later patches will teach `pack-objects` how to generate a cruft pack that prunes objects. When generating a cruft pack which does not prune objects, we want to collect all unreachable objects into a single pack (noting and updating their mtimes as we accumulate them). Ordinary use will pass the result of a `git repack -A` as a kept pack, so when this patch says "kept pack", readers should think "reachable objects". Generating a non-expiring cruft packs works as follows: - Callers provide a list of every pack they know about, and indicate which packs are about to be removed. - All packs which are going to be removed (we'll call these the redundant ones) are marked as kept in-core, as well as any packs that `pack-objects` found but the caller did not specify. These packs are presumed to have entered the repository between the caller collecting packs and invoking `pack-objects`. Since we do not want to include objects in these packs (because we don't know which of their objects are or aren't reachable), these are also marked as kept in-core. - Then, we enumerate all objects in the repository, and add them to our packing list if they do not appear in an in-core kept pack. This results in a new cruft pack which contains all known objects that aren't included in the kept packs. When the kept pack is the result of `git repack -A`, the resulting pack contains all unreachable objects. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- Documentation/git-pack-objects.txt | 23 +++ builtin/pack-objects.c | 203 ++++++++++++++++++++++++++- object-file.c | 2 +- object-store.h | 2 + t/t5327-pack-objects-cruft.sh | 218 +++++++++++++++++++++++++++++ 5 files changed, 442 insertions(+), 6 deletions(-) create mode 100755 t/t5327-pack-objects-cruft.sh