Message ID | cover.1656528343.git.me@ttaylorr.com (mailing list archive) |
---|---|
Headers | show |
Series | move pruned objects to a separate repository | expand |
Taylor Blau <me@ttaylorr.com> writes: > This series is an RFC for now since I'm interested in discussing whether > or not this is a feature that people would actually want to use or not. > But if it is, I'm happy to polish this up and turn it into a > non-RFC-quality series ;-). > > In the meantime, thanks for your review! Thanks for this patch set. I can see this being used by, say, someone who wants to preserve a repo that rewinds branches all the time (the refs would need to be backed-up separately, but at least this provides a way for objects to be stored efficiently, in that reachable objects are still stored in the main repo and unreachable objects are stored in the backup with no overlap between them). I think there is at least one more alternative that should be considered, though: since the cruft pack is unlikely to have its objects "resurrected" (since the reason why they're there is because they are unreachable), it is likely that the objects that are pruned are exactly the same as those in the craft pack. So it would be more efficient to just unconditionally rename the cruft pack to the backup destination. Having said that, if there is a compelling use case for repacking even when we're moving from cruft pack to backup, the design of this patch set looks good overall. There are some minor points (e.g. the naming of the parameter "out" in patch 1), but I understand that this is an RFC and I'll wait for a non-RFC patch set before looking more closely at these things.
On Wed, Jun 29, 2022 at 03:54:04PM -0700, Jonathan Tan wrote: > Taylor Blau <me@ttaylorr.com> writes: > > This series is an RFC for now since I'm interested in discussing whether > > or not this is a feature that people would actually want to use or not. > > But if it is, I'm happy to polish this up and turn it into a > > non-RFC-quality series ;-). > > > > In the meantime, thanks for your review! > > Thanks for this patch set. I can see this being used by, say, someone > who wants to preserve a repo that rewinds branches all the time (the > refs would need to be backed-up separately, but at least this provides a > way for objects to be stored efficiently, in that reachable objects are > still stored in the main repo and unreachable objects are stored in the > backup with no overlap between them). Yes, definitely. If it helps, I can share a little bit about the motivating use-case within GitHub. All objects from a fork network are stored together in a repository that we call the network.git, with individual forks keeping track of their own references. The network.git repository can often grow quite large, and/or contain data that the owner of an individual fork would like removed (e.g., they accidentally pushed sensitive credentials, force-pushed over it, but would like the now-unreachable objects to be removed). We don't usually do pruning GC's except during manual intervention or upon request through a support ticket. But when we do it is often infeasible to lock the entire network's push traffic and reference updates. So it is not an unheard of event to encounter the race that I described above. The idea is that, at least for non-sensitive pruning, we would move the pruned objects to a separate repository and hold them there until we could run `git fsck` on the repository after pruning and verify that the repository is intact. If it is, then the expired.git repository can be emptied, too, permanently removing the pruned objects. If not, the expired.git repository then becomes a donor for the missing objects, which are used to heal the corrupt main repository. Once *that* is done, and fsck comes back clean, then the expired.git repository can be removed. > I think there is at least one more alternative that should be > considered, though: since the cruft pack is unlikely to have its objects > "resurrected" (since the reason why they're there is because they are > unreachable), it is likely that the objects that are pruned are exactly > the same as those in the craft pack. So it would be more efficient to > just unconditionally rename the cruft pack to the backup destination. This isn't quite right. The contents that are written into the expired.git repository is everything that *didn't* end up in the cruft pack. Suppose your cruft expiration is 1.hour.ago, and your doing a repack on repository foo.git, expiring objects into expired.git. There are three disjoint sets of objects: - reachable objects, which will stay in foo.git - unreachable objects which were written within the last hour (and are thus too new to prune) which will stay in foo.git - unreachable objects which *weren't* written within the last hour (and thus will be pruned) which are moved to a new pack in expired.git (and removed from foo.git) So the cruft pack in foo.git and the one written to expired.git are a disjoint cut of the unreachable objects in foo.git based on their mtime, with the recent objects staying in the source repository and the stale ones moving to the expired.git repository. The original implementation of this feature was to move the entire cruft pack out of the way like you describe. This is sub-optimal because you are forced to generate that cruft pack with `--cruft-expiration=never`, since you can't actually prune any objects when generating the cruft pack, or they would be gone forever. But since you have to move the entire cruft pack out of the way, the visible effect looks like you actually pruned *all* unreachable objects, as if you had supplied `--cruft-expiration=now`. Being able to expire just the objects which have aged out of the grace period should cause this race to happen less frequently in practice. > Having said that, if there is a compelling use case for repacking even > when we're moving from cruft pack to backup, the design of this patch > set looks good overall. There are some minor points (e.g. the naming of > the parameter "out" in patch 1), but I understand that this is an RFC > and I'll wait for a non-RFC patch set before looking more closely at > these things. Thanks, Taylor
On Wed, Jun 29 2022, Taylor Blau wrote: > Now that cruft packs are available in v2.37.0, here is an interesting > application of that new feature to enable a two-phase object pruning > approach. > > This came out of a discussion within GitHub about ways we could support > storing a set of pruned objects in "limbo" so that they were not > accessible from the repository which pruned them, but instead stored in > a cruft pack in a separate repository which lists the original one as an > alternate. > > This makes it possible to take the collection of all pruned objects and > store them in a cruft pack in a separate repository. This repository > (which I have been referring to as the "expired.git") can then be used > as a donor repository for any missing objects (like the ones described > by the race in [1]). > [...] > [1]: https://lore.kernel.org/git/YryF+vkosJOXf+mQ@nand.local/ I think the best description of that race on-list is this by Jeff King, if so I think it would be nice to work it into a commit message (for 4/4): https://public-inbox.org/git/20190319001829.GL29661@sigill.intra.peff.net/ Downthread of that, starting at: https://public-inbox.org/git/878svjj4t5.fsf@evledraar.gmail.com/ I describe a proposed mechanism to address the race condition, which seems to me be functionally the same as parts of what you're proposing here. I.e. the "limbo" here being the same as the proposed "gc quarantine". The main difference being one that this RFC leaves on the table, which is how you'd get these objects back into the non-cruft repository once they're erroneously/racily expired. I imagined that we'd add it as a special alternate, read it last, and make the object reading code aware that any object needed from such an alternate is one that we'd need to COR over to our primary repository: https://public-inbox.org/git/8736lnxlig.fsf@evledraar.gmail.com/ Whereas it seems like you're imagining just having the "cruft pack" repository around so that a support engineer can manually recover from corruption, or have some other out-of-tree mechanism not part of this RFC to (semi-?)automate that step. If you haven't it would be nice if you could read that thread & see if what I'm describing there is essentially a superset of what you have here, and if any of the concerns Jeff King brought up there are ones you think apply here. Particularly as there was a reference to an off-list (presumably at GitHub) discussion with Michael Haggerty about these sorts of races. I don't know if either Jeff or Michael were involved in the discussions you had. I think that the mechanism I proposed there was subtly different from what Jeff was concerned about being racy, but that thread was left hanging as the last reply is from me trying to clarify that point. So maybe I'm wrong, but I think if that was the case you'd also be wrong about this approach being viable, so it would be nice to clear that up :) I'd also be very interested to know if you have anything like my proposed auto-healing via a special alternate planned. I think that would allow aggressive pruning of live repositories not by fixing our underlying race conditions, but by "leaning into" them as it were. I.e. we'd race even more, but as we could always auto-heal by "no, I'll actually need that" COR-ing the relevant object(s) back from the "gc quarantine" (or your "cruft repository") that would be OK.
Taylor Blau <me@ttaylorr.com> writes: > We don't usually do pruning GC's except during manual intervention or > upon request through a support ticket. But when we do it is often > infeasible to lock the entire network's push traffic and reference > updates. So it is not an unheard of event to encounter the race that I > described above. > > The idea is that, at least for non-sensitive pruning, we would move the > pruned objects to a separate repository and hold them there until we > could run `git fsck` on the repository after pruning and verify that the > repository is intact. If it is, then the expired.git repository can be > emptied, too, permanently removing the pruned objects. If not, the > expired.git repository then becomes a donor for the missing objects, > which are used to heal the corrupt main repository. Once *that* is done, > and fsck comes back clean, then the expired.git repository can be > removed. Thanks for the information. Presumably, during such pruning GCs (because manual intervention was needed or because of a support ticket), you would use a cruft expiration of "now", not something like "14 days ago". If so, more on this specific case later... > > I think there is at least one more alternative that should be > > considered, though: since the cruft pack is unlikely to have its objects > > "resurrected" (since the reason why they're there is because they are > > unreachable), it is likely that the objects that are pruned are exactly > > the same as those in the craft pack. So it would be more efficient to > > just unconditionally rename the cruft pack to the backup destination. > > This isn't quite right. The contents that are written into the > expired.git repository is everything that *didn't* end up in the cruft > pack. I reread what I wrote and realized that I didn't give a good description of what I was thinking. So hopefully this is a better one: I was thinking of the case in which GC is regularly run on a repo, say, every 14 days with a 14-day expiration time. So on the first run you would have: a1.pack a2.pack (+ .mtime) Reachable Unreachable (cruft pack) and on the second run, assuming this patch set is in effect: b1.pack b2.pack (+ .mtime) expired.git/b3.pack Reachable Unreachable (cruft pack) In expired.git and my idea was that it is very likely that a2.pack and expired.git/b3.pack are the same, so I thought that a feature in which cruft packs could be moved instead of deleted would be more efficient. Going back to the specific case at the start of this email. I now see that my idea wouldn't work in that specific case, because you would want to generate expired.git packs from a repository that is unlikely to have cruft packs (or, at least, it is unlikely that the existing cruft packs match exactly what you would write in expired.git). If it is likely that cruft pack(s) would only be written in one place (whether in the same repository or in expired.git, but not both), another design option would be to be able to tell Git where to write cruft packs, but not give Git the ability to write cruft packs both to the same repository and expired.git. (Unless in your use case, Taylor, you envision that you would occasionally need to write cruft packs in both places.) That would simplify the UI and the code a bit, but not by much (this patch set, as written, is already elegantly written), so I don't feel too strongly about it and I would be happy with the current pattern.