mbox series

[RFC,0/4] move pruned objects to a separate repository

Message ID cover.1656528343.git.me@ttaylorr.com (mailing list archive)
Headers show
Series move pruned objects to a separate repository | expand

Message

Taylor Blau June 29, 2022, 6:45 p.m. UTC
Now that cruft packs are available in v2.37.0, here is an interesting
application of that new feature to enable a two-phase object pruning
approach.

This came out of a discussion within GitHub about ways we could support
storing a set of pruned objects in "limbo" so that they were not
accessible from the repository which pruned them, but instead stored in
a cruft pack in a separate repository which lists the original one as an
alternate.

This makes it possible to take the collection of all pruned objects and
store them in a cruft pack in a separate repository. This repository
(which I have been referring to as the "expired.git") can then be used
as a donor repository for any missing objects (like the ones described
by the race in [1]).

The first few patches are preparatory. The final one implements writing
the pruned objects separately. The trick is to write another cruft pack
to a separate repository, with two tweaks:

  - the `--cruft-expiration` value is set to "never", since we want to
    keep around all of the objects we expired in the previous step, and

  - the original cruft pack appears as a pack that we are going to keep,
    meaning all unreachable objects that are stored in the original
    cruft pack are excluded from the one we write to the "expired.git"
    repository.

You can try this out yourself by doing something like:

    $ git init --bare ../expired.git $ git repack --cruft
    --cruft-expiration=1.day.ago -d \
    --expire-to=../expired.git/objects/pack/pack

which will create two cruft packs:

  - one in the repository which ran `git repack` containing all
    unreachable objects written within the last day, and
  - another in the "expired.git" repository which contains all
    unreachable objects written prior to the last day

This series is an RFC for now since I'm interested in discussing whether
or not this is a feature that people would actually want to use or not.
But if it is, I'm happy to polish this up and turn it into a
non-RFC-quality series ;-).

In the meantime, thanks for your review!

[1]: https://lore.kernel.org/git/YryF+vkosJOXf+mQ@nand.local/

Taylor Blau (4):
  builtin/repack.c: pass "out" to `prepare_pack_objects`
  builtin/repack.c: pass "cruft_expiration" to `write_cruft_pack`
  builtin/repack.c: write cruft packs to arbitrary locations
  builtin/repack.c: implement `--expire-to` for storing pruned objects

 Documentation/git-repack.txt |   6 ++
 builtin/repack.c             |  67 ++++++++++++++++---
 t/t7700-repack.sh            | 121 +++++++++++++++++++++++++++++++++++
 3 files changed, 186 insertions(+), 8 deletions(-)

Comments

Jonathan Tan June 29, 2022, 10:54 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:
> This series is an RFC for now since I'm interested in discussing whether
> or not this is a feature that people would actually want to use or not.
> But if it is, I'm happy to polish this up and turn it into a
> non-RFC-quality series ;-).
> 
> In the meantime, thanks for your review!

Thanks for this patch set. I can see this being used by, say, someone
who wants to preserve a repo that rewinds branches all the time (the
refs would need to be backed-up separately, but at least this provides a
way for objects to be stored efficiently, in that reachable objects are
still stored in the main repo and unreachable objects are stored in the
backup with no overlap between them).

I think there is at least one more alternative that should be
considered, though: since the cruft pack is unlikely to have its objects
"resurrected" (since the reason why they're there is because they are
unreachable), it is likely that the objects that are pruned are exactly
the same as those in the craft pack. So it would be more efficient to
just unconditionally rename the cruft pack to the backup destination.

Having said that, if there is a compelling use case for repacking even
when we're moving from cruft pack to backup, the design of this patch
set looks good overall. There are some minor points (e.g. the naming of
the parameter "out" in patch 1), but I understand that this is an RFC
and I'll wait for a non-RFC patch set before looking more closely at
these things.
Taylor Blau June 30, 2022, 2:47 a.m. UTC | #2
On Wed, Jun 29, 2022 at 03:54:04PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > This series is an RFC for now since I'm interested in discussing whether
> > or not this is a feature that people would actually want to use or not.
> > But if it is, I'm happy to polish this up and turn it into a
> > non-RFC-quality series ;-).
> >
> > In the meantime, thanks for your review!
>
> Thanks for this patch set. I can see this being used by, say, someone
> who wants to preserve a repo that rewinds branches all the time (the
> refs would need to be backed-up separately, but at least this provides a
> way for objects to be stored efficiently, in that reachable objects are
> still stored in the main repo and unreachable objects are stored in the
> backup with no overlap between them).

Yes, definitely.

If it helps, I can share a little bit about the motivating use-case
within GitHub. All objects from a fork network are stored together in a
repository that we call the network.git, with individual forks keeping
track of their own references.

The network.git repository can often grow quite large, and/or contain
data that the owner of an individual fork would like removed (e.g., they
accidentally pushed sensitive credentials, force-pushed over it, but
would like the now-unreachable objects to be removed).

We don't usually do pruning GC's except during manual intervention or
upon request through a support ticket. But when we do it is often
infeasible to lock the entire network's push traffic and reference
updates. So it is not an unheard of event to encounter the race that I
described above.

The idea is that, at least for non-sensitive pruning, we would move the
pruned objects to a separate repository and hold them there until we
could run `git fsck` on the repository after pruning and verify that the
repository is intact. If it is, then the expired.git repository can be
emptied, too, permanently removing the pruned objects. If not, the
expired.git repository then becomes a donor for the missing objects,
which are used to heal the corrupt main repository. Once *that* is done,
and fsck comes back clean, then the expired.git repository can be
removed.

> I think there is at least one more alternative that should be
> considered, though: since the cruft pack is unlikely to have its objects
> "resurrected" (since the reason why they're there is because they are
> unreachable), it is likely that the objects that are pruned are exactly
> the same as those in the craft pack. So it would be more efficient to
> just unconditionally rename the cruft pack to the backup destination.

This isn't quite right. The contents that are written into the
expired.git repository is everything that *didn't* end up in the cruft
pack.

Suppose your cruft expiration is 1.hour.ago, and your doing a repack on
repository foo.git, expiring objects into expired.git. There are three
disjoint sets of objects:

  - reachable objects, which will stay in foo.git
  - unreachable objects which were written within the last hour (and are
    thus too new to prune) which will stay in foo.git
  - unreachable objects which *weren't* written within the last hour
    (and thus will be pruned) which are moved to a new pack in
    expired.git (and removed from foo.git)

So the cruft pack in foo.git and the one written to expired.git are a
disjoint cut of the unreachable objects in foo.git based on their mtime,
with the recent objects staying in the source repository and the stale
ones moving to the expired.git repository.

The original implementation of this feature was to move the entire cruft
pack out of the way like you describe. This is sub-optimal because you
are forced to generate that cruft pack with `--cruft-expiration=never`,
since you can't actually prune any objects when generating the cruft
pack, or they would be gone forever. But since you have to move the
entire cruft pack out of the way, the visible effect looks like you
actually pruned *all* unreachable objects, as if you had supplied
`--cruft-expiration=now`.

Being able to expire just the objects which have aged out of the grace
period should cause this race to happen less frequently in practice.

> Having said that, if there is a compelling use case for repacking even
> when we're moving from cruft pack to backup, the design of this patch
> set looks good overall. There are some minor points (e.g. the naming of
> the parameter "out" in patch 1), but I understand that this is an RFC
> and I'll wait for a non-RFC patch set before looking more closely at
> these things.

Thanks,
Taylor
Ævar Arnfjörð Bjarmason June 30, 2022, 8 a.m. UTC | #3
On Wed, Jun 29 2022, Taylor Blau wrote:

> Now that cruft packs are available in v2.37.0, here is an interesting
> application of that new feature to enable a two-phase object pruning
> approach.
>
> This came out of a discussion within GitHub about ways we could support
> storing a set of pruned objects in "limbo" so that they were not
> accessible from the repository which pruned them, but instead stored in
> a cruft pack in a separate repository which lists the original one as an
> alternate.
>
> This makes it possible to take the collection of all pruned objects and
> store them in a cruft pack in a separate repository. This repository
> (which I have been referring to as the "expired.git") can then be used
> as a donor repository for any missing objects (like the ones described
> by the race in [1]).
> [...]
> [1]: https://lore.kernel.org/git/YryF+vkosJOXf+mQ@nand.local/

I think the best description of that race on-list is this by Jeff King,
if so I think it would be nice to work it into a commit message (for
4/4):

	https://public-inbox.org/git/20190319001829.GL29661@sigill.intra.peff.net/

Downthread of that, starting at:

	https://public-inbox.org/git/878svjj4t5.fsf@evledraar.gmail.com/

I describe a proposed mechanism to address the race condition, which
seems to me be functionally the same as parts of what you're proposing
here. I.e. the "limbo" here being the same as the proposed "gc
quarantine".

The main difference being one that this RFC leaves on the table, which
is how you'd get these objects back into the non-cruft repository once
they're erroneously/racily expired. I imagined that we'd add it as a
special alternate, read it last, and make the object reading code aware
that any object needed from such an alternate is one that we'd need to
COR over to our primary repository:

	https://public-inbox.org/git/8736lnxlig.fsf@evledraar.gmail.com/

Whereas it seems like you're imagining just having the "cruft pack"
repository around so that a support engineer can manually recover from
corruption, or have some other out-of-tree mechanism not part of this
RFC to (semi-?)automate that step.

If you haven't it would be nice if you could read that thread & see if
what I'm describing there is essentially a superset of what you have
here, and if any of the concerns Jeff King brought up there are ones you
think apply here.

Particularly as there was a reference to an off-list (presumably at
GitHub) discussion with Michael Haggerty about these sorts of races. I
don't know if either Jeff or Michael were involved in the discussions
you had.

I think that the mechanism I proposed there was subtly different from
what Jeff was concerned about being racy, but that thread was left
hanging as the last reply is from me trying to clarify that point.

So maybe I'm wrong, but I think if that was the case you'd also be wrong
about this approach being viable, so it would be nice to clear that up
:)

I'd also be very interested to know if you have anything like my
proposed auto-healing via a special alternate planned.  I think that
would allow aggressive pruning of live repositories not by fixing our
underlying race conditions, but by "leaning into" them as it were.

I.e. we'd race even more, but as we could always auto-heal by "no, I'll
actually need that" COR-ing the relevant object(s) back from the "gc
quarantine" (or your "cruft repository") that would be OK.
Jonathan Tan June 30, 2022, 9:15 p.m. UTC | #4
Taylor Blau <me@ttaylorr.com> writes:
> We don't usually do pruning GC's except during manual intervention or
> upon request through a support ticket. But when we do it is often
> infeasible to lock the entire network's push traffic and reference
> updates. So it is not an unheard of event to encounter the race that I
> described above.
> 
> The idea is that, at least for non-sensitive pruning, we would move the
> pruned objects to a separate repository and hold them there until we
> could run `git fsck` on the repository after pruning and verify that the
> repository is intact. If it is, then the expired.git repository can be
> emptied, too, permanently removing the pruned objects. If not, the
> expired.git repository then becomes a donor for the missing objects,
> which are used to heal the corrupt main repository. Once *that* is done,
> and fsck comes back clean, then the expired.git repository can be
> removed.

Thanks for the information. Presumably, during such pruning GCs (because
manual intervention was needed or because of a support ticket), you
would use a cruft expiration of "now", not something like "14 days ago".
If so, more on this specific case later...

> > I think there is at least one more alternative that should be
> > considered, though: since the cruft pack is unlikely to have its objects
> > "resurrected" (since the reason why they're there is because they are
> > unreachable), it is likely that the objects that are pruned are exactly
> > the same as those in the craft pack. So it would be more efficient to
> > just unconditionally rename the cruft pack to the backup destination.
> 
> This isn't quite right. The contents that are written into the
> expired.git repository is everything that *didn't* end up in the cruft
> pack.

I reread what I wrote and realized that I didn't give a good description
of what I was thinking. So hopefully this is a better one: I was
thinking of the case in which GC is regularly run on a repo, say, every
14 days with a 14-day expiration time. So on the first run you would
have:

  a1.pack      a2.pack (+ .mtime)
  Reachable    Unreachable (cruft pack)

and on the second run, assuming this patch set is in effect:

  b1.pack      b2.pack (+ .mtime)          expired.git/b3.pack
  Reachable    Unreachable (cruft pack)    In expired.git

and my idea was that it is very likely that a2.pack and
expired.git/b3.pack are the same, so I thought that a feature in which
cruft packs could be moved instead of deleted would be more efficient.

Going back to the specific case at the start of this email. I now see
that my idea wouldn't work in that specific case, because you would want
to generate expired.git packs from a repository that is unlikely to have
cruft packs (or, at least, it is unlikely that the existing cruft packs
match exactly what you would write in expired.git).

If it is likely that cruft pack(s) would only be written in one place
(whether in the same repository or in expired.git, but not both),
another design option would be to be able to tell Git where to write
cruft packs, but not give Git the ability to write cruft packs both to
the same repository and expired.git. (Unless in your use case, Taylor,
you envision that you would occasionally need to write cruft packs in
both places.) That would simplify the UI and the code a bit, but not by
much (this patch set, as written, is already elegantly written), so I
don't feel too strongly about it and I would be happy with the current
pattern.