diff mbox series

[v3,01/17] Documentation/technical: add cruft-packs.txt

Message ID 784ee7e0eec9ba520ebaaa27de2de810e2f6798a.1646266835.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series cruft packs | expand

Commit Message

Taylor Blau March 3, 2022, 12:20 a.m. UTC
Create a technical document to explain cruft packs. It contains a brief
overview of the problem, some background, details on the implementation,
and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/Makefile                  |  1 +
 Documentation/technical/cruft-packs.txt | 97 +++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 Documentation/technical/cruft-packs.txt

Comments

Jonathan Nieder March 7, 2022, 6:03 p.m. UTC | #1
Hi,

Taylor Blau wrote:

> Create a technical document to explain cruft packs. It contains a brief
> overview of the problem, some background, details on the implementation,
> and a couple of alternative approaches not considered here.

Sorry for the very slow review!  I've mentioned a few times that this
overlaps in interesting ways with the gc mechanism described in
hash-function-transition.txt, so I'd like to compare and see how they
interact.

[...]
> --- /dev/null
> +++ b/Documentation/technical/cruft-packs.txt
> @@ -0,0 +1,97 @@
[...]
> +Unreachable objects aren't removed immediately, since doing so could race with
> +an incoming push which may reference an object which is about to be deleted.
> +Instead, those unreachable objects are stored as loose object and stay that way
> +until they are older than the expiration window, at which point they are removed
> +by linkgit:git-prune[1].
> +
> +Git must store these unreachable objects loose in order to keep track of their
> +per-object mtimes.

It's worth noting that this behavior is already racy.  That is because
when an unreachable object becomes newly reachable, we do not update
its mtime and the mtimes of every object reachable from it, so if it
then becomes transiently unreachable again then it can be wrongly
collected.

[...]
>                                these repositories often take up a large amount of
> +disk space, since we can only zlib compress them, but not store them in delta
> +chains.

Yes!  I'm happy we're making progress on this.

> +
> +== Cruft packs
> +
> +A cruft pack eliminates the need for storing unreachable objects in a loose
> +state by including the per-object mtimes in a separate file alongside a single
> +pack containing all loose objects.

Can this doc say a little about how "git prune" handles these files?
In particular, does a non cruft pack aware copy of Git (or JGit,
libgit2, etc) do the right thing or does it fight with this mechanism?
If the latter, do we have a repository extension (extensions.*) to
prevent that?

[...]
> +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> +     timestamps.

As a point of comparison, the design in hash-function-transition uses
a single timestamp for the whole pack.  During read operations, objects
in a cruft pack are considered present; during writes, they are
considered _not present_ so that if we want to make a cruft object
newly present then we put a copy of it in a new pack.

Advantage of the mtimes file approach:
- less duplication of storage: a revived object is only stored once,
  in a cruft pack, and then the next gc can "graduate" it out of the
  cruft pack and shrink the cruft pack
- less affect on non-gc Git code: writes don't need to know that any
  cruft objects referenced need to be copied into a new pack

Advantages of the mtime per cruft pack approach:
- easy expiration: once a cruft pack has reached its expiration date,
  it can be deleted as a whole
- less I/O churn: a cruft pack stays as-is until combined into another
  cruft pack or deleted.  There is no frequently-modified mtimes file
  associated to it
- informs the storage layer about what is likely to be accessed: cruft
  packs can get filesystem attributes to put them in less-optimized
  storage since they are likely to be less frequently read

[...]
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Storing unreachable objects in multiple cruft packs.
> +
> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Storing unreachable objects among multiple cruft packs (e.g., creating a new
> +cruft pack during each repacking operation including only unreachable objects
> +which aren't already stored in an earlier cruft pack) is significantly more
> +complicated to construct, and so aren't pursued here. The obvious drawback to
> +the current implementation is that the entire cruft pack must be re-written from
> +scratch.

This doesn't mention the approach described in
hash-function-transition.txt (and that's already implemented and has
been in use for many years in JGit's DfsRepository).  Does that mean
you aren't aware of it?

Thanks,
Jonathan
Taylor Blau March 22, 2022, 1:16 a.m. UTC | #2
On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:
> Sorry for the very slow review!  I've mentioned a few times that this
> overlaps in interesting ways with the gc mechanism described in
> hash-function-transition.txt, so I'd like to compare and see how they
> interact.

Sorry for my equally-slow reply ;). I was on vacation last week and
wasn't following the list closely.

> > +Unreachable objects aren't removed immediately, since doing so could race with
> > +an incoming push which may reference an object which is about to be deleted.
> > +Instead, those unreachable objects are stored as loose object and stay that way
> > +until they are older than the expiration window, at which point they are removed
> > +by linkgit:git-prune[1].
> > +
> > +Git must store these unreachable objects loose in order to keep track of their
> > +per-object mtimes.
>
> It's worth noting that this behavior is already racy.  That is because
> when an unreachable object becomes newly reachable, we do not update
> its mtime and the mtimes of every object reachable from it, so if it
> then becomes transiently unreachable again then it can be wrongly
> collected.

Just to be clear, the race here only happens if the object in question
becomes reachable _after_ a pruning GC determines its mtime. If that's
the case, then the object will indeed be wrongly collected. This is
consistent with the existing behavior (which is racy in the exact same
way).

(After re-reading what you wrote and my response, I think we are saying
the exact same thing, but it doesn't hurt to think aloud).

> > +
> > +== Cruft packs
> > +
> > +A cruft pack eliminates the need for storing unreachable objects in a loose
> > +state by including the per-object mtimes in a separate file alongside a single
> > +pack containing all loose objects.
>
> Can this doc say a little about how "git prune" handles these files?
> In particular, does a non cruft pack aware copy of Git (or JGit,
> libgit2, etc) do the right thing or does it fight with this mechanism?
> If the latter, do we have a repository extension (extensions.*) to
> prevent that?

I mentioned this in much more detail in [1], but the answer is that the
cruft pack looks like any other pack, it just happens to have another
metadata file (the .mtimes one) attached to it. So other implementations
of Git should treat it as they would any other pack. Like I mentioned in
[1], cruft packs were designed with the explicit goal of not requiring a
repository extension.

> > +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> > +     timestamps.
>
> As a point of comparison, the design in hash-function-transition uses
> a single timestamp for the whole pack.  During read operations, objects
> in a cruft pack are considered present; during writes, they are
> considered _not present_ so that if we want to make a cruft object
> newly present then we put a copy of it in a new pack.
>
> Advantage of the mtimes file approach:
> - less duplication of storage: a revived object is only stored once,
>   in a cruft pack, and then the next gc can "graduate" it out of the
>   cruft pack and shrink the cruft pack
> - less affect on non-gc Git code: writes don't need to know that any
>   cruft objects referenced need to be copied into a new pack
>
> Advantages of the mtime per cruft pack approach:
> - easy expiration: once a cruft pack has reached its expiration date,
>   it can be deleted as a whole
> - less I/O churn: a cruft pack stays as-is until combined into another
>   cruft pack or deleted.  There is no frequently-modified mtimes file
>   associated to it
> - informs the storage layer about what is likely to be accessed: cruft
>   packs can get filesystem attributes to put them in less-optimized
>   storage since they are likely to be less frequently read
>
> [...]

The key advantage of cruft packs is that you can expire unreachable
objects in piecemeal while still retaining the benefit of being able to
de-duplicate cruft objects and store them packed against each other.

> > +Notable alternatives to this design include:
>
> This doesn't mention the approach described in
> hash-function-transition.txt (and that's already implemented and has
> been in use for many years in JGit's DfsRepository).  Does that mean
> you aren't aware of it?

Implementing the UNREACHABLE_GARBAGE concept from
hash-function-transition.txt in cruft pack-terms would be equivalent to
not writing the mtimes file at all. This follows from the fact that a
pre-cruft packs implementation of Git considers a packed object's mtime
to be the same as the pack it's contained in. (I'm deliberately
avoiding any details from the h-f-t document regarding re-writing
objects contained in a garbage pack here, since this is separate from
the pack structure itself (and could easily be implemented on top of
cruft packs)).

So I'm not sure what the alternative we'd list would be, since it
removes the key feature of the design of cruft packs.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YiZMhuI%2FDdpvQ%2FED@nand.local/
Jonathan Nieder March 22, 2022, 9:45 p.m. UTC | #3
Hi,

Taylor Blau wrote:
> On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:

>> Sorry for the very slow review!  I've mentioned a few times that this
>> overlaps in interesting ways with the gc mechanism described in
>> hash-function-transition.txt, so I'd like to compare and see how they
>> interact.
>
> Sorry for my equally-slow reply ;). I was on vacation last week and
> wasn't following the list closely.

No problem --- thanks for getting back to me.

[...]
> (After re-reading what you wrote and my response, I think we are saying
> the exact same thing, but it doesn't hurt to think aloud).

Great.  Can the doc cover this?  I think it would be helpful to make
that easy to find for others with similar questions.

If it's a matter of finding enough time to write some text, let me
know and I can try to find some time to help.

[...]
>> Can this doc say a little about how "git prune" handles these files?
>> In particular, does a non cruft pack aware copy of Git (or JGit,
>> libgit2, etc) do the right thing or does it fight with this mechanism?
>> If the latter, do we have a repository extension (extensions.*) to
>> prevent that?
>
> I mentioned this in much more detail in [1], but the answer is that the
> cruft pack looks like any other pack, it just happens to have another
> metadata file (the .mtimes one) attached to it. So other implementations
> of Git should treat it as they would any other pack. Like I mentioned in
> [1], cruft packs were designed with the explicit goal of not requiring a
> repository extension.

Sorry, the above seems like it's answering a different question than I
asked.  The doc in Documentation/technical/ seems like a natural place
to describe what semantics the new .mtimes file has, and I didn't find
that there.  Is there a different piece of documentation I should have
been looking at?

Can you tell me a little more about why we would want _not_ to have a
repository format extension?  To me, it seems like a fairly simple
addition that would drastically reduce the cognitive overload for
people considering making use of this feature.

[...]
> The key advantage of cruft packs is that you can expire unreachable
> objects in piecemeal while still retaining the benefit of being able to
> de-duplicate cruft objects and store them packed against each other.

Can you say a little more about this?  My experience with the similar
feature in JGit is that it has been helpful to be able to expire a
cruft pack altogether; since objects that became reachable around the
same time get packed at the same time, it's not obvious to me what
benefit this extra piecemeal capability brings.

That doesn't mean the benefit doesn't exist, just that it seems like
there's a piece of context I'm still missing.

>>> +Notable alternatives to this design include:
>>
>> This doesn't mention the approach described in
>> hash-function-transition.txt (and that's already implemented and has
>> been in use for many years in JGit's DfsRepository).  Does that mean
>> you aren't aware of it?
>
> Implementing the UNREACHABLE_GARBAGE concept from
> hash-function-transition.txt in cruft pack-terms would be equivalent to
> not writing the mtimes file at all. This follows from the fact that a
> pre-cruft packs implementation of Git considers a packed object's mtime
> to be the same as the pack it's contained in. (I'm deliberately
> avoiding any details from the h-f-t document regarding re-writing
> objects contained in a garbage pack here, since this is separate from
> the pack structure itself (and could easily be implemented on top of
> cruft packs)).
>
> So I'm not sure what the alternative we'd list would be, since it
> removes the key feature of the design of cruft packs.

Sorry, I don't understand this answer either.  Do you mean to say that
JGit's DfsRepository does not in fact have a cruft packs like feature
that is live in the wild?  Or that that feature is equivalent to not
having such a feature?  Or something else?

To be clear, I'm not trying to say that that's superior to what you've
proposed here --- only that documenting the comparison would be
useful.

Puzzled,
Jonathan
Taylor Blau March 22, 2022, 10:02 p.m. UTC | #4
On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:
> Hi,
>
> Taylor Blau wrote:
> > On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote:
>
> >> Sorry for the very slow review!  I've mentioned a few times that this
> >> overlaps in interesting ways with the gc mechanism described in
> >> hash-function-transition.txt, so I'd like to compare and see how they
> >> interact.
> >
> > Sorry for my equally-slow reply ;). I was on vacation last week and
> > wasn't following the list closely.
>
> No problem --- thanks for getting back to me.
>
> [...]
> > (After re-reading what you wrote and my response, I think we are saying
> > the exact same thing, but it doesn't hurt to think aloud).
>
> Great.  Can the doc cover this?  I think it would be helpful to make
> that easy to find for others with similar questions.

I believe the doc covers this already, see the paragraph beginning with
"Unreachable objects aren't removed immediately...".

> >> Can this doc say a little about how "git prune" handles these files?
> >> In particular, does a non cruft pack aware copy of Git (or JGit,
> >> libgit2, etc) do the right thing or does it fight with this mechanism?
> >> If the latter, do we have a repository extension (extensions.*) to
> >> prevent that?
> >
> > I mentioned this in much more detail in [1], but the answer is that the
> > cruft pack looks like any other pack, it just happens to have another
> > metadata file (the .mtimes one) attached to it. So other implementations
> > of Git should treat it as they would any other pack. Like I mentioned in
> > [1], cruft packs were designed with the explicit goal of not requiring a
> > repository extension.
>
> Sorry, the above seems like it's answering a different question than I
> asked.  The doc in Documentation/technical/ seems like a natural place
> to describe what semantics the new .mtimes file has, and I didn't find
> that there.  Is there a different piece of documentation I should have
> been looking at?

Are you looking for a technical description of the mtimes file? If so,
there is a section in Documentation/technical/pack-format.txt (added in
"pack-mtimes: support reading .mtimes files") that explains this.

> Can you tell me a little more about why we would want _not_ to have a
> repository format extension?  To me, it seems like a fairly simple
> addition that would drastically reduce the cognitive overload for
> people considering making use of this feature.

There is no reason to prevent a pre-cruft packs version of Git from
reading/writing a repository that uses cruft packs, since the two
versions will still function as normal. Since there's no need to prevent
the old version from interacting with a repository that has cruft packs,
we wouldn't want to enforce an unnecessary boundary with an extension.

> [...]
> > The key advantage of cruft packs is that you can expire unreachable
> > objects in piecemeal while still retaining the benefit of being able to
> > de-duplicate cruft objects and store them packed against each other.
>
> Can you say a little more about this?  My experience with the similar
> feature in JGit is that it has been helpful to be able to expire a
> cruft pack altogether; since objects that became reachable around the
> same time get packed at the same time, it's not obvious to me what
> benefit this extra piecemeal capability brings.
>
> That doesn't mean the benefit doesn't exist, just that it seems like
> there's a piece of context I'm still missing.

Expiring objects in piecemeal is somewhat interesting, but I think I was
reaching a little too far when I said it was the "key benefit". It does
have some nice properties, like being able to store cruft objects as
deltas against other cruft objects which might get pruned at a different
time (though, of course, you'll need to re-delta them in the case you do
prune an object which is the base of another cruft object).

But the issue with having multiple cruft packs is that the semantics get
significantly more complicated. E.g., if you have an object represented
in multiple cruft packs, which mtime do you use? If you want to prune
it, you suddenly may have many packs you need to update and keep track
of.

> >>> +Notable alternatives to this design include:
> >>
> >> This doesn't mention the approach described in
> >> hash-function-transition.txt (and that's already implemented and has
> >> been in use for many years in JGit's DfsRepository).  Does that mean
> >> you aren't aware of it?
> >
> > Implementing the UNREACHABLE_GARBAGE concept from
> > hash-function-transition.txt in cruft pack-terms would be equivalent to
> > not writing the mtimes file at all. This follows from the fact that a
> > pre-cruft packs implementation of Git considers a packed object's mtime
> > to be the same as the pack it's contained in. (I'm deliberately
> > avoiding any details from the h-f-t document regarding re-writing
> > objects contained in a garbage pack here, since this is separate from
> > the pack structure itself (and could easily be implemented on top of
> > cruft packs)).
> >
> > So I'm not sure what the alternative we'd list would be, since it
> > removes the key feature of the design of cruft packs.
>
> Sorry, I don't understand this answer either.  Do you mean to say that
> JGit's DfsRepository does not in fact have a cruft packs like feature
> that is live in the wild?  Or that that feature is equivalent to not
> having such a feature?  Or something else?
>
> To be clear, I'm not trying to say that that's superior to what you've
> proposed here --- only that documenting the comparison would be
> useful.

I'm not familiar enough with JGit (or its DfsRepository class) to know
how to answer this. I was comparing cruft packs to the
UNREACHABLE_GARBAGE concept mentioned in the hash-function-transition
doc, and noting the differences there.

Thanks,
Taylor
Jonathan Nieder March 22, 2022, 11:04 p.m. UTC | #5
Hi,

Taylor Blau wrote:
> On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:

>> Great.  Can the doc cover this?  I think it would be helpful to make
>> that easy to find for others with similar questions.
>
> I believe the doc covers this already, see the paragraph beginning with
> "Unreachable objects aren't removed immediately...".

Thanks.  I just reread that section and it didn't say anything obvious
about the race that continues to exist and whether cruft packs address
it.

[...]
>> Sorry, the above seems like it's answering a different question than I
>> asked.  The doc in Documentation/technical/ seems like a natural place
>> to describe what semantics the new .mtimes file has, and I didn't find
>> that there.  Is there a different piece of documentation I should have
>> been looking at?
>
> Are you looking for a technical description of the mtimes file? If so,
> there is a section in Documentation/technical/pack-format.txt (added in
> "pack-mtimes: support reading .mtimes files") that explains this.

I see --- is the idea that cruft-packs.txt means to refer to
pack-format.txt for the details, and cruft-packs is an overview of some
other, non-detail aspects?

I just checked pack-format.txt and it didn't describe the semantics
(what a Git implementation is expected to do when it sees an mtimes
file).  For example, in Documentation/technical/cruft-packs.txt, the
kind of thing I'd expect to see is

- what does an mtime value in the mtimes file represent?  When is it
  meant to be updated?
- what guarantees are present about when an object is safe to be
  pruned?

[...]
>> Can you tell me a little more about why we would want _not_ to have a
>> repository format extension?  To me, it seems like a fairly simple
>> addition that would drastically reduce the cognitive overload for
>> people considering making use of this feature.
>
>There is no reason to prevent a pre-cruft packs version of Git from
> reading/writing a repository that uses cruft packs, since the two
> versions will still function as normal. Since there's no need to prevent
> the old version from interacting with a repository that has cruft packs,
> we wouldn't want to enforce an unnecessary boundary with an extension.

Does "function as normal" include in repository maintenance operations
like "git maintenance", "git gc", and "git prune"?  If so, this seems
like something very useful to describe in the cruft-packs.txt
document, since what happens when we bounce back and forth between old
and new versions of Git operating on the same NFS mounted repository
would not be obvious without such a discussion.

I'm still interested in the _downsides_ of using a repository format
extension.  "There is no reason" is not a downside, unless you mean
that it requires adding a line of code. :)  The main downside I can
imagine is that it prevents accessing the repository _that has enabled
this feature_ with an older version of Git, but I (perhaps due to a
failure of imagination) haven't put two and two together yet about
when I would want to do so.

[...]
> Expiring objects in piecemeal is somewhat interesting, but I think I was
> reaching a little too far when I said it was the "key benefit". It does
> have some nice properties, like being able to store cruft objects as
> deltas against other cruft objects which might get pruned at a different
> time (though, of course, you'll need to re-delta them in the case you do
> prune an object which is the base of another cruft object).
>
> But the issue with having multiple cruft packs is that the semantics get
> significantly more complicated. E.g., if you have an object represented
> in multiple cruft packs, which mtime do you use? If you want to prune
> it, you suddenly may have many packs you need to update and keep track
> of.

Thanks for this explanation.  In hash-function-transition.txt, I see

	"git gc" currently expels any unreachable objects it encounters in
	pack files to loose objects in an attempt to prevent a race when
	pruning them (in case another process is simultaneously writing a new
	object that refers to the about-to-be-deleted object). This leads to
	an explosion in the number of loose objects present and disk space
	usage due to the objects in delta form being replaced with independent
	loose objects.  Worse, the race is still present for loose objects.

	Instead, "git gc" will need to move unreachable objects to a new
	packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
	below). To avoid the race when writing new objects referring to an
	about-to-be-deleted object, code paths that write new objects will
	need to copy any objects from UNREACHABLE_GARBAGE packs that they
	refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects).
	UNREACHABLE_GARBAGE are then safe to delete if their creation time (as
	indicated by the file's mtime) is long enough ago.

	To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
	combined under certain circumstances. [etc]

So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE
pack refers to when that pack was written and governs when that pack
can be deleted.  If an object is present in multiple packs, then newer
packs with the object have a newer mtime and thus cause the object to
be kept around for longer.

[...]
>> Sorry, I don't understand this answer either.  Do you mean to say that
>> JGit's DfsRepository does not in fact have a cruft packs like feature
>> that is live in the wild?  Or that that feature is equivalent to not
>> having such a feature?  Or something else?
>>
>> To be clear, I'm not trying to say that that's superior to what you've
>> proposed here --- only that documenting the comparison would be
>> useful.
>
> I'm not familiar enough with JGit (or its DfsRepository class) to know
> how to answer this. I was comparing cruft packs to the
> UNREACHABLE_GARBAGE concept mentioned in the hash-function-transition
> doc, and noting the differences there.

Thanks.  I think there's some implied feedback about the documentation
of UNREACHABLE_GARBAGE there, because if I understand then you're
saying that it does not describe maintaining cruft packs.  Perhaps a
pointer to the particular sentence that led you to that conclusion
would help.

Sincerely,
Jonathan
Taylor Blau March 23, 2022, 1:01 a.m. UTC | #6
On Tue, Mar 22, 2022 at 04:04:53PM -0700, Jonathan Nieder wrote:
> Hi,
>
> Taylor Blau wrote:
> > On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote:
>
> >> Great.  Can the doc cover this?  I think it would be helpful to make
> >> that easy to find for others with similar questions.
> >
> > I believe the doc covers this already, see the paragraph beginning with
> > "Unreachable objects aren't removed immediately...".
>
> Thanks.  I just reread that section and it didn't say anything obvious
> about the race that continues to exist and whether cruft packs address
> it.

Yeah, there isn't an explicit "and cruft packs addresses the loose
object explosion but does not address the race" sentence. I'm not
opposed to adding something like that to clarify (though TBH, I would
rather do it as a clean-up on top rather than send out a bazillion
mostly-unchanged patches).

> [...]
> >> Sorry, the above seems like it's answering a different question than I
> >> asked.  The doc in Documentation/technical/ seems like a natural place
> >> to describe what semantics the new .mtimes file has, and I didn't find
> >> that there.  Is there a different piece of documentation I should have
> >> been looking at?
> >
> > Are you looking for a technical description of the mtimes file? If so,
> > there is a section in Documentation/technical/pack-format.txt (added in
> > "pack-mtimes: support reading .mtimes files") that explains this.
>
> I see --- is the idea that cruft-packs.txt means to refer to
> pack-format.txt for the details, and cruft-packs is an overview of some
> other, non-detail aspects?
>
> I just checked pack-format.txt and it didn't describe the semantics
> (what a Git implementation is expected to do when it sees an mtimes
> file).  For example, in Documentation/technical/cruft-packs.txt, the
> kind of thing I'd expect to see is

Right; I have always considered the files in Documentation/technical to
primarily be about the file format itself.

> - what does an mtime value in the mtimes file represent?  When is it
>   meant to be updated?
> - what guarantees are present about when an object is safe to be
>   pruned?

The cruft-packs.txt document covers these, though I think somewhat
implicitly. Again, I'm not opposed to more clarification, but I again
would like to do so on top.

I think many of these are discussed within the threads above, but to
answer your questions in order:

  - The mtime of an object in a cruft pack represents the last time that
    object was known to be reachable, and it's updated when generating a
    cruft pack or pruning.

  - The same guarantees are made in the cruft pack case as in the
    non-cruft case (i.e., "none", and so a grace period is recommended).

> [...]
> >> Can you tell me a little more about why we would want _not_ to have a
> >> repository format extension?  To me, it seems like a fairly simple
> >> addition that would drastically reduce the cognitive overload for
> >> people considering making use of this feature.
> >
> >There is no reason to prevent a pre-cruft packs version of Git from
> > reading/writing a repository that uses cruft packs, since the two
> > versions will still function as normal. Since there's no need to prevent
> > the old version from interacting with a repository that has cruft packs,
> > we wouldn't want to enforce an unnecessary boundary with an extension.
>
> Does "function as normal" include in repository maintenance operations
> like "git maintenance", "git gc", and "git prune"?  If so, this seems
> like something very useful to describe in the cruft-packs.txt
> document, since what happens when we bounce back and forth between old
> and new versions of Git operating on the same NFS mounted repository
> would not be obvious without such a discussion.

Yes, all of those commands will simply ignore the .mtimes file and treat
the unreachable objects as normal (where "normal" means in the exact
same way as they currently do without cruft packs). I think adding a
section that summarizes our discussion would be useful.

> I'm still interested in the _downsides_ of using a repository format
> extension.  "There is no reason" is not a downside, unless you mean
> that it requires adding a line of code. :)  The main downside I can
> imagine is that it prevents accessing the repository _that has enabled
> this feature_ with an older version of Git, but I (perhaps due to a
> failure of imagination) haven't put two and two together yet about
> when I would want to do so.

Sorry for not being clear; I meant: "There is no reason [to prohibit
two versions of Git from interacting with each other when they are
compatible to do so]".

> [...]
> > Expiring objects in piecemeal is somewhat interesting, but I think I was
> > reaching a little too far when I said it was the "key benefit". It does
> > have some nice properties, like being able to store cruft objects as
> > deltas against other cruft objects which might get pruned at a different
> > time (though, of course, you'll need to re-delta them in the case you do
> > prune an object which is the base of another cruft object).
> >
> > But the issue with having multiple cruft packs is that the semantics get
> > significantly more complicated. E.g., if you have an object represented
> > in multiple cruft packs, which mtime do you use? If you want to prune
> > it, you suddenly may have many packs you need to update and keep track
> > of.
>
> Thanks for this explanation.  In hash-function-transition.txt, I see
>
> 	"git gc" currently expels any unreachable objects it encounters in
> 	pack files to loose objects in an attempt to prevent a race when
> 	pruning them (in case another process is simultaneously writing a new
> 	object that refers to the about-to-be-deleted object). This leads to
> 	an explosion in the number of loose objects present and disk space
> 	usage due to the objects in delta form being replaced with independent
> 	loose objects.  Worse, the race is still present for loose objects.
>
> 	Instead, "git gc" will need to move unreachable objects to a new
> 	packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
> 	below). To avoid the race when writing new objects referring to an
> 	about-to-be-deleted object, code paths that write new objects will
> 	need to copy any objects from UNREACHABLE_GARBAGE packs that they
> 	refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects).
> 	UNREACHABLE_GARBAGE are then safe to delete if their creation time (as
> 	indicated by the file's mtime) is long enough ago.
>
> 	To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
> 	combined under certain circumstances. [etc]
>
> So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE
> pack refers to when that pack was written and governs when that pack
> can be deleted.  If an object is present in multiple packs, then newer
> packs with the object have a newer mtime and thus cause the object to
> be kept around for longer.

That matches my understanding.

Thanks,
Taylor
Taylor Blau March 28, 2022, 6:46 p.m. UTC | #7
On Tue, Mar 22, 2022 at 09:01:43PM -0400, Taylor Blau wrote:
> > >> Can you tell me a little more about why we would want _not_ to have a
> > >> repository format extension?  To me, it seems like a fairly simple
> > >> addition that would drastically reduce the cognitive overload for
> > >> people considering making use of this feature.
> > >
> > >There is no reason to prevent a pre-cruft packs version of Git from
> > > reading/writing a repository that uses cruft packs, since the two
> > > versions will still function as normal. Since there's no need to prevent
> > > the old version from interacting with a repository that has cruft packs,
> > > we wouldn't want to enforce an unnecessary boundary with an extension.
> >
> > Does "function as normal" include in repository maintenance operations
> > like "git maintenance", "git gc", and "git prune"?  If so, this seems
> > like something very useful to describe in the cruft-packs.txt
> > document, since what happens when we bounce back and forth between old
> > and new versions of Git operating on the same NFS mounted repository
> > would not be obvious without such a discussion.
>
> Yes, all of those commands will simply ignore the .mtimes file and treat
> the unreachable objects as normal (where "normal" means in the exact
> same way as they currently do without cruft packs). I think adding a
> section that summarizes our discussion would be useful.
>
> > I'm still interested in the _downsides_ of using a repository format
> > extension.  "There is no reason" is not a downside, unless you mean
> > that it requires adding a line of code. :)  The main downside I can
> > imagine is that it prevents accessing the repository _that has enabled
> > this feature_ with an older version of Git, but I (perhaps due to a
> > failure of imagination) haven't put two and two together yet about
> > when I would want to do so.
>
> Sorry for not being clear; I meant: "There is no reason [to prohibit
> two versions of Git from interacting with each other when they are
> compatible to do so]".

Jonathan, myself, and others discussed this extensively in today's
standup.

To summarize Jonathan's point (as I think I severely misunderstood it
before), if two writers are repacking a repository with unreachable
objects. The following can happen:

  - $NEWGIT packs the repository and writes a cruft pack and .mtimes
    file.

  - $OLDGIT packs the repository, exploding unreachable objects from the
    cruft pack as loose, setting their mtimes to "now".

This causes the repository to lose information about the unreachable
mtimes, which would cause the repository to never prune objects (except
for when`--unpack-unreachable=now` is passed).

One approach (that Jonathan suggested) is to prevent the above situation
by introducing a format extension, so $OLDGIT could not touch the
repository. But this comes at a (in my view, significant) cost which is
that $OLDGIT can't touch the repository _at all_. An extension would be
desirable if cross-version interaction resulted in repository
corruption, but this scenario does not lead to corruption at all.

Another approach (courtesy Stolee, in an off-list discussion) is that we
could introduce an optional extension available as an opt-in to prevent
older versions of Git from interacting in a repository that contains
cruft packs, but is not required to write them.

A third approach (and probably my preferred direction) is to indicate
clearly via a combination of updates to Documentation/cruft-packs.txt
and the release notes that say something along the lines of:

    If you use are repacking a repository using both a pre- and
    post-cruft packs version of Git, please be aware that you will lose
    information about the mtimes of unreachable objects.

I imagine that would probably be sufficient, but we could also introduce
the opt-in extension as an easy alternative to avoid forcing an upgrade
of Git.

Thanks,
Taylor
Junio C Hamano March 28, 2022, 8:55 p.m. UTC | #8
Taylor Blau <me@ttaylorr.com> writes:

> To summarize Jonathan's point (as I think I severely misunderstood it
> before), if two writers are repacking a repository with unreachable
> objects. The following can happen:
>
>   - $NEWGIT packs the repository and writes a cruft pack and .mtimes
>     file.
>
>   - $OLDGIT packs the repository, exploding unreachable objects from the
>     cruft pack as loose, setting their mtimes to "now".

And if these repeat, alternating new and old versions of Git, we
will keep refreshing the unreachable objects' mtimes forever.

But once you stop using old versions of Git, perhaps in 3 release
cycles or so, we'll eventually be able to purge them, right?

> One approach (that Jonathan suggested) is to prevent the above situation
> by introducing a format extension, so $OLDGIT could not touch the
> repository. But this comes at a (in my view, significant) cost which is
> that $OLDGIT can't touch the repository _at all_. An extension would be
> desirable if cross-version interaction resulted in repository
> corruption, but this scenario does not lead to corruption at all.

A repository may not be in a healthy state, when tons of unreachable
objects stay around forever, but it probably is a bit too harsh to
call it "corrupt".

> Another approach (courtesy Stolee, in an off-list discussion) is that we
> could introduce an optional extension available as an opt-in to prevent
> older versions of Git from interacting in a repository that contains
> cruft packs, but is not required to write them.

That smells too magic; let's not go there.

> A third approach (and probably my preferred direction) is to indicate
> clearly via a combination of updates to Documentation/cruft-packs.txt
> and the release notes that say something along the lines of:
>
>     If you use are repacking a repository using both a pre- and
>     post-cruft packs version of Git, please be aware that you will lose
>     information about the mtimes of unreachable objects.

I do not quite see how it helps.  After hearing "... will lose
information about the mtimes ...", what concrete action can a user
take?  Or a sys-admin?

It's not like use of cruft-pack is mandatory when you upgrade the
new version of Git, right?  Perhaps use of cruft-pack should be
guarded behind a configuration variable so that users who might want
to use mixed versions of Git will be protected against accidental
use of new version of Git that introduces the forever-renewing
untracked objects problem?  

Perhaps a configuration variable, repack.cruftPackEnabled, that is
by default disabled, can be used to protect people who do not want
to get into the "keep refreshing mtime" loop from using the cruft
packs by mistake?  repack.cruftPackEnabled can probably be part of
the "experimental" feature set, if we think it is the direction in
the future.
Taylor Blau March 28, 2022, 9:21 p.m. UTC | #9
On Mon, Mar 28, 2022 at 01:55:43PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > To summarize Jonathan's point (as I think I severely misunderstood it
> > before), if two writers are repacking a repository with unreachable
> > objects. The following can happen:
> >
> >   - $NEWGIT packs the repository and writes a cruft pack and .mtimes
> >     file.
> >
> >   - $OLDGIT packs the repository, exploding unreachable objects from the
> >     cruft pack as loose, setting their mtimes to "now".
>
> And if these repeat, alternating new and old versions of Git, we
> will keep refreshing the unreachable objects' mtimes forever.
>
> But once you stop using old versions of Git, perhaps in 3 release
> cycles or so, we'll eventually be able to purge them, right?

As soon as all of the repackers understand cruft packs, yes.

> > One approach (that Jonathan suggested) is to prevent the above situation
> > by introducing a format extension, so $OLDGIT could not touch the
> > repository. But this comes at a (in my view, significant) cost which is
> > that $OLDGIT can't touch the repository _at all_. An extension would be
> > desirable if cross-version interaction resulted in repository
> > corruption, but this scenario does not lead to corruption at all.
>
> A repository may not be in a healthy state, when tons of unreachable
> objects stay around forever, but it probably is a bit too harsh to
> call it "corrupt".

I agree, though I would note that this is no worse than the situation
today, where unreachable-but-recent objects are already exploded as
loose can already cause the kinds of issues that this series is designed
to prevent.

> > Another approach (courtesy Stolee, in an off-list discussion) is that we
> > could introduce an optional extension available as an opt-in to prevent
> > older versions of Git from interacting in a repository that contains
> > cruft packs, but is not required to write them.
>
> That smells too magic; let's not go there.

I'm not sure... if we did:

--- 8< ---

diff --git a/setup.c b/setup.c
index 04ce33cdcd..fa54c9baa4 100644
--- a/setup.c
+++ b/setup.c
@@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "cruftpacks")) {
+		return EXTENSION_OK;
 	}

--- >8 ---

but nothing more, then a hypothetical `extensions.cruftPacks` could be
used to prevent older writers in a mixed version environment. But if you
don't have or care about older versions of Git, you can avoid setting it
altogether.

The key bit is that we don't have a check along the lines of "only allow
writing a cruft pack when extensions.cruftPacks" is set, so it's opt-in
as far as the new code is concerned.

> > A third approach (and probably my preferred direction) is to indicate
> > clearly via a combination of updates to Documentation/cruft-packs.txt
> > and the release notes that say something along the lines of:
> >
> >     If you use are repacking a repository using both a pre- and
> >     post-cruft packs version of Git, please be aware that you will lose
> >     information about the mtimes of unreachable objects.
>
> I do not quite see how it helps.  After hearing "... will lose
> information about the mtimes ...", what concrete action can a user
> take?  Or a sys-admin?
>
> It's not like use of cruft-pack is mandatory when you upgrade the
> new version of Git, right?  Perhaps use of cruft-pack should be
> guarded behind a configuration variable so that users who might want
> to use mixed versions of Git will be protected against accidental
> use of new version of Git that introduces the forever-renewing
> untracked objects problem?

I don't think we would have much to offer a user in that case; if the
mtimes are gone, then I couldn't think of anything to bring them back
outside of setting them manually.

But cruft packs are already guarded in two places:

  - `git repack` won't write a cruft pack unless given the `--cruft`
    flag (i.e., `git repack -A` doesn't suddenly start generating cruft
    packs upon upgrade).

  - `git gc` won't write cruft packs unless the `gc.cruftPacks`
    configuration is set, or `--cruft` is given as a flag.

I'd be curious what Jonathan and others think of that approach (which,
to be clear, is what this series already implements). We could make it
clear to say:

    If you have mixed versions of Git which both repack a repository
    (either manually or by auto-GC / background maintenance), consider
    leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
    command-line argument to `git repack` and `git gc`, since doing so
    can lead to [...]

> Perhaps a configuration variable, repack.cruftPackEnabled, that is
> by default disabled, can be used to protect people who do not want
> to get into the "keep refreshing mtime" loop from using the cruft
> packs by mistake?  repack.cruftPackEnabled can probably be part of
> the "experimental" feature set, if we think it is the direction in
> the future.

I'd probably want to leave `-A` separate from `--cruft`, since something
about setting `repack.cruftPackEnabled` having the effect of causing
`-A` to produce a cruft pack feels strange to me.

Thanks,
Taylor
Junio C Hamano March 29, 2022, 3:59 p.m. UTC | #10
Taylor Blau <me@ttaylorr.com> writes:

> I'm not sure... if we did:
>
> --- 8< ---
>
> diff --git a/setup.c b/setup.c
> index 04ce33cdcd..fa54c9baa4 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
>  		return EXTENSION_OK;
> +	} else if (!strcmp(ext, "cruftpacks")) {
> +		return EXTENSION_OK;
>  	}
>
> --- >8 ---
>
> but nothing more, then a hypothetical `extensions.cruftPacks` could be
> used to prevent older writers in a mixed version environment. But if you
> don't have or care about older versions of Git, you can avoid setting it
> altogether.

Smells like "unsafe by default, but you can opt into safety", which
is backwards, isn't it?

>> I do not quite see how it helps.  After hearing "... will lose
>> information about the mtimes ...", what concrete action can a user
>> take?  Or a sys-admin?
>>
>> It's not like use of cruft-pack is mandatory when you upgrade the
>> new version of Git, right?  Perhaps use of cruft-pack should be
>> guarded behind a configuration variable so that users who might want
>> to use mixed versions of Git will be protected against accidental
>> use of new version of Git that introduces the forever-renewing
>> untracked objects problem?
>
> I don't think we would have much to offer a user in that case; if the
> mtimes are gone, then I couldn't think of anything to bring them back
> outside of setting them manually.

Yes, so rambling about losing mtimes in documentation or release
notes would not help users all that much.  Let's not do that.

> But cruft packs are already guarded in two places:
>
>   - `git repack` won't write a cruft pack unless given the `--cruft`
>     flag (i.e., `git repack -A` doesn't suddenly start generating cruft
>     packs upon upgrade).
>
>   - `git gc` won't write cruft packs unless the `gc.cruftPacks`
>     configuration is set, or `--cruft` is given as a flag.

Hmph, OK.  So individuals can sort-of protect from hurting
themselves by refraining from running these with --cruft or writing
--cruft in their maintenance scripts.  An organization that wants to
let the more adventurous types to early opt-in can prepare two
versions of the maintenance scripts they distribute to their users,
one with and the other without --cruft, and use the mechanism they
use for gradual rollouts to control the population.  Perhaps that
would make sufficient protection?  I dunno.

Jonathan, what do you think?

> I'd be curious what Jonathan and others think of that approach (which,
> to be clear, is what this series already implements). We could make it
> clear to say:
>
>     If you have mixed versions of Git which both repack a repository
>     (either manually or by auto-GC / background maintenance), consider
>     leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
>     command-line argument to `git repack` and `git gc`, since doing so
>     can lead to [...]

That message is (depending on what comes in [...]) much more helpful
than just throwing a word "mtime" out and letting the reader figure
out the rest ;-)
Taylor Blau March 30, 2022, 2:23 a.m. UTC | #11
On Tue, Mar 29, 2022 at 08:59:24AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > I'm not sure... if we did:
> >
> > --- 8< ---
> >
> > diff --git a/setup.c b/setup.c
> > index 04ce33cdcd..fa54c9baa4 100644
> > --- a/setup.c
> > +++ b/setup.c
> > @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var,
> >  		return EXTENSION_OK;
> > +	} else if (!strcmp(ext, "cruftpacks")) {
> > +		return EXTENSION_OK;
> >  	}
> >
> > --- >8 ---
> >
> > but nothing more, then a hypothetical `extensions.cruftPacks` could be
> > used to prevent older writers in a mixed version environment. But if you
> > don't have or care about older versions of Git, you can avoid setting it
> > altogether.
>
> Smells like "unsafe by default, but you can opt into safety", which
> is backwards, isn't it?

I see it a little differently. The default (not writing cruft packs at
all) is safe, even in a mixed-version environment. If a user (a) wants
to use cruft packs, and (b) has older versions of Git also gc'ing the
repository, and (c) can't get rid of them, _then_ an opt-in extension
would make it impossible for those older versions to interact with the
repository.

I still can't shake the feeling that this is a pretty fringe and
timing-dependent scenario, which at worst keeps too many unreachable
objects around.

But I think this in conjunction with the already opt-in nature of cruft
packs would be a nice way to create safeguards for the situation
Jonathan described. There may be a simpler way, but I'm not sure I see
it (i.e., if you control whether or not `--cruft` is passed when
doing maintenance with newer versions of Git, but not whether older
versions are running around doing their own maintenance, then an
extension would be necessary to lock the old versions out).

> > But cruft packs are already guarded in two places:
> >
> >   - `git repack` won't write a cruft pack unless given the `--cruft`
> >     flag (i.e., `git repack -A` doesn't suddenly start generating cruft
> >     packs upon upgrade).
> >
> >   - `git gc` won't write cruft packs unless the `gc.cruftPacks`
> >     configuration is set, or `--cruft` is given as a flag.
>
> Hmph, OK.  So individuals can sort-of protect from hurting
> themselves by refraining from running these with --cruft or writing
> --cruft in their maintenance scripts.  An organization that wants to
> let the more adventurous types to early opt-in can prepare two
> versions of the maintenance scripts they distribute to their users,
> one with and the other without --cruft, and use the mechanism they
> use for gradual rollouts to control the population.  Perhaps that
> would make sufficient protection?  I dunno.
>
> Jonathan, what do you think?

I'm confused: if newer versions of Git are writing cruft packs, then
having the older versions gc'ing in the same repository runs into the
same scenario Jonathan originally describes.

The thing I think Jonathan seeks to prevent is older versions of Git
gc'ing a repo that has cruft packs. I think I may need you to clarify a
little, sorry :-(.

> > I'd be curious what Jonathan and others think of that approach (which,
> > to be clear, is what this series already implements). We could make it
> > clear to say:
> >
> >     If you have mixed versions of Git which both repack a repository
> >     (either manually or by auto-GC / background maintenance), consider
> >     leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a
> >     command-line argument to `git repack` and `git gc`, since doing so
> >     can lead to [...]
>
> That message is (depending on what comes in [...]) much more helpful
> than just throwing a word "mtime" out and letting the reader figure
> out the rest ;-)

Yes, totally agreed.

Thanks,
Taylor
Junio C Hamano March 30, 2022, 1:37 p.m. UTC | #12
Taylor Blau <me@ttaylorr.com> writes:

> The thing I think Jonathan seeks to prevent is older versions of Git
> gc'ing a repo that has cruft packs. I think I may need you to clarify a
> little, sorry :-(.

By making controlled rollout of the use of "--cruft" option (and the
assumption here is that a large organization setting people do not
manually say "gc --cruft", and they can ship their maintenance
scripts that may be run via cron or whatever with and without
"--cruft"), you can control the number of repositories that can
potentially see older versions of Git running gc on with cruft
packs.  Those users, for whom it is not their turn to start using
"--cruft" enabled version of the script, will not have cruft packs,
so it does not matter if they keep an older version of Git somewhere
hidden in a hermetic build of an IDE that bundles Git and gc kicks
in for them.
Taylor Blau March 30, 2022, 5:30 p.m. UTC | #13
On Wed, Mar 30, 2022 at 06:37:54AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > The thing I think Jonathan seeks to prevent is older versions of Git
> > gc'ing a repo that has cruft packs. I think I may need you to clarify a
> > little, sorry :-(.
>
> By making controlled rollout of the use of "--cruft" option (and the
> assumption here is that a large organization setting people do not
> manually say "gc --cruft", and they can ship their maintenance
> scripts that may be run via cron or whatever with and without
> "--cruft"), you can control the number of repositories that can
> potentially see older versions of Git running gc on with cruft
> packs.  Those users, for whom it is not their turn to start using
> "--cruft" enabled version of the script, will not have cruft packs,
> so it does not matter if they keep an older version of Git somewhere
> hidden in a hermetic build of an IDE that bundles Git and gc kicks
> in for them.

Ahh, OK. Thanks for explaining: this is what I was pretty sure you
meant, but I wanted to make sure before agreeing to it.

Yes, this solution amounts to: "if you have mixed-versions of Git
mutually gc'ing a repository, then use the same rollout method used for
controlling Git itself to guard when to start creating cruft packs".

I would be very eager to hear if this works for Jonathan's case. It
should do the trick, I'd think.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ed656db2ae..0b01c9408e 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -91,6 +91,7 @@  TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
new file mode 100644
index 0000000000..2c3c5d93f8
--- /dev/null
+++ b/Documentation/technical/cruft-packs.txt
@@ -0,0 +1,97 @@ 
+= Cruft packs
+
+The cruft packs feature offer an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose object and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. But given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Since we
+can never pack those objects, these repositories often take up a large amount of
+disk space, since we can only zlib compress them, but not store them in delta
+chains.
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all loose objects.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack.
+linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
+is a classic all-into-one repack, meaning that everything in the resulting pack is
+reachable, and everything else is unreachable. Once written, the `--cruft`
+option instructs `git repack` to generate another pack containing only objects
+not packed in the previous step (which equates to packing all unreachable
+objects together). This progresses as follows:
+
+  1. Enumerate every object, marking any object which is (a) not contained in a
+     kept-pack, and (b) whose mtime is within the grace period as a traversal
+     tip.
+
+  2. Perform a reachability traversal based on the tips gathered in the previous
+     step, adding every object along the way to the pack.
+
+  3. Write the pack out, along with a `.mtimes` file that records the per-object
+     timestamps.
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+  - The location of the per-object mtime data, and
+  - Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so aren't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.