Message ID | 784ee7e0eec9ba520ebaaa27de2de810e2f6798a.1646266835.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | cruft packs | expand |
Hi, Taylor Blau wrote: > Create a technical document to explain cruft packs. It contains a brief > overview of the problem, some background, details on the implementation, > and a couple of alternative approaches not considered here. Sorry for the very slow review! I've mentioned a few times that this overlaps in interesting ways with the gc mechanism described in hash-function-transition.txt, so I'd like to compare and see how they interact. [...] > --- /dev/null > +++ b/Documentation/technical/cruft-packs.txt > @@ -0,0 +1,97 @@ [...] > +Unreachable objects aren't removed immediately, since doing so could race with > +an incoming push which may reference an object which is about to be deleted. > +Instead, those unreachable objects are stored as loose object and stay that way > +until they are older than the expiration window, at which point they are removed > +by linkgit:git-prune[1]. > + > +Git must store these unreachable objects loose in order to keep track of their > +per-object mtimes. It's worth noting that this behavior is already racy. That is because when an unreachable object becomes newly reachable, we do not update its mtime and the mtimes of every object reachable from it, so if it then becomes transiently unreachable again then it can be wrongly collected. [...] > these repositories often take up a large amount of > +disk space, since we can only zlib compress them, but not store them in delta > +chains. Yes! I'm happy we're making progress on this. > + > +== Cruft packs > + > +A cruft pack eliminates the need for storing unreachable objects in a loose > +state by including the per-object mtimes in a separate file alongside a single > +pack containing all loose objects. Can this doc say a little about how "git prune" handles these files? In particular, does a non cruft pack aware copy of Git (or JGit, libgit2, etc) do the right thing or does it fight with this mechanism? If the latter, do we have a repository extension (extensions.*) to prevent that? [...] > + 3. Write the pack out, along with a `.mtimes` file that records the per-object > + timestamps. As a point of comparison, the design in hash-function-transition uses a single timestamp for the whole pack. During read operations, objects in a cruft pack are considered present; during writes, they are considered _not present_ so that if we want to make a cruft object newly present then we put a copy of it in a new pack. Advantage of the mtimes file approach: - less duplication of storage: a revived object is only stored once, in a cruft pack, and then the next gc can "graduate" it out of the cruft pack and shrink the cruft pack - less affect on non-gc Git code: writes don't need to know that any cruft objects referenced need to be copied into a new pack Advantages of the mtime per cruft pack approach: - easy expiration: once a cruft pack has reached its expiration date, it can be deleted as a whole - less I/O churn: a cruft pack stays as-is until combined into another cruft pack or deleted. There is no frequently-modified mtimes file associated to it - informs the storage layer about what is likely to be accessed: cruft packs can get filesystem attributes to put them in less-optimized storage since they are likely to be less frequently read [...] > +Notable alternatives to this design include: > + > + - The location of the per-object mtime data, and > + - Storing unreachable objects in multiple cruft packs. > + > +On the location of mtime data, a new auxiliary file tied to the pack was chosen > +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain > +support for optional chunks of data, it may make sense to consolidate the > +`.mtimes` format into the `.idx` itself. > + > +Storing unreachable objects among multiple cruft packs (e.g., creating a new > +cruft pack during each repacking operation including only unreachable objects > +which aren't already stored in an earlier cruft pack) is significantly more > +complicated to construct, and so aren't pursued here. The obvious drawback to > +the current implementation is that the entire cruft pack must be re-written from > +scratch. This doesn't mention the approach described in hash-function-transition.txt (and that's already implemented and has been in use for many years in JGit's DfsRepository). Does that mean you aren't aware of it? Thanks, Jonathan
On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote: > Sorry for the very slow review! I've mentioned a few times that this > overlaps in interesting ways with the gc mechanism described in > hash-function-transition.txt, so I'd like to compare and see how they > interact. Sorry for my equally-slow reply ;). I was on vacation last week and wasn't following the list closely. > > +Unreachable objects aren't removed immediately, since doing so could race with > > +an incoming push which may reference an object which is about to be deleted. > > +Instead, those unreachable objects are stored as loose object and stay that way > > +until they are older than the expiration window, at which point they are removed > > +by linkgit:git-prune[1]. > > + > > +Git must store these unreachable objects loose in order to keep track of their > > +per-object mtimes. > > It's worth noting that this behavior is already racy. That is because > when an unreachable object becomes newly reachable, we do not update > its mtime and the mtimes of every object reachable from it, so if it > then becomes transiently unreachable again then it can be wrongly > collected. Just to be clear, the race here only happens if the object in question becomes reachable _after_ a pruning GC determines its mtime. If that's the case, then the object will indeed be wrongly collected. This is consistent with the existing behavior (which is racy in the exact same way). (After re-reading what you wrote and my response, I think we are saying the exact same thing, but it doesn't hurt to think aloud). > > + > > +== Cruft packs > > + > > +A cruft pack eliminates the need for storing unreachable objects in a loose > > +state by including the per-object mtimes in a separate file alongside a single > > +pack containing all loose objects. > > Can this doc say a little about how "git prune" handles these files? > In particular, does a non cruft pack aware copy of Git (or JGit, > libgit2, etc) do the right thing or does it fight with this mechanism? > If the latter, do we have a repository extension (extensions.*) to > prevent that? I mentioned this in much more detail in [1], but the answer is that the cruft pack looks like any other pack, it just happens to have another metadata file (the .mtimes one) attached to it. So other implementations of Git should treat it as they would any other pack. Like I mentioned in [1], cruft packs were designed with the explicit goal of not requiring a repository extension. > > + 3. Write the pack out, along with a `.mtimes` file that records the per-object > > + timestamps. > > As a point of comparison, the design in hash-function-transition uses > a single timestamp for the whole pack. During read operations, objects > in a cruft pack are considered present; during writes, they are > considered _not present_ so that if we want to make a cruft object > newly present then we put a copy of it in a new pack. > > Advantage of the mtimes file approach: > - less duplication of storage: a revived object is only stored once, > in a cruft pack, and then the next gc can "graduate" it out of the > cruft pack and shrink the cruft pack > - less affect on non-gc Git code: writes don't need to know that any > cruft objects referenced need to be copied into a new pack > > Advantages of the mtime per cruft pack approach: > - easy expiration: once a cruft pack has reached its expiration date, > it can be deleted as a whole > - less I/O churn: a cruft pack stays as-is until combined into another > cruft pack or deleted. There is no frequently-modified mtimes file > associated to it > - informs the storage layer about what is likely to be accessed: cruft > packs can get filesystem attributes to put them in less-optimized > storage since they are likely to be less frequently read > > [...] The key advantage of cruft packs is that you can expire unreachable objects in piecemeal while still retaining the benefit of being able to de-duplicate cruft objects and store them packed against each other. > > +Notable alternatives to this design include: > > This doesn't mention the approach described in > hash-function-transition.txt (and that's already implemented and has > been in use for many years in JGit's DfsRepository). Does that mean > you aren't aware of it? Implementing the UNREACHABLE_GARBAGE concept from hash-function-transition.txt in cruft pack-terms would be equivalent to not writing the mtimes file at all. This follows from the fact that a pre-cruft packs implementation of Git considers a packed object's mtime to be the same as the pack it's contained in. (I'm deliberately avoiding any details from the h-f-t document regarding re-writing objects contained in a garbage pack here, since this is separate from the pack structure itself (and could easily be implemented on top of cruft packs)). So I'm not sure what the alternative we'd list would be, since it removes the key feature of the design of cruft packs. Thanks, Taylor [1]: https://lore.kernel.org/git/YiZMhuI%2FDdpvQ%2FED@nand.local/
Hi, Taylor Blau wrote: > On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote: >> Sorry for the very slow review! I've mentioned a few times that this >> overlaps in interesting ways with the gc mechanism described in >> hash-function-transition.txt, so I'd like to compare and see how they >> interact. > > Sorry for my equally-slow reply ;). I was on vacation last week and > wasn't following the list closely. No problem --- thanks for getting back to me. [...] > (After re-reading what you wrote and my response, I think we are saying > the exact same thing, but it doesn't hurt to think aloud). Great. Can the doc cover this? I think it would be helpful to make that easy to find for others with similar questions. If it's a matter of finding enough time to write some text, let me know and I can try to find some time to help. [...] >> Can this doc say a little about how "git prune" handles these files? >> In particular, does a non cruft pack aware copy of Git (or JGit, >> libgit2, etc) do the right thing or does it fight with this mechanism? >> If the latter, do we have a repository extension (extensions.*) to >> prevent that? > > I mentioned this in much more detail in [1], but the answer is that the > cruft pack looks like any other pack, it just happens to have another > metadata file (the .mtimes one) attached to it. So other implementations > of Git should treat it as they would any other pack. Like I mentioned in > [1], cruft packs were designed with the explicit goal of not requiring a > repository extension. Sorry, the above seems like it's answering a different question than I asked. The doc in Documentation/technical/ seems like a natural place to describe what semantics the new .mtimes file has, and I didn't find that there. Is there a different piece of documentation I should have been looking at? Can you tell me a little more about why we would want _not_ to have a repository format extension? To me, it seems like a fairly simple addition that would drastically reduce the cognitive overload for people considering making use of this feature. [...] > The key advantage of cruft packs is that you can expire unreachable > objects in piecemeal while still retaining the benefit of being able to > de-duplicate cruft objects and store them packed against each other. Can you say a little more about this? My experience with the similar feature in JGit is that it has been helpful to be able to expire a cruft pack altogether; since objects that became reachable around the same time get packed at the same time, it's not obvious to me what benefit this extra piecemeal capability brings. That doesn't mean the benefit doesn't exist, just that it seems like there's a piece of context I'm still missing. >>> +Notable alternatives to this design include: >> >> This doesn't mention the approach described in >> hash-function-transition.txt (and that's already implemented and has >> been in use for many years in JGit's DfsRepository). Does that mean >> you aren't aware of it? > > Implementing the UNREACHABLE_GARBAGE concept from > hash-function-transition.txt in cruft pack-terms would be equivalent to > not writing the mtimes file at all. This follows from the fact that a > pre-cruft packs implementation of Git considers a packed object's mtime > to be the same as the pack it's contained in. (I'm deliberately > avoiding any details from the h-f-t document regarding re-writing > objects contained in a garbage pack here, since this is separate from > the pack structure itself (and could easily be implemented on top of > cruft packs)). > > So I'm not sure what the alternative we'd list would be, since it > removes the key feature of the design of cruft packs. Sorry, I don't understand this answer either. Do you mean to say that JGit's DfsRepository does not in fact have a cruft packs like feature that is live in the wild? Or that that feature is equivalent to not having such a feature? Or something else? To be clear, I'm not trying to say that that's superior to what you've proposed here --- only that documenting the comparison would be useful. Puzzled, Jonathan
On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote: > Hi, > > Taylor Blau wrote: > > On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote: > > >> Sorry for the very slow review! I've mentioned a few times that this > >> overlaps in interesting ways with the gc mechanism described in > >> hash-function-transition.txt, so I'd like to compare and see how they > >> interact. > > > > Sorry for my equally-slow reply ;). I was on vacation last week and > > wasn't following the list closely. > > No problem --- thanks for getting back to me. > > [...] > > (After re-reading what you wrote and my response, I think we are saying > > the exact same thing, but it doesn't hurt to think aloud). > > Great. Can the doc cover this? I think it would be helpful to make > that easy to find for others with similar questions. I believe the doc covers this already, see the paragraph beginning with "Unreachable objects aren't removed immediately...". > >> Can this doc say a little about how "git prune" handles these files? > >> In particular, does a non cruft pack aware copy of Git (or JGit, > >> libgit2, etc) do the right thing or does it fight with this mechanism? > >> If the latter, do we have a repository extension (extensions.*) to > >> prevent that? > > > > I mentioned this in much more detail in [1], but the answer is that the > > cruft pack looks like any other pack, it just happens to have another > > metadata file (the .mtimes one) attached to it. So other implementations > > of Git should treat it as they would any other pack. Like I mentioned in > > [1], cruft packs were designed with the explicit goal of not requiring a > > repository extension. > > Sorry, the above seems like it's answering a different question than I > asked. The doc in Documentation/technical/ seems like a natural place > to describe what semantics the new .mtimes file has, and I didn't find > that there. Is there a different piece of documentation I should have > been looking at? Are you looking for a technical description of the mtimes file? If so, there is a section in Documentation/technical/pack-format.txt (added in "pack-mtimes: support reading .mtimes files") that explains this. > Can you tell me a little more about why we would want _not_ to have a > repository format extension? To me, it seems like a fairly simple > addition that would drastically reduce the cognitive overload for > people considering making use of this feature. There is no reason to prevent a pre-cruft packs version of Git from reading/writing a repository that uses cruft packs, since the two versions will still function as normal. Since there's no need to prevent the old version from interacting with a repository that has cruft packs, we wouldn't want to enforce an unnecessary boundary with an extension. > [...] > > The key advantage of cruft packs is that you can expire unreachable > > objects in piecemeal while still retaining the benefit of being able to > > de-duplicate cruft objects and store them packed against each other. > > Can you say a little more about this? My experience with the similar > feature in JGit is that it has been helpful to be able to expire a > cruft pack altogether; since objects that became reachable around the > same time get packed at the same time, it's not obvious to me what > benefit this extra piecemeal capability brings. > > That doesn't mean the benefit doesn't exist, just that it seems like > there's a piece of context I'm still missing. Expiring objects in piecemeal is somewhat interesting, but I think I was reaching a little too far when I said it was the "key benefit". It does have some nice properties, like being able to store cruft objects as deltas against other cruft objects which might get pruned at a different time (though, of course, you'll need to re-delta them in the case you do prune an object which is the base of another cruft object). But the issue with having multiple cruft packs is that the semantics get significantly more complicated. E.g., if you have an object represented in multiple cruft packs, which mtime do you use? If you want to prune it, you suddenly may have many packs you need to update and keep track of. > >>> +Notable alternatives to this design include: > >> > >> This doesn't mention the approach described in > >> hash-function-transition.txt (and that's already implemented and has > >> been in use for many years in JGit's DfsRepository). Does that mean > >> you aren't aware of it? > > > > Implementing the UNREACHABLE_GARBAGE concept from > > hash-function-transition.txt in cruft pack-terms would be equivalent to > > not writing the mtimes file at all. This follows from the fact that a > > pre-cruft packs implementation of Git considers a packed object's mtime > > to be the same as the pack it's contained in. (I'm deliberately > > avoiding any details from the h-f-t document regarding re-writing > > objects contained in a garbage pack here, since this is separate from > > the pack structure itself (and could easily be implemented on top of > > cruft packs)). > > > > So I'm not sure what the alternative we'd list would be, since it > > removes the key feature of the design of cruft packs. > > Sorry, I don't understand this answer either. Do you mean to say that > JGit's DfsRepository does not in fact have a cruft packs like feature > that is live in the wild? Or that that feature is equivalent to not > having such a feature? Or something else? > > To be clear, I'm not trying to say that that's superior to what you've > proposed here --- only that documenting the comparison would be > useful. I'm not familiar enough with JGit (or its DfsRepository class) to know how to answer this. I was comparing cruft packs to the UNREACHABLE_GARBAGE concept mentioned in the hash-function-transition doc, and noting the differences there. Thanks, Taylor
Hi, Taylor Blau wrote: > On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote: >> Great. Can the doc cover this? I think it would be helpful to make >> that easy to find for others with similar questions. > > I believe the doc covers this already, see the paragraph beginning with > "Unreachable objects aren't removed immediately...". Thanks. I just reread that section and it didn't say anything obvious about the race that continues to exist and whether cruft packs address it. [...] >> Sorry, the above seems like it's answering a different question than I >> asked. The doc in Documentation/technical/ seems like a natural place >> to describe what semantics the new .mtimes file has, and I didn't find >> that there. Is there a different piece of documentation I should have >> been looking at? > > Are you looking for a technical description of the mtimes file? If so, > there is a section in Documentation/technical/pack-format.txt (added in > "pack-mtimes: support reading .mtimes files") that explains this. I see --- is the idea that cruft-packs.txt means to refer to pack-format.txt for the details, and cruft-packs is an overview of some other, non-detail aspects? I just checked pack-format.txt and it didn't describe the semantics (what a Git implementation is expected to do when it sees an mtimes file). For example, in Documentation/technical/cruft-packs.txt, the kind of thing I'd expect to see is - what does an mtime value in the mtimes file represent? When is it meant to be updated? - what guarantees are present about when an object is safe to be pruned? [...] >> Can you tell me a little more about why we would want _not_ to have a >> repository format extension? To me, it seems like a fairly simple >> addition that would drastically reduce the cognitive overload for >> people considering making use of this feature. > >There is no reason to prevent a pre-cruft packs version of Git from > reading/writing a repository that uses cruft packs, since the two > versions will still function as normal. Since there's no need to prevent > the old version from interacting with a repository that has cruft packs, > we wouldn't want to enforce an unnecessary boundary with an extension. Does "function as normal" include in repository maintenance operations like "git maintenance", "git gc", and "git prune"? If so, this seems like something very useful to describe in the cruft-packs.txt document, since what happens when we bounce back and forth between old and new versions of Git operating on the same NFS mounted repository would not be obvious without such a discussion. I'm still interested in the _downsides_ of using a repository format extension. "There is no reason" is not a downside, unless you mean that it requires adding a line of code. :) The main downside I can imagine is that it prevents accessing the repository _that has enabled this feature_ with an older version of Git, but I (perhaps due to a failure of imagination) haven't put two and two together yet about when I would want to do so. [...] > Expiring objects in piecemeal is somewhat interesting, but I think I was > reaching a little too far when I said it was the "key benefit". It does > have some nice properties, like being able to store cruft objects as > deltas against other cruft objects which might get pruned at a different > time (though, of course, you'll need to re-delta them in the case you do > prune an object which is the base of another cruft object). > > But the issue with having multiple cruft packs is that the semantics get > significantly more complicated. E.g., if you have an object represented > in multiple cruft packs, which mtime do you use? If you want to prune > it, you suddenly may have many packs you need to update and keep track > of. Thanks for this explanation. In hash-function-transition.txt, I see "git gc" currently expels any unreachable objects it encounters in pack files to loose objects in an attempt to prevent a race when pruning them (in case another process is simultaneously writing a new object that refers to the about-to-be-deleted object). This leads to an explosion in the number of loose objects present and disk space usage due to the objects in delta form being replaced with independent loose objects. Worse, the race is still present for loose objects. Instead, "git gc" will need to move unreachable objects to a new packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see below). To avoid the race when writing new objects referring to an about-to-be-deleted object, code paths that write new objects will need to copy any objects from UNREACHABLE_GARBAGE packs that they refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects). UNREACHABLE_GARBAGE are then safe to delete if their creation time (as indicated by the file's mtime) is long enough ago. To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be combined under certain circumstances. [etc] So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE pack refers to when that pack was written and governs when that pack can be deleted. If an object is present in multiple packs, then newer packs with the object have a newer mtime and thus cause the object to be kept around for longer. [...] >> Sorry, I don't understand this answer either. Do you mean to say that >> JGit's DfsRepository does not in fact have a cruft packs like feature >> that is live in the wild? Or that that feature is equivalent to not >> having such a feature? Or something else? >> >> To be clear, I'm not trying to say that that's superior to what you've >> proposed here --- only that documenting the comparison would be >> useful. > > I'm not familiar enough with JGit (or its DfsRepository class) to know > how to answer this. I was comparing cruft packs to the > UNREACHABLE_GARBAGE concept mentioned in the hash-function-transition > doc, and noting the differences there. Thanks. I think there's some implied feedback about the documentation of UNREACHABLE_GARBAGE there, because if I understand then you're saying that it does not describe maintaining cruft packs. Perhaps a pointer to the particular sentence that led you to that conclusion would help. Sincerely, Jonathan
On Tue, Mar 22, 2022 at 04:04:53PM -0700, Jonathan Nieder wrote: > Hi, > > Taylor Blau wrote: > > On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote: > > >> Great. Can the doc cover this? I think it would be helpful to make > >> that easy to find for others with similar questions. > > > > I believe the doc covers this already, see the paragraph beginning with > > "Unreachable objects aren't removed immediately...". > > Thanks. I just reread that section and it didn't say anything obvious > about the race that continues to exist and whether cruft packs address > it. Yeah, there isn't an explicit "and cruft packs addresses the loose object explosion but does not address the race" sentence. I'm not opposed to adding something like that to clarify (though TBH, I would rather do it as a clean-up on top rather than send out a bazillion mostly-unchanged patches). > [...] > >> Sorry, the above seems like it's answering a different question than I > >> asked. The doc in Documentation/technical/ seems like a natural place > >> to describe what semantics the new .mtimes file has, and I didn't find > >> that there. Is there a different piece of documentation I should have > >> been looking at? > > > > Are you looking for a technical description of the mtimes file? If so, > > there is a section in Documentation/technical/pack-format.txt (added in > > "pack-mtimes: support reading .mtimes files") that explains this. > > I see --- is the idea that cruft-packs.txt means to refer to > pack-format.txt for the details, and cruft-packs is an overview of some > other, non-detail aspects? > > I just checked pack-format.txt and it didn't describe the semantics > (what a Git implementation is expected to do when it sees an mtimes > file). For example, in Documentation/technical/cruft-packs.txt, the > kind of thing I'd expect to see is Right; I have always considered the files in Documentation/technical to primarily be about the file format itself. > - what does an mtime value in the mtimes file represent? When is it > meant to be updated? > - what guarantees are present about when an object is safe to be > pruned? The cruft-packs.txt document covers these, though I think somewhat implicitly. Again, I'm not opposed to more clarification, but I again would like to do so on top. I think many of these are discussed within the threads above, but to answer your questions in order: - The mtime of an object in a cruft pack represents the last time that object was known to be reachable, and it's updated when generating a cruft pack or pruning. - The same guarantees are made in the cruft pack case as in the non-cruft case (i.e., "none", and so a grace period is recommended). > [...] > >> Can you tell me a little more about why we would want _not_ to have a > >> repository format extension? To me, it seems like a fairly simple > >> addition that would drastically reduce the cognitive overload for > >> people considering making use of this feature. > > > >There is no reason to prevent a pre-cruft packs version of Git from > > reading/writing a repository that uses cruft packs, since the two > > versions will still function as normal. Since there's no need to prevent > > the old version from interacting with a repository that has cruft packs, > > we wouldn't want to enforce an unnecessary boundary with an extension. > > Does "function as normal" include in repository maintenance operations > like "git maintenance", "git gc", and "git prune"? If so, this seems > like something very useful to describe in the cruft-packs.txt > document, since what happens when we bounce back and forth between old > and new versions of Git operating on the same NFS mounted repository > would not be obvious without such a discussion. Yes, all of those commands will simply ignore the .mtimes file and treat the unreachable objects as normal (where "normal" means in the exact same way as they currently do without cruft packs). I think adding a section that summarizes our discussion would be useful. > I'm still interested in the _downsides_ of using a repository format > extension. "There is no reason" is not a downside, unless you mean > that it requires adding a line of code. :) The main downside I can > imagine is that it prevents accessing the repository _that has enabled > this feature_ with an older version of Git, but I (perhaps due to a > failure of imagination) haven't put two and two together yet about > when I would want to do so. Sorry for not being clear; I meant: "There is no reason [to prohibit two versions of Git from interacting with each other when they are compatible to do so]". > [...] > > Expiring objects in piecemeal is somewhat interesting, but I think I was > > reaching a little too far when I said it was the "key benefit". It does > > have some nice properties, like being able to store cruft objects as > > deltas against other cruft objects which might get pruned at a different > > time (though, of course, you'll need to re-delta them in the case you do > > prune an object which is the base of another cruft object). > > > > But the issue with having multiple cruft packs is that the semantics get > > significantly more complicated. E.g., if you have an object represented > > in multiple cruft packs, which mtime do you use? If you want to prune > > it, you suddenly may have many packs you need to update and keep track > > of. > > Thanks for this explanation. In hash-function-transition.txt, I see > > "git gc" currently expels any unreachable objects it encounters in > pack files to loose objects in an attempt to prevent a race when > pruning them (in case another process is simultaneously writing a new > object that refers to the about-to-be-deleted object). This leads to > an explosion in the number of loose objects present and disk space > usage due to the objects in delta form being replaced with independent > loose objects. Worse, the race is still present for loose objects. > > Instead, "git gc" will need to move unreachable objects to a new > packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see > below). To avoid the race when writing new objects referring to an > about-to-be-deleted object, code paths that write new objects will > need to copy any objects from UNREACHABLE_GARBAGE packs that they > refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects). > UNREACHABLE_GARBAGE are then safe to delete if their creation time (as > indicated by the file's mtime) is long enough ago. > > To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be > combined under certain circumstances. [etc] > > So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE > pack refers to when that pack was written and governs when that pack > can be deleted. If an object is present in multiple packs, then newer > packs with the object have a newer mtime and thus cause the object to > be kept around for longer. That matches my understanding. Thanks, Taylor
On Tue, Mar 22, 2022 at 09:01:43PM -0400, Taylor Blau wrote: > > >> Can you tell me a little more about why we would want _not_ to have a > > >> repository format extension? To me, it seems like a fairly simple > > >> addition that would drastically reduce the cognitive overload for > > >> people considering making use of this feature. > > > > > >There is no reason to prevent a pre-cruft packs version of Git from > > > reading/writing a repository that uses cruft packs, since the two > > > versions will still function as normal. Since there's no need to prevent > > > the old version from interacting with a repository that has cruft packs, > > > we wouldn't want to enforce an unnecessary boundary with an extension. > > > > Does "function as normal" include in repository maintenance operations > > like "git maintenance", "git gc", and "git prune"? If so, this seems > > like something very useful to describe in the cruft-packs.txt > > document, since what happens when we bounce back and forth between old > > and new versions of Git operating on the same NFS mounted repository > > would not be obvious without such a discussion. > > Yes, all of those commands will simply ignore the .mtimes file and treat > the unreachable objects as normal (where "normal" means in the exact > same way as they currently do without cruft packs). I think adding a > section that summarizes our discussion would be useful. > > > I'm still interested in the _downsides_ of using a repository format > > extension. "There is no reason" is not a downside, unless you mean > > that it requires adding a line of code. :) The main downside I can > > imagine is that it prevents accessing the repository _that has enabled > > this feature_ with an older version of Git, but I (perhaps due to a > > failure of imagination) haven't put two and two together yet about > > when I would want to do so. > > Sorry for not being clear; I meant: "There is no reason [to prohibit > two versions of Git from interacting with each other when they are > compatible to do so]". Jonathan, myself, and others discussed this extensively in today's standup. To summarize Jonathan's point (as I think I severely misunderstood it before), if two writers are repacking a repository with unreachable objects. The following can happen: - $NEWGIT packs the repository and writes a cruft pack and .mtimes file. - $OLDGIT packs the repository, exploding unreachable objects from the cruft pack as loose, setting their mtimes to "now". This causes the repository to lose information about the unreachable mtimes, which would cause the repository to never prune objects (except for when`--unpack-unreachable=now` is passed). One approach (that Jonathan suggested) is to prevent the above situation by introducing a format extension, so $OLDGIT could not touch the repository. But this comes at a (in my view, significant) cost which is that $OLDGIT can't touch the repository _at all_. An extension would be desirable if cross-version interaction resulted in repository corruption, but this scenario does not lead to corruption at all. Another approach (courtesy Stolee, in an off-list discussion) is that we could introduce an optional extension available as an opt-in to prevent older versions of Git from interacting in a repository that contains cruft packs, but is not required to write them. A third approach (and probably my preferred direction) is to indicate clearly via a combination of updates to Documentation/cruft-packs.txt and the release notes that say something along the lines of: If you use are repacking a repository using both a pre- and post-cruft packs version of Git, please be aware that you will lose information about the mtimes of unreachable objects. I imagine that would probably be sufficient, but we could also introduce the opt-in extension as an easy alternative to avoid forcing an upgrade of Git. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > To summarize Jonathan's point (as I think I severely misunderstood it > before), if two writers are repacking a repository with unreachable > objects. The following can happen: > > - $NEWGIT packs the repository and writes a cruft pack and .mtimes > file. > > - $OLDGIT packs the repository, exploding unreachable objects from the > cruft pack as loose, setting their mtimes to "now". And if these repeat, alternating new and old versions of Git, we will keep refreshing the unreachable objects' mtimes forever. But once you stop using old versions of Git, perhaps in 3 release cycles or so, we'll eventually be able to purge them, right? > One approach (that Jonathan suggested) is to prevent the above situation > by introducing a format extension, so $OLDGIT could not touch the > repository. But this comes at a (in my view, significant) cost which is > that $OLDGIT can't touch the repository _at all_. An extension would be > desirable if cross-version interaction resulted in repository > corruption, but this scenario does not lead to corruption at all. A repository may not be in a healthy state, when tons of unreachable objects stay around forever, but it probably is a bit too harsh to call it "corrupt". > Another approach (courtesy Stolee, in an off-list discussion) is that we > could introduce an optional extension available as an opt-in to prevent > older versions of Git from interacting in a repository that contains > cruft packs, but is not required to write them. That smells too magic; let's not go there. > A third approach (and probably my preferred direction) is to indicate > clearly via a combination of updates to Documentation/cruft-packs.txt > and the release notes that say something along the lines of: > > If you use are repacking a repository using both a pre- and > post-cruft packs version of Git, please be aware that you will lose > information about the mtimes of unreachable objects. I do not quite see how it helps. After hearing "... will lose information about the mtimes ...", what concrete action can a user take? Or a sys-admin? It's not like use of cruft-pack is mandatory when you upgrade the new version of Git, right? Perhaps use of cruft-pack should be guarded behind a configuration variable so that users who might want to use mixed versions of Git will be protected against accidental use of new version of Git that introduces the forever-renewing untracked objects problem? Perhaps a configuration variable, repack.cruftPackEnabled, that is by default disabled, can be used to protect people who do not want to get into the "keep refreshing mtime" loop from using the cruft packs by mistake? repack.cruftPackEnabled can probably be part of the "experimental" feature set, if we think it is the direction in the future.
On Mon, Mar 28, 2022 at 01:55:43PM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > To summarize Jonathan's point (as I think I severely misunderstood it > > before), if two writers are repacking a repository with unreachable > > objects. The following can happen: > > > > - $NEWGIT packs the repository and writes a cruft pack and .mtimes > > file. > > > > - $OLDGIT packs the repository, exploding unreachable objects from the > > cruft pack as loose, setting their mtimes to "now". > > And if these repeat, alternating new and old versions of Git, we > will keep refreshing the unreachable objects' mtimes forever. > > But once you stop using old versions of Git, perhaps in 3 release > cycles or so, we'll eventually be able to purge them, right? As soon as all of the repackers understand cruft packs, yes. > > One approach (that Jonathan suggested) is to prevent the above situation > > by introducing a format extension, so $OLDGIT could not touch the > > repository. But this comes at a (in my view, significant) cost which is > > that $OLDGIT can't touch the repository _at all_. An extension would be > > desirable if cross-version interaction resulted in repository > > corruption, but this scenario does not lead to corruption at all. > > A repository may not be in a healthy state, when tons of unreachable > objects stay around forever, but it probably is a bit too harsh to > call it "corrupt". I agree, though I would note that this is no worse than the situation today, where unreachable-but-recent objects are already exploded as loose can already cause the kinds of issues that this series is designed to prevent. > > Another approach (courtesy Stolee, in an off-list discussion) is that we > > could introduce an optional extension available as an opt-in to prevent > > older versions of Git from interacting in a repository that contains > > cruft packs, but is not required to write them. > > That smells too magic; let's not go there. I'm not sure... if we did: --- 8< --- diff --git a/setup.c b/setup.c index 04ce33cdcd..fa54c9baa4 100644 --- a/setup.c +++ b/setup.c @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var, return EXTENSION_OK; + } else if (!strcmp(ext, "cruftpacks")) { + return EXTENSION_OK; } --- >8 --- but nothing more, then a hypothetical `extensions.cruftPacks` could be used to prevent older writers in a mixed version environment. But if you don't have or care about older versions of Git, you can avoid setting it altogether. The key bit is that we don't have a check along the lines of "only allow writing a cruft pack when extensions.cruftPacks" is set, so it's opt-in as far as the new code is concerned. > > A third approach (and probably my preferred direction) is to indicate > > clearly via a combination of updates to Documentation/cruft-packs.txt > > and the release notes that say something along the lines of: > > > > If you use are repacking a repository using both a pre- and > > post-cruft packs version of Git, please be aware that you will lose > > information about the mtimes of unreachable objects. > > I do not quite see how it helps. After hearing "... will lose > information about the mtimes ...", what concrete action can a user > take? Or a sys-admin? > > It's not like use of cruft-pack is mandatory when you upgrade the > new version of Git, right? Perhaps use of cruft-pack should be > guarded behind a configuration variable so that users who might want > to use mixed versions of Git will be protected against accidental > use of new version of Git that introduces the forever-renewing > untracked objects problem? I don't think we would have much to offer a user in that case; if the mtimes are gone, then I couldn't think of anything to bring them back outside of setting them manually. But cruft packs are already guarded in two places: - `git repack` won't write a cruft pack unless given the `--cruft` flag (i.e., `git repack -A` doesn't suddenly start generating cruft packs upon upgrade). - `git gc` won't write cruft packs unless the `gc.cruftPacks` configuration is set, or `--cruft` is given as a flag. I'd be curious what Jonathan and others think of that approach (which, to be clear, is what this series already implements). We could make it clear to say: If you have mixed versions of Git which both repack a repository (either manually or by auto-GC / background maintenance), consider leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a command-line argument to `git repack` and `git gc`, since doing so can lead to [...] > Perhaps a configuration variable, repack.cruftPackEnabled, that is > by default disabled, can be used to protect people who do not want > to get into the "keep refreshing mtime" loop from using the cruft > packs by mistake? repack.cruftPackEnabled can probably be part of > the "experimental" feature set, if we think it is the direction in > the future. I'd probably want to leave `-A` separate from `--cruft`, since something about setting `repack.cruftPackEnabled` having the effect of causing `-A` to produce a cruft pack feels strange to me. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > I'm not sure... if we did: > > --- 8< --- > > diff --git a/setup.c b/setup.c > index 04ce33cdcd..fa54c9baa4 100644 > --- a/setup.c > +++ b/setup.c > @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var, > return EXTENSION_OK; > + } else if (!strcmp(ext, "cruftpacks")) { > + return EXTENSION_OK; > } > > --- >8 --- > > but nothing more, then a hypothetical `extensions.cruftPacks` could be > used to prevent older writers in a mixed version environment. But if you > don't have or care about older versions of Git, you can avoid setting it > altogether. Smells like "unsafe by default, but you can opt into safety", which is backwards, isn't it? >> I do not quite see how it helps. After hearing "... will lose >> information about the mtimes ...", what concrete action can a user >> take? Or a sys-admin? >> >> It's not like use of cruft-pack is mandatory when you upgrade the >> new version of Git, right? Perhaps use of cruft-pack should be >> guarded behind a configuration variable so that users who might want >> to use mixed versions of Git will be protected against accidental >> use of new version of Git that introduces the forever-renewing >> untracked objects problem? > > I don't think we would have much to offer a user in that case; if the > mtimes are gone, then I couldn't think of anything to bring them back > outside of setting them manually. Yes, so rambling about losing mtimes in documentation or release notes would not help users all that much. Let's not do that. > But cruft packs are already guarded in two places: > > - `git repack` won't write a cruft pack unless given the `--cruft` > flag (i.e., `git repack -A` doesn't suddenly start generating cruft > packs upon upgrade). > > - `git gc` won't write cruft packs unless the `gc.cruftPacks` > configuration is set, or `--cruft` is given as a flag. Hmph, OK. So individuals can sort-of protect from hurting themselves by refraining from running these with --cruft or writing --cruft in their maintenance scripts. An organization that wants to let the more adventurous types to early opt-in can prepare two versions of the maintenance scripts they distribute to their users, one with and the other without --cruft, and use the mechanism they use for gradual rollouts to control the population. Perhaps that would make sufficient protection? I dunno. Jonathan, what do you think? > I'd be curious what Jonathan and others think of that approach (which, > to be clear, is what this series already implements). We could make it > clear to say: > > If you have mixed versions of Git which both repack a repository > (either manually or by auto-GC / background maintenance), consider > leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a > command-line argument to `git repack` and `git gc`, since doing so > can lead to [...] That message is (depending on what comes in [...]) much more helpful than just throwing a word "mtime" out and letting the reader figure out the rest ;-)
On Tue, Mar 29, 2022 at 08:59:24AM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > I'm not sure... if we did: > > > > --- 8< --- > > > > diff --git a/setup.c b/setup.c > > index 04ce33cdcd..fa54c9baa4 100644 > > --- a/setup.c > > +++ b/setup.c > > @@ -565,2 +565,4 @@ static enum extension_result handle_extension(const char *var, > > return EXTENSION_OK; > > + } else if (!strcmp(ext, "cruftpacks")) { > > + return EXTENSION_OK; > > } > > > > --- >8 --- > > > > but nothing more, then a hypothetical `extensions.cruftPacks` could be > > used to prevent older writers in a mixed version environment. But if you > > don't have or care about older versions of Git, you can avoid setting it > > altogether. > > Smells like "unsafe by default, but you can opt into safety", which > is backwards, isn't it? I see it a little differently. The default (not writing cruft packs at all) is safe, even in a mixed-version environment. If a user (a) wants to use cruft packs, and (b) has older versions of Git also gc'ing the repository, and (c) can't get rid of them, _then_ an opt-in extension would make it impossible for those older versions to interact with the repository. I still can't shake the feeling that this is a pretty fringe and timing-dependent scenario, which at worst keeps too many unreachable objects around. But I think this in conjunction with the already opt-in nature of cruft packs would be a nice way to create safeguards for the situation Jonathan described. There may be a simpler way, but I'm not sure I see it (i.e., if you control whether or not `--cruft` is passed when doing maintenance with newer versions of Git, but not whether older versions are running around doing their own maintenance, then an extension would be necessary to lock the old versions out). > > But cruft packs are already guarded in two places: > > > > - `git repack` won't write a cruft pack unless given the `--cruft` > > flag (i.e., `git repack -A` doesn't suddenly start generating cruft > > packs upon upgrade). > > > > - `git gc` won't write cruft packs unless the `gc.cruftPacks` > > configuration is set, or `--cruft` is given as a flag. > > Hmph, OK. So individuals can sort-of protect from hurting > themselves by refraining from running these with --cruft or writing > --cruft in their maintenance scripts. An organization that wants to > let the more adventurous types to early opt-in can prepare two > versions of the maintenance scripts they distribute to their users, > one with and the other without --cruft, and use the mechanism they > use for gradual rollouts to control the population. Perhaps that > would make sufficient protection? I dunno. > > Jonathan, what do you think? I'm confused: if newer versions of Git are writing cruft packs, then having the older versions gc'ing in the same repository runs into the same scenario Jonathan originally describes. The thing I think Jonathan seeks to prevent is older versions of Git gc'ing a repo that has cruft packs. I think I may need you to clarify a little, sorry :-(. > > I'd be curious what Jonathan and others think of that approach (which, > > to be clear, is what this series already implements). We could make it > > clear to say: > > > > If you have mixed versions of Git which both repack a repository > > (either manually or by auto-GC / background maintenance), consider > > leaving `gc.cruftPacks` unset and avoiding passing `--cruft` as a > > command-line argument to `git repack` and `git gc`, since doing so > > can lead to [...] > > That message is (depending on what comes in [...]) much more helpful > than just throwing a word "mtime" out and letting the reader figure > out the rest ;-) Yes, totally agreed. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > The thing I think Jonathan seeks to prevent is older versions of Git > gc'ing a repo that has cruft packs. I think I may need you to clarify a > little, sorry :-(. By making controlled rollout of the use of "--cruft" option (and the assumption here is that a large organization setting people do not manually say "gc --cruft", and they can ship their maintenance scripts that may be run via cron or whatever with and without "--cruft"), you can control the number of repositories that can potentially see older versions of Git running gc on with cruft packs. Those users, for whom it is not their turn to start using "--cruft" enabled version of the script, will not have cruft packs, so it does not matter if they keep an older version of Git somewhere hidden in a hermetic build of an IDE that bundles Git and gc kicks in for them.
On Wed, Mar 30, 2022 at 06:37:54AM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > The thing I think Jonathan seeks to prevent is older versions of Git > > gc'ing a repo that has cruft packs. I think I may need you to clarify a > > little, sorry :-(. > > By making controlled rollout of the use of "--cruft" option (and the > assumption here is that a large organization setting people do not > manually say "gc --cruft", and they can ship their maintenance > scripts that may be run via cron or whatever with and without > "--cruft"), you can control the number of repositories that can > potentially see older versions of Git running gc on with cruft > packs. Those users, for whom it is not their turn to start using > "--cruft" enabled version of the script, will not have cruft packs, > so it does not matter if they keep an older version of Git somewhere > hidden in a hermetic build of an IDE that bundles Git and gc kicks > in for them. Ahh, OK. Thanks for explaining: this is what I was pretty sure you meant, but I wanted to make sure before agreeing to it. Yes, this solution amounts to: "if you have mixed-versions of Git mutually gc'ing a repository, then use the same rollout method used for controlling Git itself to guard when to start creating cruft packs". I would be very eager to hear if this works for Jonathan's case. It should do the trick, I'd think. Thanks, Taylor
diff --git a/Documentation/Makefile b/Documentation/Makefile index ed656db2ae..0b01c9408e 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution TECH_DOCS += MyFirstObjectWalk TECH_DOCS += SubmittingPatches TECH_DOCS += technical/bundle-format +TECH_DOCS += technical/cruft-packs TECH_DOCS += technical/hash-function-transition TECH_DOCS += technical/http-protocol TECH_DOCS += technical/index-format diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt new file mode 100644 index 0000000000..2c3c5d93f8 --- /dev/null +++ b/Documentation/technical/cruft-packs.txt @@ -0,0 +1,97 @@ += Cruft packs + +The cruft packs feature offer an alternative to Git's traditional mechanism of +removing unreachable objects. This document provides an overview of Git's +pruning mechanism, and how a cruft pack can be used instead to accomplish the +same. + +== Background + +To remove unreachable objects from your repository, Git offers `git repack -Ad` +(see linkgit:git-repack[1]). Quoting from the documentation: + +[quote] +[...] unreachable objects in a previous pack become loose, unpacked objects, +instead of being left in the old pack. [...] loose unreachable objects will be +pruned according to normal expiry rules with the next 'git gc' invocation. + +Unreachable objects aren't removed immediately, since doing so could race with +an incoming push which may reference an object which is about to be deleted. +Instead, those unreachable objects are stored as loose object and stay that way +until they are older than the expiration window, at which point they are removed +by linkgit:git-prune[1]. + +Git must store these unreachable objects loose in order to keep track of their +per-object mtimes. If these unreachable objects were written into one big pack, +then either freshening that pack (because an object contained within it was +re-written) or creating a new pack of unreachable objects would cause the pack's +mtime to get updated, and the objects within it would never leave the expiration +window. Instead, objects are stored loose in order to keep track of the +individual object mtimes and avoid a situation where all cruft objects are +freshened at once. + +This can lead to undesirable situations when a repository contains many +unreachable objects which have not yet left the grace period. Having large +directories in the shards of `.git/objects` can lead to decreased performance in +the repository. But given enough unreachable objects, this can lead to inode +starvation and degrade the performance of the whole system. Since we +can never pack those objects, these repositories often take up a large amount of +disk space, since we can only zlib compress them, but not store them in delta +chains. + +== Cruft packs + +A cruft pack eliminates the need for storing unreachable objects in a loose +state by including the per-object mtimes in a separate file alongside a single +pack containing all loose objects. + +A cruft pack is written by `git repack --cruft` when generating a new pack. +linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft` +is a classic all-into-one repack, meaning that everything in the resulting pack is +reachable, and everything else is unreachable. Once written, the `--cruft` +option instructs `git repack` to generate another pack containing only objects +not packed in the previous step (which equates to packing all unreachable +objects together). This progresses as follows: + + 1. Enumerate every object, marking any object which is (a) not contained in a + kept-pack, and (b) whose mtime is within the grace period as a traversal + tip. + + 2. Perform a reachability traversal based on the tips gathered in the previous + step, adding every object along the way to the pack. + + 3. Write the pack out, along with a `.mtimes` file that records the per-object + timestamps. + +This mode is invoked internally by linkgit:git-repack[1] when instructed to +write a cruft pack. Crucially, the set of in-core kept packs is exactly the set +of packs which will not be deleted by the repack; in other words, they contain +all of the repository's reachable objects. + +When a repository already has a cruft pack, `git repack --cruft` typically only +adds objects to it. An exception to this is when `git repack` is given the +`--cruft-expiration` option, which allows the generated cruft pack to omit +expired objects instead of waiting for linkgit:git-gc[1] to expire those objects +later on. + +It is linkgit:git-gc[1] that is typically responsible for removing expired +unreachable objects. + +== Alternatives + +Notable alternatives to this design include: + + - The location of the per-object mtime data, and + - Storing unreachable objects in multiple cruft packs. + +On the location of mtime data, a new auxiliary file tied to the pack was chosen +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain +support for optional chunks of data, it may make sense to consolidate the +`.mtimes` format into the `.idx` itself. + +Storing unreachable objects among multiple cruft packs (e.g., creating a new +cruft pack during each repacking operation including only unreachable objects +which aren't already stored in an earlier cruft pack) is significantly more +complicated to construct, and so aren't pursued here. The obvious drawback to +the current implementation is that the entire cruft pack must be re-written from +scratch.
Create a technical document to explain cruft packs. It contains a brief overview of the problem, some background, details on the implementation, and a couple of alternative approaches not considered here. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- Documentation/Makefile | 1 + Documentation/technical/cruft-packs.txt | 97 +++++++++++++++++++++++++ 2 files changed, 98 insertions(+) create mode 100644 Documentation/technical/cruft-packs.txt