Message ID | cover.1638224692.git.me@ttaylorr.com (mailing list archive) |
---|---|
Headers | show |
Series | cruft packs | expand |
Taylor Blau <me@ttaylorr.com> writes: > This series implements "cruft packs", a pack which stores accumulated > unreachable objects, along with a new ".mtimes" file which tracks each > object's last known modification time. Let me rephrase the above to test my understanding, since I need to write a summary for the "What's cooking" report. Instead of leaving unreachable objects in loose form when packing, or ejecting them into loose form when repacking, gather them in a packfile with an auxiliary file that records the last-use time of these objects. That way, we do not have to waste so many inodes for loose objects that is not likely to be used, which feels like a win. > - The final patch handles object freshening for objects stored in a > cruft pack. I am not going to read it today, but I think this is the most interesting part of the series. Instead of using mtime of an individual loose object file, we'd need to record the time of last use for each object in a pack. Stepping back a bit, I do not see how we can get away without doing the same .mtimes file for non-cruft packs. An object that is in a non-cruft pack may be referenced immediately after the repack that created the pack, but the ref that was referencing the object may have gone away and now the pack is a month old. If we were to repack the object, we do not know when was the last time the object was reachable from any of the refs and index entries (collectively known as anchor points). Of course, recording all mtimes for all packed objects all the time would involve quite a lot of overhead. I am guessing (I will not spend time today to figure it out myself) that .mtimes update at runtime will happen in-place (i.e. via seek(2)+write(2), or pwrite()), and I wonder what the safety concern would be (which is the primary reason why we tend not to do in-place updates but recreate-and-rename updates). Thanks for working on such an interesting topic.
On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > This series implements "cruft packs", a pack which stores accumulated > > unreachable objects, along with a new ".mtimes" file which tracks each > > object's last known modification time. > > Let me rephrase the above to test my understanding, since I need to > write a summary for the "What's cooking" report. > > Instead of leaving unreachable objects in loose form when packing, > or ejecting them into loose form when repacking, gather them in a > packfile with an auxiliary file that records the last-use time of > these objects. Exactly. Thanks for such a concise and accurate description of the topic. > That way, we do not have to waste so many inodes for loose objects > that is not likely to be used, which feels like a win. Yes. This had historically been a problem for GitHub. We don't automatically prune unreachable objects during repacking, but sometimes customers will ask us to do it on their behalf (if, for example, they accidentally pushed sensitive information to us, and then force-pushed over it). But occasionally we'd get bitten by exploding many years of loose objects (because we used to freshen packfiles too aggressively when moving them around). We've been running this series in production for the past few months, and it's been a huge relief on the folks who typically run these pruning GCs. > > - The final patch handles object freshening for objects stored in a > > cruft pack. > > I am not going to read it today, but I think this is the most > interesting part of the series. Instead of using mtime of an > individual loose object file, we'd need to record the time of > last use for each object in a pack. > > Stepping back a bit, I do not see how we can get away without doing > the same .mtimes file for non-cruft packs. An object that is in a > non-cruft pack may be referenced immediately after the repack that > created the pack, but the ref that was referencing the object may > have gone away and now the pack is a month old. If we were to > repack the object, we do not know when was the last time the object > was reachable from any of the refs and index entries (collectively > known as anchor points). In that situation, we would use the mtime of the pack which contains that object itself as a proxy (or the mtime of a loose copy of the object, if it is more recent). That isn't perfect, as you note, since if the pack isn't otherwise freshened, we'd consider that object to be a month old, even if the reference pointing at it was deleted a mere second ago. I can't recall if Peff and I talked about this off-list, but I have a vague sense we probably did (and I forgot the details). > Of course, recording all mtimes for all > packed objects all the time would involve quite a lot of overhead. > I am guessing (I will not spend time today to figure it out myself) > that .mtimes update at runtime will happen in-place (i.e. via > seek(2)+write(2), or pwrite()), and I wonder what the safety concern > would be (which is the primary reason why we tend not to do in-place > updates but recreate-and-rename updates). Yeah, this series avoids doing an in-place update, and similarly avoids recreating the entire .mtimes file before moving into place. Instead, freshening an object stored in a cruft pack takes place by rewriting a copy of the object loose, since we consider an object's mtime to be the most recent of (a) what's in the .mtimes file, (b) the mtime of the containing pack, and (c) the mtime of a loose copy (if one exists). It can be wasteful, but in practice "resurrecting" an object in a cruft pack is pretty rare, so on balance it ends up costing less work to do. > Thanks for working on such an interesting topic. I'm glad to have piqued your interest. Taylor
On Fri, Dec 03, 2021 at 03:08:00PM -0500, Taylor Blau wrote: > On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote: > > Stepping back a bit, I do not see how we can get away without doing > > the same .mtimes file for non-cruft packs. An object that is in a > > non-cruft pack may be referenced immediately after the repack that > > created the pack, but the ref that was referencing the object may > > have gone away and now the pack is a month old. If we were to > > repack the object, we do not know when was the last time the object > > was reachable from any of the refs and index entries (collectively > > known as anchor points). > > In that situation, we would use the mtime of the pack which contains > that object itself as a proxy (or the mtime of a loose copy of the > object, if it is more recent). > > That isn't perfect, as you note, since if the pack isn't otherwise > freshened, we'd consider that object to be a month old, even if the > reference pointing at it was deleted a mere second ago. > > I can't recall if Peff and I talked about this off-list, but I have a > vague sense we probably did (and I forgot the details). Maybe I can rephrase the problem as being orthogonal to what we're addressing here. Modification time can be a useful-ish proxy for "last referenced time", but they are ultimately different. Forgetting cruft packs for a moment, our behavior today in that situation would be to prune the object if our grace period did not cover the time in which the pack was last modified. So if the pack was a month old, the grace period was two weeks, but the reference pointing at some object in that pack was deleted only a second before starting a pruning GC, we'd prune that object before this series (just as we would do the same thing with this series). Aside from pruning, what happens to the value recorded in the .mtimes file is more interesting. For the case you're talking about, we'll err on the side of newer mtimes (either the original timestamp is recorded, or some future time when the containing pack was rewritten). But the more interesting case is when an object becomes re-referenced. Since the ref-update doesn't cause the object to be rewritten, we wouldn't change the timestamp. Anyway, both of these are still independent from cruft packs, so we're not changing the status quo there, I don't think. Thanks, Taylor