Message ID | cover.1614047097.git.me@ttaylorr.com |
---|---|
Series | repack: support repacking into a geometric sequence |
On Mon, Feb 22, 2021 at 09:24:59PM -0500, Taylor Blau wrote:
> Here's a very lightly modified version of v3 of mine and Peff's series
> to add a new 'git repack --geometric' mode. Almost nothing has changed
> since last time, with the exception of:
>
>   - Packs listed over standard input to 'git pack-objects --stdin-packs'
>     are sorted in descending mtime order (and objects are strung
>     together in pack order as before) so that objects are laid out
>     roughly newest-to-oldest in the resulting pack.
>
>   - Swapped the order of two paragraphs in patch 5 to make the perf
>     results clearer.
>
>   - Mention '--unpacked' specifically in the documentation for 'git
>     repack --geometric'.
>
>   - Typo fixes.

Thanks, this all looks great to me.

-Peff
Taylor Blau <me@ttaylorr.com> writes:
> ++ /*
> ++ * order packs by descending mtime so that objects are laid out
> ++ * roughly as newest-to-oldest
> ++ */
> + if (a->mtime < b->mtime)
> +         return 1;
> ++ else if (b->mtime < a->mtime)
> ++         return -1;
> + else
> +         return 0;

I think this strategy makes sense when this repack using this new
feature is run for the first time in a repository that acquired many
packs over time. I am not sure what happens after the feature is
used a few times---it won't always be the newest sets of packs that
will be rewritten, but sometimes older ones are also coalesced, and
when that happens the resulting pack that consists primarily of older
objects would end up having a more recent timestamp, no?

Even then, I do agree that newer to older would be beneficial most
of the time, so this is of course not an objection against this
particular sort order.
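For readers skimming the quoted interdiff, here is the same comparator written out as a minimal, self-contained sketch. The struct and function names are illustrative assumptions made up for this example, not the names used in the series; the real code sorts the pack structures fed to 'git pack-objects --stdin-packs'.

```c
#include <stdlib.h>
#include <time.h>

/* Illustrative stand-in for whatever per-pack metadata is being sorted. */
struct pack_entry {
	time_t mtime;
};

/*
 * qsort()-style comparator: packs with newer mtimes sort earlier, so the
 * resulting pack lays objects out roughly newest-to-oldest.
 */
static int cmp_pack_mtime_desc(const void *va, const void *vb)
{
	const struct pack_entry *a = va;
	const struct pack_entry *b = vb;

	if (a->mtime < b->mtime)
		return 1;
	else if (b->mtime < a->mtime)
		return -1;
	else
		return 0;
}
```

With an array of such entries, qsort(packs, nr, sizeof(*packs), cmp_pack_mtime_desc) produces the newest-first ordering that Junio is commenting on.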
On Mon, Feb 22, 2021 at 11:43:22PM -0800, Junio C Hamano wrote:
> > ++ /*
> > ++ * order packs by descending mtime so that objects are laid out
> > ++ * roughly as newest-to-oldest
> > ++ */
> > + if (a->mtime < b->mtime)
> > +         return 1;
> > ++ else if (b->mtime < a->mtime)
> > ++         return -1;
> > + else
> > +         return 0;
>
> I think this strategy makes sense when this repack using this new
> feature is run for the first time in a repository that acquired many
> packs over time. I am not sure what happens after the feature is
> used a few times---it won't always be the newest sets of packs that
> will be rewritten, but sometimes older ones are also coalesced, and
> when that happens the resulting pack that consists primarily of older
> objects would end up having a more recent timestamp, no?

Yeah, this is definitely a heuristic that can get out of sync with
reality. I think in general if you have base pack A and somebody pushes
up B, C, and D in sequence, we're likely to roll up a single DBC (in
that order) pack. Further pushes E, F, G would have newer mtimes. So we
might get GFEDBC directly. Or we might get GFE and DBC, but the former
would still have a newer mtime, so we'd create GFEDBC on the next run.

The issues come from:

  - we are deciding what to roll up based on size. A big push might not
    get rolled up immediately, putting it out-of-sync with the rest of
    the rollups.

  - we are happy to manipulate pack mtimes under the hood as part of
    the freshen_*() code.

I think you probably wouldn't want to use this roll-up strategy all the
time (even though in theory it would eventually roll up to a single
good pack), just because it is based on heuristics like this. You'd
want to occasionally run a "real" repack that does a full traversal,
possibly pruning objects, etc. And that's how we plan to use it at
GitHub.

I don't remember how much of the root problem we've discussed on-list,
but the crux of it is: per-object costs including traversal can get
really high on big repositories. Our shared-storage repo for all
torvalds/linux forks is on the order of 45M objects, and some companies
with large and active private repositories are close to that.
Traversing the object graph takes 15+ minutes (plus another 15 for
delta island marking). For busy repositories, by the time you finish
repacking, it's time to start again. :)

> Even then, I do agree that newer to older would be beneficial most
> of the time, so this is of course not an objection against this
> particular sort order.

So yeah. I consider this best-effort for sure, and I think this sort
order is the best we can do without traversing. OTOH, we _do_ actually
do a partial traversal in this latest version of the series. We could
use that to impact the final write order. It doesn't necessarily hit
every object, though, so we'd still want to fall back on this pack
ordering heuristic.

I'm content to punt on that work for now, and leave it for a future
series after we see how this heuristic performs in practice.

-Peff
On Tuesday, February 23, 2021 1:44:05 PM MST Jeff King wrote:
> On Mon, Feb 22, 2021 at 11:43:22PM -0800, Junio C Hamano wrote:
> > > ++ /*
> > > ++ * order packs by descending mtime so that objects are laid out
> > > ++ * roughly as newest-to-oldest
> > > ++ */
> > > + if (a->mtime < b->mtime)
> > > +         return 1;
> > > ++ else if (b->mtime < a->mtime)
> > > ++         return -1;
> > > + else
> > > +         return 0;
> >
> > I think this strategy makes sense when this repack using this new
> > feature is run for the first time in a repository that acquired many
> > packs over time. I am not sure what happens after the feature is
> > used a few times---it won't always be the newest sets of packs that
> > will be rewritten, but sometimes older ones are also coalesced, and
> > when that happens the resulting pack that consists primarily of older
> > objects would end up having a more recent timestamp, no?
>
> Yeah, this is definitely a heuristic that can get out of sync with
> reality. I think in general if you have base pack A and somebody pushes
> up B, C, and D in sequence, we're likely to roll up a single DBC (in
> that order) pack. Further pushes E, F, G would have newer mtimes. So we
> might get GFEDBC directly. Or we might get GFE and DBC, but the former
> would still have a newer mtime, so we'd create GFEDBC on the next run.
>
> The issues come from:
>
>   - we are deciding what to roll up based on size. A big push might not
>     get rolled up immediately, putting it out-of-sync with the rest of
>     the rollups.

Would it make sense to somehow detect all new packs since the last
rollup and always include them in the rollup no matter what their size?
That is one thing that my git-exproll script did. One of the main
reasons to do this was because newer packs tended to look big (I was
using bytes to determine size), and newer packs were often bigger on
disk compared to other packs with similar objects in them (I think you
suggested this was due to the thickening of packs on receipt). Maybe
roll up all packs with a timestamp "new enough", no matter how big they
are?

-Martin
On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> Would it make sense to somehow detect all new packs since the last rollup
> and always include them in the rollup no matter what their size? That is
> one thing that my git-exproll script did.

I'm certainly not opposed, and this could certainly be done in an
additive way (i.e., after this series). I think the current approach has
nice properties, but I could also see "roll-up all packs that have
mtimes after xyz timestamp" being useful.

It would even be possible to reuse a lot of the geometric repack
machinery. Having a separate path to arrange packs by their mtimes and
determine the "split" at the pack whose mtime is nearest the provided one
would do exactly what you want.

(As a side-note, reading the original threads about your git-exproll was
quite humbling, since it turns out all of the problems I thought were
hard had already been discussed eight years ago!)

Thanks,
Taylor
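To make that idea concrete, here is a rough sketch of how such a split might be found, assuming packs already sorted by ascending mtime. Everything here (type, field, and function names) is hypothetical for illustration and is not part of the series.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical per-pack record; the real machinery has its own types. */
struct pack_entry {
	time_t mtime;
};

/*
 * Given packs sorted by ascending mtime, return the index of the first
 * pack at or after the cutoff. Packs from that index onward are the
 * "new enough" ones that would be rolled up together; earlier packs
 * are left untouched.
 */
static size_t mtime_split(const struct pack_entry *packs, size_t nr,
			  time_t cutoff)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (packs[i].mtime >= cutoff)
			break;
	return i;
}
```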
On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> > Yeah, this is definitely a heuristic that can get out of sync with
> > reality. I think in general if you have base pack A and somebody pushes
> > up B, C, and D in sequence, we're likely to roll up a single DBC (in
> > that order) pack. Further pushes E, F, G would have newer mtimes. So we
> > might get GFEDBC directly. Or we might get GFE and DBC, but the former
> > would still have a newer mtime, so we'd create GFEDBC on the next run.
> >
> > The issues come from:
> >
> >   - we are deciding what to roll up based on size. A big push might not
> >     get rolled up immediately, putting it out-of-sync with the rest of
> >     the rollups.
>
> Would it make sense to somehow detect all new packs since the last rollup
> and always include them in the rollup no matter what their size? That is
> one thing that my git-exproll script did. One of the main reasons to do
> this was because newer packs tended to look big (I was using bytes to
> determine size), and newer packs were often bigger on disk compared to
> other packs with similar objects in them (I think you suggested this was
> due to the thickening of packs on receipt). Maybe roll up all packs with
> a timestamp "new enough", no matter how big they are?

That works against the "geometric" part of the strategy, which is trying
to roll up in a sequence that is amortized-linear. I.e., we are not
always rolling up everything outside of the base pack, but trying to
roll up little into medium, and then eventually medium into large. If
you roll up things that are "too big", then you end up rewriting the
bytes more often, and your amount of work becomes super-linear.

Now whether that matters all that much or not is perhaps another
discussion. The current strategy is mostly to repack all-into-one with
no base, which is the worst possible case. So just about any rollup
strategy will be an improvement. ;)

-Peff
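As a rough illustration of the "geometric" cutoff described here (a simplified sketch, not the logic the series actually implements), one can walk a size-sorted list of packs and stop at the first pack that is already at least `factor` times bigger than everything accumulated below it:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * counts[] holds per-pack object counts, sorted ascending. Returns the
 * index of the first pack to keep; packs below that index get rolled up
 * into one. Keeping the rolled-up mass at most 1/factor of the smallest
 * surviving pack bounds the bytes rewritten per run, which is what keeps
 * the total work roughly amortized-linear.
 */
static size_t geometric_split(const uint64_t *counts, size_t nr,
			      uint64_t factor)
{
	uint64_t rolled_up = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (rolled_up && counts[i] >= factor * rolled_up)
			break;	/* pack i and everything above it survive */
		rolled_up += counts[i];
	}
	return i;	/* i == nr means everything gets rolled up */
}
```

The series' real split logic is more careful than this sketch (for instance, the packs it keeps must themselves respect the geometric factor), but the shape of the cutoff is similar.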
On Tuesday, February 23, 2021 3:15:12 PM MST Jeff King wrote:
> On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> > > Yeah, this is definitely a heuristic that can get out of sync with
> > > reality. I think in general if you have base pack A and somebody pushes
> > > up B, C, and D in sequence, we're likely to roll up a single DBC (in
> > > that order) pack. Further pushes E, F, G would have newer mtimes. So we
> > > might get GFEDBC directly. Or we might get GFE and DBC, but the former
> > > would still have a newer mtime, so we'd create GFEDBC on the next run.
> > >
> > > The issues come from:
> > >
> > >   - we are deciding what to roll up based on size. A big push might not
> > >     get rolled up immediately, putting it out-of-sync with the rest of
> > >     the rollups.
> >
> > Would it make sense to somehow detect all new packs since the last rollup
> > and always include them in the rollup no matter what their size? That is
> > one thing that my git-exproll script did. One of the main reasons to do
> > this was because newer packs tended to look big (I was using bytes to
> > determine size), and newer packs were often bigger on disk compared to
> > other packs with similar objects in them (I think you suggested this was
> > due to the thickening of packs on receipt). Maybe roll up all packs with
> > a timestamp "new enough", no matter how big they are?
>
> That works against the "geometric" part of the strategy, which is trying
> to roll up in a sequence that is amortized-linear. I.e., we are not
> always rolling up everything outside of the base pack, but trying to
> roll up little into medium, and then eventually medium into large. If
> you roll up things that are "too big", then you end up rewriting the
> bytes more often, and your amount of work becomes super-linear.

I'm not sure I follow, it would seem to me that it would stay linear, and
be at most rewriting each new packfile once more than previously? Are you
envisioning more work than that?

> Now whether that matters all that much or not is perhaps another
> discussion. The current strategy is mostly to repack all-into-one with
> no base, which is the worst possible case. So just about any rollup
> strategy will be an improvement. ;)

+1 Yes, while anything would be an improvement, this series' approach is
very good! Thanks for doing this!!

-Martin
On Tue, Feb 23, 2021 at 02:41:09PM -0700, Martin Fick wrote:
> > > Would it make sense to somehow detect all new packs since the last
> > > rollup and always include them in the rollup no matter what their
> > > size? That is one thing that my git-exproll script did. One of the
> > > main reasons to do this was because newer packs tended to look big
> > > (I was using bytes to determine size), and newer packs were often
> > > bigger on disk compared to other packs with similar objects in them
> > > (I think you suggested this was due to the thickening of packs on
> > > receipt). Maybe roll up all packs with a timestamp "new enough", no
> > > matter how big they are?
> >
> > That works against the "geometric" part of the strategy, which is trying
> > to roll up in a sequence that is amortized-linear. I.e., we are not
> > always rolling up everything outside of the base pack, but trying to
> > roll up little into medium, and then eventually medium into large. If
> > you roll up things that are "too big", then you end up rewriting the
> > bytes more often, and your amount of work becomes super-linear.
>
> I'm not sure I follow, it would seem to me that it would stay linear, and
> be at most rewriting each new packfile once more than previously? Are you
> envisioning more work than that?

Maybe I don't understand what you're proposing.

The idea of the geometric repack is that by sorting by size and then
finding a "cutoff" within the size array, we can make sure that we roll
up a sufficiently small number of bytes in each roll-up that it ends up
linear in the size of the repo in the long run. But if we roll up
without regard to size, then our worst case is that the biggest pack is
the newest (imagine a repo with 10 small pushes and then one gigantic
one). So we roll that up with some small packs, doing effectively
O(size_of_repo) work. And then in the next roll up we do it again, and
so on. So we end up with O(size_of_repo * nr_rollups) total work. Which
is no better than having just done a full repack at each rollup.

Now I don't think we'd see that worst case in practice that much. And
depending on your definition of "new enough", you might keep nr_rollups
pretty small.

-Peff
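Putting that worst case into symbols (the notation here is added for illustration, not from the thread): let S be the total repository size, P the gigantic push with P close to S, p_i the small pushes swept up alongside it, and k the number of rollups the gigantic pack participates in before it is finally promoted. The bytes rewritten are then roughly

```latex
\[
  \sum_{i=1}^{k} \left( P + p_i \right) \;\approx\; k \cdot P \;=\; O(S \cdot k),
\]
```

which is the O(size_of_repo * nr_rollups) figure above. The size-based cutoff keeps each rollup's rewritten bytes small relative to S, which is where the amortized-linear bound comes from.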
On Tuesday, February 23, 2021 3:06:22 PM MST Taylor Blau wrote:
> On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> > Would it make sense to somehow detect all new packs since the last rollup
> > and always include them in the rollup no matter what their size? That is
> > one thing that my git-exproll script did.
>
> I'm certainly not opposed, and this could certainly be done in an
> additive way (i.e., after this series). I think the current approach has
> nice properties, but I could also see "roll-up all packs that have
> mtimes after xyz timestamp" being useful.

Just to be clear, I meant to combine the two approaches. And yes, my
suggestion would likely make more sense as an additive switch later on.

> It would even be possible to reuse a lot of the geometric repack
> machinery. Having a separate path to arrange packs by their mtimes and
> determine the "split" at the pack whose mtime is nearest the provided one
> would do exactly what you want.

I was thinking to keep all of your geometric repack machinery and only
look for the split point starting at the rightmost pack which is newer
than the provided mtime, and then possibly enhance the approach with a
clever way to use the mtime of the last consolidation (maybe by touching
a pack/.geometric file?).

> (As a side-note, reading the original threads about your git-exproll was
> quite humbling, since it turns out all of the problems I thought were
> hard had already been discussed eight years ago!)

Thanks, but I think you have likely done a much better job than what I
did. Your approach of using object counts is likely much better as it
should be stable; using byte counts is not. You are also solving only
one problem at a time, which is probably better than my hodge-podge of
at least 3 different problems. And the most important part of your
approach, as I understand it, is that it actually saves CPU time whereas
my approach only saved IO.

Cheers,

-Martin
On Tuesday, February 23, 2021 4:53:21 PM MST Jeff King wrote:
> On Tue, Feb 23, 2021 at 02:41:09PM -0700, Martin Fick wrote:
> > > > Would it make sense to somehow detect all new packs since the last
> > > > rollup and always include them in the rollup no matter what their
> > > > size? That is one thing that my git-exproll script did. One of the
> > > > main reasons to do this was because newer packs tended to look big
> > > > (I was using bytes to determine size), and newer packs were often
> > > > bigger on disk compared to other packs with similar objects in them
> > > > (I think you suggested this was due to the thickening of packs on
> > > > receipt). Maybe roll up all packs with a timestamp "new enough", no
> > > > matter how big they are?
> > >
> > > That works against the "geometric" part of the strategy, which is trying
> > > to roll up in a sequence that is amortized-linear. I.e., we are not
> > > always rolling up everything outside of the base pack, but trying to
> > > roll up little into medium, and then eventually medium into large. If
> > > you roll up things that are "too big", then you end up rewriting the
> > > bytes more often, and your amount of work becomes super-linear.
> >
> > I'm not sure I follow, it would seem to me that it would stay linear, and
> > be at most rewriting each new packfile once more than previously? Are you
> > envisioning more work than that?
>
> Maybe I don't understand what you're proposing.
>
> The idea of the geometric repack is that by sorting by size and then
> finding a "cutoff" within the size array, we can make sure that we roll
> up a sufficiently small number of bytes in each roll-up that it ends up
> linear in the size of the repo in the long run. But if we roll up
> without regard to size, then our worst case is that the biggest pack is
> the newest (imagine a repo with 10 small pushes and then one gigantic
> one). So we roll that up with some small packs, doing effectively
> O(size_of_repo) work.

This isn't quite a fair evaluation, it should be O(size_of_push) I think?

> And then in the next roll up we do it again, and so on.

I should have clarified that the intent is to prevent this by specifying
an mtime after the last rollup so that this should only ever happen once
for new packfiles. It also means you probably need special logic to
ensure this roll-up doesn't happen if there would only be one file in
the rollup.

-Martin
On Wed, Feb 24, 2021 at 11:13:53AM -0700, Martin Fick wrote:
> > The idea of the geometric repack is that by sorting by size and then
> > finding a "cutoff" within the size array, we can make sure that we roll
> > up a sufficiently small number of bytes in each roll-up that it ends up
> > linear in the size of the repo in the long run. But if we roll up
> > without regard to size, then our worst case is that the biggest pack is
> > the newest (imagine a repo with 10 small pushes and then one gigantic
> > one). So we roll that up with some small packs, doing effectively
> > O(size_of_repo) work.
>
> This isn't quite a fair evaluation, it should be O(size_of_push) I think?

Sorry, I had a longer example, but then cut it down in the name of
simplicity. But I think I made it too simple. :) You can imagine more
pushes after the gigantic one, in which case we'd roll them up with the
gigantic push. So that gigantic one is part of multiple sequential
rollups, until it is itself rolled up further.

But...

> > And then in the next roll up we do it again, and so on.
>
> I should have clarified that the intent is to prevent this by specifying
> an mtime after the last rollup so that this should only ever happen once
> for new packfiles. It also means you probably need special logic to
> ensure this roll-up doesn't happen if there would only be one file in
> the rollup.

Yes, I agree that if you record a cut point, and then avoid rolling up
across it, then you'd only consider the single push once. You probably
want to record the actual pack set rather than just an mtime cutoff,
though, since Git will update the mtime on packs sometimes (to freshen
them whenever it optimizes out an object write for an object in the
pack).

One of the nice things about looking only at the pack sizes is that you
don't have to record that cut point. :) But it's possible you'd want to
for other reasons (e.g., you may spend extra work to find good deltas in
your on-disk packs, so you want to know what is old and what is new in
order to discard on-disk deltas from pushed-up packs).

-Peff