Message ID | cover.1614047097.git.me@ttaylorr.com |
---|---|
Series | repack: support repacking into a geometric sequence |
On Mon, Feb 22, 2021 at 09:24:59PM -0500, Taylor Blau wrote:
> Here's a very lightly modified version of v3 of mine and Peff's series
> to add a new 'git repack --geometric' mode. Almost nothing has changed
> since last time, with the exception of:
>
>   - Packs listed over standard input to 'git pack-objects --stdin-packs'
>     are sorted in descending mtime order (and objects are strung
>     together in pack order as before) so that objects are laid out
>     roughly newest-to-oldest in the resulting pack.
>
>   - Swapped the order of two paragraphs in patch 5 to make the perf
>     results clearer.
>
>   - Mention '--unpacked' specifically in the documentation for 'git
>     repack --geometric'.
>
>   - Typo fixes.

Thanks, this all looks great to me.

-Peff
Taylor Blau <me@ttaylorr.com> writes:
> ++ /*
> ++ * order packs by descending mtime so that objects are laid out
> ++ * roughly as newest-to-oldest
> ++ */
> + if (a->mtime < b->mtime)
> +         return 1;
> ++ else if (b->mtime < a->mtime)
> ++         return -1;
> + else
> +         return 0;

I think this strategy makes sense when this repack using this new
feature is run for the first time in a repository that acquired many
packs over time. I am not sure what happens after the feature is
used a few times---it won't always be the newest sets of packs that
will be rewritten, but sometimes older ones are also coalesced, and
when that happens the resulting pack that consists primarily of older
objects would end up having a more recent timestamp, no?

Even then, I do agree that newer to older would be beneficial most
of the time, so this is of course not an objection against this
particular sort order.
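For readers skimming the quoted interdiff, here is the same comparator written out as a minimal, self-contained sketch. The struct and function names are illustrative assumptions made up for this example, not the names used in the series; the real code sorts the pack structures fed to 'git pack-objects --stdin-packs'.

```c
#include <stdlib.h>
#include <time.h>

/* Illustrative stand-in for whatever per-pack metadata is being sorted. */
struct pack_entry {
	time_t mtime;
};

/*
 * qsort()-style comparator: packs with newer mtimes sort earlier, so the
 * resulting pack lays objects out roughly newest-to-oldest.
 */
static int cmp_pack_mtime_desc(const void *va, const void *vb)
{
	const struct pack_entry *a = va;
	const struct pack_entry *b = vb;

	if (a->mtime < b->mtime)
		return 1;
	else if (b->mtime < a->mtime)
		return -1;
	else
		return 0;
}
```

With an array of such entries, qsort(packs, nr, sizeof(*packs), cmp_pack_mtime_desc) produces the newest-first ordering that Junio is commenting on.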
On Mon, Feb 22, 2021 at 11:43:22PM -0800, Junio C Hamano wrote:
> > ++ /*
> > ++ * order packs by descending mtime so that objects are laid out
> > ++ * roughly as newest-to-oldest
> > ++ */
> > + if (a->mtime < b->mtime)
> > +         return 1;
> > ++ else if (b->mtime < a->mtime)
> > ++         return -1;
> > + else
> > +         return 0;
>
> I think this strategy makes sense when this repack using this new
> feature is run for the first time in a repository that acquired many
> packs over time. I am not sure what happens after the feature is
> used a few times---it won't always be the newest sets of packs that
> will be rewritten, but sometimes older ones are also coalesced, and
> when that happens the resulting pack that consists primarily of older
> objects would end up having a more recent timestamp, no?

Yeah, this is definitely a heuristic that can get out of sync with
reality. I think in general if you have base pack A and somebody pushes
up B, C, and D in sequence, we're likely to roll up a single DBC (in
that order) pack. Further pushes E, F, G would have newer mtimes. So we
might get GFEDBC directly. Or we might get GFE and DBC, but the former
would still have a newer mtime, so we'd create GFEDBC on the next run.

The issues come from:

  - we are deciding what to roll up based on size. A big push might not
    get rolled up immediately, putting it out-of-sync with the rest of
    the rollups.

  - we are happy to manipulate pack mtimes under the hood as part of
    the freshen_*() code.

I think you probably wouldn't want to use this roll-up strategy all the
time (even though in theory it would eventually roll up to a single
good pack), just because it is based on heuristics like this. You'd
want to occasionally run a "real" repack that does a full traversal,
possibly pruning objects, etc. And that's how we plan to use it at
GitHub.

I don't remember how much of the root problem we've discussed on-list,
but the crux of it is: per-object costs including traversal can get
really high on big repositories. Our shared-storage repo for all
torvalds/linux forks is on the order of 45M objects, and some companies
with large and active private repositories are close to that.
Traversing the object graph takes 15+ minutes (plus another 15 for
delta island marking). For busy repositories, by the time you finish
repacking, it's time to start again. :)

> Even then, I do agree that newer to older would be beneficial most
> of the time, so this is of course not an objection against this
> particular sort order.

So yeah. I consider this best-effort for sure, and I think this sort
order is the best we can do without traversing. OTOH, we _do_ actually
do a partial traversal in this latest version of the series. We could
use that to impact the final write order. It doesn't necessarily hit
every object, though, so we'd still want to fall back on this pack
ordering heuristic.

I'm content to punt on that work for now, and leave it for a future
series after we see how this heuristic performs in practice.

-Peff
On Tuesday, February 23, 2021 1:44:05 PM MST Jeff King wrote:
> On Mon, Feb 22, 2021 at 11:43:22PM -0800, Junio C Hamano wrote:
> > > ++ /*
> > > ++ * order packs by descending mtime so that objects are laid out
> > > ++ * roughly as newest-to-oldest
> > > ++ */
> > > + if (a->mtime < b->mtime)
> > > +         return 1;
> > > ++ else if (b->mtime < a->mtime)
> > > ++         return -1;
> > > + else
> > > +         return 0;
> >
> > I think this strategy makes sense when this repack using this new
> > feature is run for the first time in a repository that acquired many
> > packs over time. I am not sure what happens after the feature is
> > used a few times---it won't always be the newest sets of packs that
> > will be rewritten, but sometimes older ones are also coalesced, and
> > when that happens the resulting pack that consists primarily of older
> > objects would end up having a more recent timestamp, no?
>
> Yeah, this is definitely a heuristic that can get out of sync with
> reality. I think in general if you have base pack A and somebody pushes
> up B, C, and D in sequence, we're likely to roll up a single DBC (in
> that order) pack. Further pushes E, F, G would have newer mtimes. So we
> might get GFEDBC directly. Or we might get GFE and DBC, but the former
> would still have a newer mtime, so we'd create GFEDBC on the next run.
>
> The issues come from:
>
>   - we are deciding what to roll up based on size. A big push might not
>     get rolled up immediately, putting it out-of-sync with the rest of
>     the rollups.

Would it make sense to somehow detect all new packs since the last
rollup and always include them in the rollup no matter what their size?
That is one thing that my git-exproll script did. One of the main
reasons to do this was because newer packs tended to look big (I was
using bytes to determine size), and newer packs were often bigger on
disk compared to other packs with similar objects in them (I think you
suggested this was due to the thickening of packs on receipt). Maybe
roll up all packs with a timestamp "new enough", no matter how big they
are?

-Martin
On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> Would it make sense to somehow detect all new packs since the last rollup
> and always include them in the rollup no matter what their size? That is
> one thing that my git-exproll script did.

I'm certainly not opposed, and this could certainly be done in an
additive way (i.e., after this series). I think the current approach has
nice properties, but I could also see "roll-up all packs that have
mtimes after xyz timestamp" being useful.

It would even be possible to reuse a lot of the geometric repack
machinery. Having a separate path to arrange packs by their mtimes and
determine the "split" at the pack whose mtime is nearest the provided one
would do exactly what you want.

(As a side-note, reading the original threads about your git-exproll was
quite humbling, since it turns out all of the problems I thought were
hard had already been discussed eight years ago!)

Thanks,
Taylor
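To make that idea concrete, here is a rough sketch of how such a split might be found, assuming packs already sorted by ascending mtime. Everything here (type, field, and function names) is hypothetical for illustration and is not part of the series.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical per-pack record; the real machinery has its own types. */
struct pack_entry {
	time_t mtime;
};

/*
 * Given packs sorted by ascending mtime, return the index of the first
 * pack at or after the cutoff. Packs from that index onward are the
 * "new enough" ones that would be rolled up together; earlier packs
 * are left untouched.
 */
static size_t mtime_split(const struct pack_entry *packs, size_t nr,
			  time_t cutoff)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (packs[i].mtime >= cutoff)
			break;
	return i;
}
```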
On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> > Yeah, this is definitely a heuristic that can get out of sync with
> > reality. I think in general if you have base pack A and somebody pushes
> > up B, C, and D in sequence, we're likely to roll up a single DBC (in
> > that order) pack. Further pushes E, F, G would have newer mtimes. So we
> > might get GFEDBC directly. Or we might get GFE and DBC, but the former
> > would still have a newer mtime, so we'd create GFEDBC on the next run.
> >
> > The issues come from:
> >
> >   - we are deciding what to roll up based on size. A big push might not
> >     get rolled up immediately, putting it out-of-sync with the rest of
> >     the rollups.
>
> Would it make sense to somehow detect all new packs since the last rollup
> and always include them in the rollup no matter what their size? That is
> one thing that my git-exproll script did. One of the main reasons to do
> this was because newer packs tended to look big (I was using bytes to
> determine size), and newer packs were often bigger on disk compared to
> other packs with similar objects in them (I think you suggested this was
> due to the thickening of packs on receipt). Maybe roll up all packs with
> a timestamp "new enough", no matter how big they are?

That works against the "geometric" part of the strategy, which is trying
to roll up in a sequence that is amortized-linear. I.e., we are not
always rolling up everything outside of the base pack, but trying to
roll up little into medium, and then eventually medium into large. If
you roll up things that are "too big", then you end up rewriting the
bytes more often, and your amount of work becomes super-linear.

Now whether that matters all that much or not is perhaps another
discussion. The current strategy is mostly to repack all-into-one with
no base, which is the worst possible case. So just about any rollup
strategy will be an improvement. ;)

-Peff
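As a rough illustration of the "geometric" cutoff described here (a simplified sketch, not the logic the series actually implements), one can walk a size-sorted list of packs and stop at the first pack that is already at least `factor` times bigger than everything accumulated below it:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * counts[] holds per-pack object counts, sorted ascending. Returns the
 * index of the first pack to keep; packs below that index get rolled up
 * into one. Keeping the rolled-up mass at most 1/factor of the smallest
 * surviving pack bounds the bytes rewritten per run, which is what keeps
 * the total work roughly amortized-linear.
 */
static size_t geometric_split(const uint64_t *counts, size_t nr,
			      uint64_t factor)
{
	uint64_t rolled_up = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (rolled_up && counts[i] >= factor * rolled_up)
			break;	/* pack i and everything above it survive */
		rolled_up += counts[i];
	}
	return i;	/* i == nr means everything gets rolled up */
}
```

The series' real split logic is more careful than this sketch (for instance, the packs it keeps must themselves respect the geometric factor), but the shape of the cutoff is similar.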
On Tuesday, February 23, 2021 3:15:12 PM MST Jeff King wrote:
> On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> > > Yeah, this is definitely a heuristic that can get out of sync with
> > > reality. I think in general if you have base pack A and somebody pushes
> > > up B, C, and D in sequence, we're likely to roll up a single DBC (in
> > > that order) pack. Further pushes E, F, G would have newer mtimes. So we
> > > might get GFEDBC directly. Or we might get GFE and DBC, but the former
> > > would still have a newer mtime, so we'd create GFEDBC on the next run.
> > >
> > > The issues come from:
> > >
> > >   - we are deciding what to roll up based on size. A big push might not
> > >     get rolled up immediately, putting it out-of-sync with the rest of
> > >     the rollups.
> >
> > Would it make sense to somehow detect all new packs since the last rollup
> > and always include them in the rollup no matter what their size? That is
> > one thing that my git-exproll script did. One of the main reasons to do
> > this was because newer packs tended to look big (I was using bytes to
> > determine size), and newer packs were often bigger on disk compared to
> > other packs with similar objects in them (I think you suggested this was
> > due to the thickening of packs on receipt). Maybe roll up all packs with
> > a timestamp "new enough", no matter how big they are?
>
> That works against the "geometric" part of the strategy, which is trying
> to roll up in a sequence that is amortized-linear. I.e., we are not
> always rolling up everything outside of the base pack, but trying to
> roll up little into medium, and then eventually medium into large. If
> you roll up things that are "too big", then you end up rewriting the
> bytes more often, and your amount of work becomes super-linear.

I'm not sure I follow, it would seem to me that it would stay linear, and
be at most rewriting each new packfile once more than previously? Are you
envisioning more work than that?

> Now whether that matters all that much or not is perhaps another
> discussion. The current strategy is mostly to repack all-into-one with
> no base, which is the worst possible case. So just about any rollup
> strategy will be an improvement. ;)

+1 Yes, while anything would be an improvement, this series' approach is
very good! Thanks for doing this!!

-Martin
On Tue, Feb 23, 2021 at 02:41:09PM -0700, Martin Fick wrote:
> > > Would it make sense to somehow detect all new packs since the last
> > > rollup and always include them in the rollup no matter what their
> > > size? That is one thing that my git-exproll script did. One of the
> > > main reasons to do this was because newer packs tended to look big
> > > (I was using bytes to determine size), and newer packs were often
> > > bigger on disk compared to other packs with similar objects in them
> > > (I think you suggested this was due to the thickening of packs on
> > > receipt). Maybe roll up all packs with a timestamp "new enough", no
> > > matter how big they are?
> >
> > That works against the "geometric" part of the strategy, which is trying
> > to roll up in a sequence that is amortized-linear. I.e., we are not
> > always rolling up everything outside of the base pack, but trying to
> > roll up little into medium, and then eventually medium into large. If
> > you roll up things that are "too big", then you end up rewriting the
> > bytes more often, and your amount of work becomes super-linear.
>
> I'm not sure I follow, it would seem to me that it would stay linear, and
> be at most rewriting each new packfile once more than previously? Are you
> envisioning more work than that?

Maybe I don't understand what you're proposing.

The idea of the geometric repack is that by sorting by size and then
finding a "cutoff" within the size array, we can make sure that we roll
up a sufficiently small number of bytes in each roll-up that it ends up
linear in the size of the repo in the long run. But if we roll up
without regard to size, then our worst case is that the biggest pack is
the newest (imagine a repo with 10 small pushes and then one gigantic
one). So we roll that up with some small packs, doing effectively
O(size_of_repo) work. And then in the next roll up we do it again, and
so on. So we end up with O(size_of_repo * nr_rollups) total work. Which
is no better than having just done a full repack at each rollup.

Now I don't think we'd see that worst case in practice that much. And
depending on your definition of "new enough", you might keep nr_rollups
pretty small.

-Peff
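Putting that worst case into symbols (the notation here is added for illustration, not from the thread): let S be the total repository size, P the gigantic push with P close to S, p_i the small pushes swept up alongside it, and k the number of rollups the gigantic pack participates in before it is finally promoted. The bytes rewritten are then roughly

```latex
\[
  \sum_{i=1}^{k} \left( P + p_i \right) \;\approx\; k \cdot P \;=\; O(S \cdot k),
\]
```

which is the O(size_of_repo * nr_rollups) figure above. The size-based cutoff keeps each rollup's rewritten bytes small relative to S, which is where the amortized-linear bound comes from.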
On Tuesday, February 23, 2021 3:06:22 PM MST Taylor Blau wrote:
> On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:
> > Would it make sense to somehow detect all new packs since the last rollup
> > and always include them in the rollup no matter what their size? That is
> > one thing that my git-exproll script did.
>
> I'm certainly not opposed, and this could certainly be done in an
> additive way (i.e., after this series). I think the current approach has
> nice properties, but I could also see "roll-up all packs that have
> mtimes after xyz timestamp" being useful.

Just to be clear, I meant to combine the two approaches. And yes, my
suggestion would likely make more sense as an additive switch later on.

> It would even be possible to reuse a lot of the geometric repack
> machinery. Having a separate path to arrange packs by their mtimes and
> determine the "split" at the pack whose mtime is nearest the provided one
> would do exactly what you want.

I was thinking to keep all of your geometric repack machinery and only
look for the split point starting at the rightmost pack which is newer
than the provided mtime, and then possibly enhance the approach with a
clever way to use the mtime of the last consolidation (maybe by touching
a pack/.geometric file?).

> (As a side-note, reading the original threads about your git-exproll was
> quite humbling, since it turns out all of the problems I thought were
> hard had already been discussed eight years ago!)

Thanks, but I think you have likely done a much better job than what I
did. Your approach of using object counts is likely much better as it
should be stable; using byte counts is not. You are also solving only
one problem at a time, which is probably better than my hodge-podge of
at least 3 different problems. And the most important part of your
approach, as I understand it, is that it actually saves CPU time whereas
my approach only saved IO.

Cheers,

-Martin
On Tuesday, February 23, 2021 4:53:21 PM MST Jeff King wrote:
> On Tue, Feb 23, 2021 at 02:41:09PM -0700, Martin Fick wrote:
> > > > Would it make sense to somehow detect all new packs since the last
> > > > rollup and always include them in the rollup no matter what their
> > > > size? That is one thing that my git-exproll script did. One of the
> > > > main reasons to do this was because newer packs tended to look big
> > > > (I was using bytes to determine size), and newer packs were often
> > > > bigger on disk compared to other packs with similar objects in them
> > > > (I think you suggested this was due to the thickening of packs on
> > > > receipt). Maybe roll up all packs with a timestamp "new enough", no
> > > > matter how big they are?
> > >
> > > That works against the "geometric" part of the strategy, which is trying
> > > to roll up in a sequence that is amortized-linear. I.e., we are not
> > > always rolling up everything outside of the base pack, but trying to
> > > roll up little into medium, and then eventually medium into large. If
> > > you roll up things that are "too big", then you end up rewriting the
> > > bytes more often, and your amount of work becomes super-linear.
> >
> > I'm not sure I follow, it would seem to me that it would stay linear, and
> > be at most rewriting each new packfile once more than previously? Are you
> > envisioning more work than that?
>
> Maybe I don't understand what you're proposing.
>
> The idea of the geometric repack is that by sorting by size and then
> finding a "cutoff" within the size array, we can make sure that we roll
> up a sufficiently small number of bytes in each roll-up that it ends up
> linear in the size of the repo in the long run. But if we roll up
> without regard to size, then our worst case is that the biggest pack is
> the newest (imagine a repo with 10 small pushes and then one gigantic
> one). So we roll that up with some small packs, doing effectively
> O(size_of_repo) work.

This isn't quite a fair evaluation, it should be O(size_of_push) I think?

> And then in the next roll up we do it again, and so on.

I should have clarified that the intent is to prevent this by specifying
an mtime after the last rollup so that this should only ever happen once
for new packfiles. It also means you probably need special logic to
ensure this roll-up doesn't happen if there would only be one file in
the rollup.

-Martin
On Wed, Feb 24, 2021 at 11:13:53AM -0700, Martin Fick wrote:
> > The idea of the geometric repack is that by sorting by size and then
> > finding a "cutoff" within the size array, we can make sure that we roll
> > up a sufficiently small number of bytes in each roll-up that it ends up
> > linear in the size of the repo in the long run. But if we roll up
> > without regard to size, then our worst case is that the biggest pack is
> > the newest (imagine a repo with 10 small pushes and then one gigantic
> > one). So we roll that up with some small packs, doing effectively
> > O(size_of_repo) work.
>
> This isn't quite a fair evaluation, it should be O(size_of_push) I think?

Sorry, I had a longer example, but then cut it down in the name of
simplicity. But I think I made it too simple. :) You can imagine more
pushes after the gigantic one, in which case we'd roll them up with the
gigantic push. So that gigantic one is part of multiple sequential
rollups, until it is itself rolled up further.

But...

> > And then in the next roll up we do it again, and so on.
>
> I should have clarified that the intent is to prevent this by specifying
> an mtime after the last rollup so that this should only ever happen once
> for new packfiles. It also means you probably need special logic to
> ensure this roll-up doesn't happen if there would only be one file in
> the rollup.

Yes, I agree that if you record a cut point, and then avoid rolling up
across it, then you'd only consider the single push once. You probably
want to record the actual pack set rather than just an mtime cutoff,
though, since Git will update the mtime on packs sometimes (to freshen
them whenever it optimizes out an object write for an object in the
pack).

One of the nice things about looking only at the pack sizes is that you
don't have to record that cut point. :) But it's possible you'd want to
for other reasons (e.g., you may spend extra work to find good deltas in
your on-disk packs, so you want to know what is old and what is new in
order to discard on-disk deltas from pushed-up packs).

-Peff