mbox series

[v2,0/2] pack-objects: freshen objects with multi-cruft packs

Message ID cover.1741124116.git.me@ttaylorr.com (mailing list archive)
Headers show
Series pack-objects: freshen objects with multi-cruft packs | expand

Message

Taylor Blau March 4, 2025, 9:35 p.m. UTC
This is a minor reroll of my series to fix a bug in freshening objects
with multiple cruft packs.

Since last time the only updates to the series are textual changes to
the commit message, so everything functionally should be working as it
was before.

As usual, there is a range-diff showing as much below, and the original
cover letter is as follows:

---

This short series contains a fix for a bug I noticed while rolling out
multi-cruft packs (via 'git repack --max-cruft-size') within GitHub's
infrastructure.

The series is structured as follows:

 - The first patch simplifies how 'repack' aggregates cruft packs
   together when their size is below the '--max-cruft-size' or
   '--max-pack-size' threshold. This simplification changes behavior
   slightly, but not in a meaningful way. It occurred to me while
   writing the second patch.

 - The second patch describes and fixes the main bug. The gist here is
   that objects which are (a) unreachable, (b) exist in a cruft pack
   being retained, and (c) were freshened to have a more recent mtime
   than any existing cruft copy are unable to be freshened.

The fix pursued in the second patch changes the rules around when we
want to retain an object via builtin/pack-objects.c::want_found_object()
when at least one cruft pack will survive the repack.

Previously the rule was to discard any object which appears in any
surviving pack, regardless of mtime. The rule now is to only discard an
object if it appears in either (a) a non-cruft pack which will survive
the repack, or (b) a cruft pack whose mtime for that object is older
than the one we are trying to pack.

I think that this is the right behavior, but admittedly putting this
series together hurt my brain trying to think through all of the cases.
I'm fairly confident in the testing here as I remember it being fairly
exhaustive of all interesting cases. But I'd appreciate a sanity check
from others that they too are convinced this is the right approach.

Thanks in advance for your review!

Taylor Blau (2):
  builtin/repack.c: simplify cruft pack aggregation
  builtin/pack-objects.c: freshen objects from existing cruft packs

 builtin/pack-objects.c  | 118 ++++++++++++++++++++++++++++++++++------
 builtin/repack.c        |  38 +------------
 packfile.c              |   3 +-
 packfile.h              |   2 +
 t/t7704-repack-cruft.sh | 105 +++++++++++++++++++++--------------
 5 files changed, 170 insertions(+), 96 deletions(-)

Range-diff against v1:
1:  8564f982597 ! 1:  63ea9d4d00e builtin/repack.c: simplify cruft pack aggregation
    @@ Commit message
         would get combined together until the sum of their sizes was no larger
         than the given max pack size.
     
    -    There is a much simpler way to achieve this, however, which is to simply
    -    combine *all* cruft packs which are smaller than the threshold,
    +    There is a much simpler way to combine cruft packs, however, which is to
    +    simply combine *all* cruft packs which are smaller than the threshold,
         regardless of what their sum is. With '--max-pack-size', 'pack-objects'
         will split out the resulting pack into individual pack(s) if necessary
         to ensure that the written pack(s) are each no larger than the provided
2:  c0c926adde2 ! 2:  7ba9054701b builtin/pack-objects.c: freshen objects from existing cruft packs
    @@ Commit message
         only be modified in a pruning GC, or if the threshold itself is
         adjusted.
     
    -    However, this process breaks down when we attempt to freshen an object
    -    packed in an earlier cruft pack that is larger than the threshold and
    -    thus will survive the repack.
    +    Prior to this patch, however, this process breaks down when we attempt
    +    to freshen an object packed in an earlier cruft pack, and that cruft
    +    pack is larger than the threshold and thus will survive the repack.
     
         When this is the case, it is impossible to freshen objects in cruft
    -    pack(s) which are larger than the threshold. This is because we avoid
    -    writing them in the new cruft pack entirely, for a couple of reasons.
    +    pack(s) when those cruft packs are larger than the threshold. This is
    +    because we would avoid writing them in the new cruft pack entirely, for
    +    a couple of reasons.
     
          1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
             we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
    @@ Commit message
          - exists in a non-cruft pack that we are retaining, regardless of that
            pack's mtime, or
     
    -     - exists in a cruft pack with an mtime more recent than the copy we are
    -       debating whether or not to pack, in which case freshening would be
    -       redundant.
    +     - exists in a cruft pack with an mtime at least as recent as the copy
    +       we are debating whether or not to pack, in which case freshening
    +       would be redundant.
     
         To do this, keep track of whether or not we have any cruft packs in our
         in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
         flag. When we end up in this new special case, we replace a call to
    -    'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only
    -    reject objects when we have a copy in an existing cruft pack with a more
    -    recent mtime (in which case "freshening" would be redundant).
    +    'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
    +    objects when we have a copy in an existing cruft pack with at least as
    +    recent an mtime as our candidate (in which case "freshening" would be
    +    redundant).
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ t/t7704-repack-cruft.sh: test_expect_success '--max-cruft-size with freshened ob
     +
     +		git repack --cruft -d &&
     +
    -+		# Make a packed copy of object $foo with a more recent
    -+		# mtime.
    ++		# Make an identical copy of foo stored in a pack with a more
    ++		# recent mtime.
     +		foo="$(generate_random_blob foo $((2*1024*1024)))" &&
     +		foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
     +		test-tool chmtime --get -100 \
     +			"$packdir/pack-$foo_pack.pack" >foo.new &&
     +		git prune-packed &&
     +
    -+		# Make a loose copy of object $bar with a more recent
    -+		# mtime.
    ++		# Make a loose copy of bar, also with a more recent mtime.
     +		bar="$(generate_random_blob bar $((2*1024*1024)))" &&
     +		test-tool chmtime --get -100 \
     +			"$objdir/$(test_oid_to_path "$bar")" >bar.new &&

base-commit: 08bdfd453584e489d5a551aecbdcb77584e1b958

Comments

Elijah Newren March 4, 2025, 10:55 p.m. UTC | #1
On Tue, Mar 4, 2025 at 1:35 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> This is a minor reroll of my series to fix a bug in freshening objects
> with multiple cruft packs.
>
> Since last time the only updates to the series are textual changes to
> the commit message, so everything functionally should be working as it
> was before.
>
> As usual, there is a range-diff showing as much below, and the original
> cover letter is as follows:
>
> ---
>
> This short series contains a fix for a bug I noticed while rolling out
> multi-cruft packs (via 'git repack --max-cruft-size') within GitHub's
> infrastructure.
>
> The series is structured as follows:
>
>  - The first patch simplifies how 'repack' aggregates cruft packs
>    together when their size is below the '--max-cruft-size' or
>    '--max-pack-size' threshold. This simplification changes behavior
>    slightly, but not in a meaningful way. It occurred to me while
>    writing the second patch.
>
>  - The second patch describes and fixes the main bug. The gist here is
>    that objects which are (a) unreachable, (b) exist in a cruft pack
>    being retained, and (c) were freshened to have a more recent mtime
>    than any existing cruft copy are unable to be freshened.
>
> The fix pursued in the second patch changes the rules around when we
> want to retain an object via builtin/pack-objects.c::want_found_object()
> when at least one cruft pack will survive the repack.
>
> Previously the rule was to discard any object which appears in any
> surviving pack, regardless of mtime. The rule now is to only discard an
> object if it appears in either (a) a non-cruft pack which will survive
> the repack, or (b) a cruft pack whose mtime for that object is older
> than the one we are trying to pack.
>
> I think that this is the right behavior, but admittedly putting this
> series together hurt my brain trying to think through all of the cases.
> I'm fairly confident in the testing here as I remember it being fairly
> exhaustive of all interesting cases. But I'd appreciate a sanity check
> from others that they too are convinced this is the right approach.
>
> Thanks in advance for your review!
>
> Taylor Blau (2):
>   builtin/repack.c: simplify cruft pack aggregation
>   builtin/pack-objects.c: freshen objects from existing cruft packs
>
>  builtin/pack-objects.c  | 118 ++++++++++++++++++++++++++++++++++------
>  builtin/repack.c        |  38 +------------
>  packfile.c              |   3 +-
>  packfile.h              |   2 +
>  t/t7704-repack-cruft.sh | 105 +++++++++++++++++++++--------------
>  5 files changed, 170 insertions(+), 96 deletions(-)
>
> Range-diff against v1:
> 1:  8564f982597 ! 1:  63ea9d4d00e builtin/repack.c: simplify cruft pack aggregation
>     @@ Commit message
>          would get combined together until the sum of their sizes was no larger
>          than the given max pack size.
>
>     -    There is a much simpler way to achieve this, however, which is to simply
>     -    combine *all* cruft packs which are smaller than the threshold,
>     +    There is a much simpler way to combine cruft packs, however, which is to
>     +    simply combine *all* cruft packs which are smaller than the threshold,
>          regardless of what their sum is. With '--max-pack-size', 'pack-objects'
>          will split out the resulting pack into individual pack(s) if necessary
>          to ensure that the written pack(s) are each no larger than the provided
> 2:  c0c926adde2 ! 2:  7ba9054701b builtin/pack-objects.c: freshen objects from existing cruft packs
>     @@ Commit message
>          only be modified in a pruning GC, or if the threshold itself is
>          adjusted.
>
>     -    However, this process breaks down when we attempt to freshen an object
>     -    packed in an earlier cruft pack that is larger than the threshold and
>     -    thus will survive the repack.
>     +    Prior to this patch, however, this process breaks down when we attempt
>     +    to freshen an object packed in an earlier cruft pack, and that cruft
>     +    pack is larger than the threshold and thus will survive the repack.
>
>          When this is the case, it is impossible to freshen objects in cruft
>     -    pack(s) which are larger than the threshold. This is because we avoid
>     -    writing them in the new cruft pack entirely, for a couple of reasons.
>     +    pack(s) when those cruft packs are larger than the threshold. This is
>     +    because we would avoid writing them in the new cruft pack entirely, for
>     +    a couple of reasons.
>
>           1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
>              we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
>     @@ Commit message
>           - exists in a non-cruft pack that we are retaining, regardless of that
>             pack's mtime, or
>
>     -     - exists in a cruft pack with an mtime more recent than the copy we are
>     -       debating whether or not to pack, in which case freshening would be
>     -       redundant.
>     +     - exists in a cruft pack with an mtime at least as recent as the copy
>     +       we are debating whether or not to pack, in which case freshening
>     +       would be redundant.
>
>          To do this, keep track of whether or not we have any cruft packs in our
>          in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
>          flag. When we end up in this new special case, we replace a call to
>     -    'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only
>     -    reject objects when we have a copy in an existing cruft pack with a more
>     -    recent mtime (in which case "freshening" would be redundant).
>     +    'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
>     +    objects when we have a copy in an existing cruft pack with at least as
>     +    recent an mtime as our candidate (in which case "freshening" would be
>     +    redundant).
>
>          Signed-off-by: Taylor Blau <me@ttaylorr.com>
>
>     @@ t/t7704-repack-cruft.sh: test_expect_success '--max-cruft-size with freshened ob
>      +
>      +          git repack --cruft -d &&
>      +
>     -+          # Make a packed copy of object $foo with a more recent
>     -+          # mtime.
>     ++          # Make an identical copy of foo stored in a pack with a more
>     ++          # recent mtime.
>      +          foo="$(generate_random_blob foo $((2*1024*1024)))" &&
>      +          foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
>      +          test-tool chmtime --get -100 \
>      +                  "$packdir/pack-$foo_pack.pack" >foo.new &&
>      +          git prune-packed &&
>      +
>     -+          # Make a loose copy of object $bar with a more recent
>     -+          # mtime.
>     ++          # Make a loose copy of bar, also with a more recent mtime.
>      +          bar="$(generate_random_blob bar $((2*1024*1024)))" &&
>      +          test-tool chmtime --get -100 \
>      +                  "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
>
> base-commit: 08bdfd453584e489d5a551aecbdcb77584e1b958
> --
> 2.49.0.rc0.57.gdb91954e186.dirty

v2 looks good to me; though I'm curious if some wording improvement in
the commit message might help in distinguishing between
--max-cruft-size and --max-pack-size...and whether we want to provide
any checks on the relative sizes of the two.
Taylor Blau March 5, 2025, 12:06 a.m. UTC | #2
On Tue, Mar 04, 2025 at 02:55:00PM -0800, Elijah Newren wrote:
> v2 looks good to me; though I'm curious if some wording improvement in
> the commit message might help in distinguishing between
> --max-cruft-size and --max-pack-size...and whether we want to provide
> any checks on the relative sizes of the two.

Thanks for the review! :-)

--max-cruft-size and --max-pack-size are the same thing from
pack-objects' perspective; the two flags exist at the repack level in
case you want to set a different maximum pack size for cruft- and
non-cruft packs.

Both end up as a --max-pack-size value when repack invokes pack-objects
for the cruft and non-cruft case.

Thanks,
Taylor
Taylor Blau March 5, 2025, 12:13 a.m. UTC | #3
On Tue, Mar 04, 2025 at 07:06:10PM -0500, Taylor Blau wrote:
> --max-cruft-size and --max-pack-size are the same thing from
> pack-objects' perspective; the two flags exist at the repack level in
> case you want to set a different maximum pack size for cruft- and
> non-cruft packs.
>
> Both end up as a --max-pack-size value when repack invokes pack-objects
> for the cruft and non-cruft case.

Of course, I should have read Patrick's and your discussion earlier in
the thread before sending a new round. The first patch should
*definitely* not be queued.

I'll send a clean v3 that Junio can apply if he likes.

Thanks,
Taylor