mbox series

[v2,0/8] midx-write: miscellaneous clean-ups for incremental MIDXs

Message ID cover.1717023301.git.me@ttaylorr.com (mailing list archive)
Headers show
Series midx-write: miscellaneous clean-ups for incremental MIDXs | expand

Message

Taylor Blau May 29, 2024, 10:55 p.m. UTC
This is a small reroll of my series which has a grab-bag of midx-write
related cleanups that I pulled out of a larger series to implement
incremental MIDX chains.

This series is mostly the same as last time, but notable changes
include:

  - renamed `get_sorted_entries()` to `compute_sorted_entries()` to
    avoid confusion that the function performs a side-effecting write to
    `ctx->entries_nr`.

  - fixed a handful of typos

  - add explanations in a couple of the patches to better motivate the
    change.

Thanks to Patrick Steinhardt for his careful review on the previous
round!

Taylor Blau (8):
  midx-write.c: tolerate `--preferred-pack` without bitmaps
  midx-write.c: reduce argument count for `get_sorted_entries()`
  midx-write.c: pass `start_pack` to `compute_sorted_entries()`
  midx-write.c: extract `should_include_pack()`
  midx-write.c: extract `fill_packs_from_midx()`
  midx-write.c: support reading an existing MIDX with `packs_to_include`
  midx: replace `get_midx_rev_filename()` with a generic helper
  pack-bitmap.c: reimplement `midx_bitmap_filename()` with helper

 midx-write.c                | 161 ++++++++++++++++++------------------
 midx.c                      |  14 ++--
 midx.h                      |   6 +-
 pack-bitmap.c               |   5 +-
 pack-revindex.c             |   3 +-
 t/t5319-multi-pack-index.sh |  23 ++++++
 6 files changed, 119 insertions(+), 93 deletions(-)

Range-diff against v1:
1:  c753bc379b ! 1:  ad268bd264 midx-write.c: tolerate `--preferred-pack` without bitmaps
    @@ Commit message
         unreferenced packs via 'git multi-pack-index expire').
     
         In practice, this isn't possible to trigger when running `git
    -    multi-pack-index write` from via `git repack`, as the latter always
    -    passes `--stdin-packs`, which prevents us from loading an existing MIDX,
    -    as it forces all packs to be read from disk.
    +    multi-pack-index write` from `git repack`, as the latter always passes
    +    `--stdin-packs`, which prevents us from loading an existing MIDX, as it
    +    forces all packs to be read from disk.
     
         But a future commit in this series will change that behavior to
         unconditionally load an existing MIDX, even with `--stdin-packs`, making
2:  07dad5a581 ! 2:  9d422efe6e midx-write.c: reduce argument count for `get_sorted_entries()`
    @@ Commit message
     
         Simplify the declaration of this function by taking a single pointer to
         the whole `struct write_midx_context` instead of various members within
    -    it.
    +    it. Since this function is now computing the entire result (populating
    +    both `ctx->entries`, and `ctx->entries_nr`), rename it to something that
    +    doesn't start with "get_" to make clear that this function has a
    +    side-effect.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
    @@ midx-write.c: static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout
     -						  uint32_t nr_packs,
     -						  size_t *nr_objects,
     -						  int preferred_pack)
    -+static struct pack_midx_entry *get_sorted_entries(struct write_midx_context *ctx)
    ++static void compute_sorted_entries(struct write_midx_context *ctx)
      {
      	uint32_t cur_fanout, cur_pack, cur_object;
      	size_t alloc_objects, total_objects = 0;
      	struct midx_fanout fanout = { 0 };
    - 	struct pack_midx_entry *deduplicated_entries = NULL;
    +-	struct pack_midx_entry *deduplicated_entries = NULL;
     -	uint32_t start_pack = m ? m->num_packs : 0;
     +	uint32_t start_pack = ctx->m ? ctx->m->num_packs : 0;
      
    @@ midx-write.c: static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout
      	/*
      	 * As we de-duplicate by fanout value, we expect the fanout
     @@ midx-write.c: static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
    + 	alloc_objects = fanout.alloc = total_objects > 3200 ? total_objects / 200 : 16;
      
      	ALLOC_ARRAY(fanout.entries, fanout.alloc);
    - 	ALLOC_ARRAY(deduplicated_entries, alloc_objects);
    +-	ALLOC_ARRAY(deduplicated_entries, alloc_objects);
     -	*nr_objects = 0;
    ++	ALLOC_ARRAY(ctx->entries, alloc_objects);
     +	ctx->entries_nr = 0;
      
      	for (cur_fanout = 0; cur_fanout < 256; cur_fanout++) {
    @@ midx-write.c: static struct pack_midx_entry *get_sorted_entries(struct multi_pac
      				continue;
      
     -			ALLOC_GROW(deduplicated_entries, st_add(*nr_objects, 1),
    -+			ALLOC_GROW(deduplicated_entries, st_add(ctx->entries_nr, 1),
    ++			ALLOC_GROW(ctx->entries, st_add(ctx->entries_nr, 1),
      				   alloc_objects);
     -			memcpy(&deduplicated_entries[*nr_objects],
    -+			memcpy(&deduplicated_entries[ctx->entries_nr],
    ++			memcpy(&ctx->entries[ctx->entries_nr],
      			       &fanout.entries[cur_object],
      			       sizeof(struct pack_midx_entry));
     -			(*nr_objects)++;
    @@ midx-write.c: static struct pack_midx_entry *get_sorted_entries(struct multi_pac
      		}
      	}
      
    + 	free(fanout.entries);
    +-	return deduplicated_entries;
    + }
    + 
    + static int write_midx_pack_names(struct hashfile *f, void *data)
     @@ midx-write.c: static int write_midx_internal(const char *object_dir,
      		}
      	}
      
     -	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr,
     -					 ctx.preferred_pack_idx);
    -+	ctx.entries = get_sorted_entries(&ctx);
    ++	compute_sorted_entries(&ctx);
      
      	ctx.large_offsets_needed = 0;
      	for (i = 0; i < ctx.entries_nr; i++) {
3:  7acf4557dc ! 3:  e81296f8cc midx-write.c: pass `start_pack` to `get_sorted_entries()`
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    midx-write.c: pass `start_pack` to `get_sorted_entries()`
    +    midx-write.c: pass `start_pack` to `compute_sorted_entries()`
     
    -    The function `get_sorted_entries()` is broadly responsible for
    +    The function `compute_sorted_entries()` is broadly responsible for
         building an array of the objects to be written into a MIDX based on the
         provided list of packs.
     
    @@ Commit message
     
         The existing implementation simply skips past the first
         ctx->m->num_packs (if ctx->m is non-NULL, indicating that we loaded an
    -    existing MIDX). Future changes (outside the scope of this patch series)
    -    to the MIDX code will require us to skip *at most* that number[^1].
    +    existing MIDX). This is because we read objects in packs from an
    +    existing MIDX via the MIDX itself, rather than from the pack-level
    +    fanout to guarantee a de-duplicated result (see: a40498a1265 (midx: use
    +    existing midx when writing new one, 2018-07-12)).
    +
    +    Future changes (outside the scope of this patch series) to the MIDX code
    +    will require us to skip *at most* that number[^1].
     
         We could tag each pack with a bit that indicates the pack's contents
         should be included in the MIDX. But we can just as easily determine the
    @@ midx-write.c: static void midx_fanout_add_pack_fanout(struct midx_fanout *fanout
       * Copy only the de-duplicated entries (selected by most-recent modified time
       * of a packfile containing the object).
       */
    --static struct pack_midx_entry *get_sorted_entries(struct write_midx_context *ctx)
    -+static struct pack_midx_entry *get_sorted_entries(struct write_midx_context *ctx,
    -+						  uint32_t start_pack)
    +-static void compute_sorted_entries(struct write_midx_context *ctx)
    ++static void compute_sorted_entries(struct write_midx_context *ctx,
    ++				   uint32_t start_pack)
      {
      	uint32_t cur_fanout, cur_pack, cur_object;
      	size_t alloc_objects, total_objects = 0;
      	struct midx_fanout fanout = { 0 };
    - 	struct pack_midx_entry *deduplicated_entries = NULL;
     -	uint32_t start_pack = ctx->m ? ctx->m->num_packs : 0;
      
      	for (cur_pack = start_pack; cur_pack < ctx->nr; cur_pack++)
    @@ midx-write.c: static int write_midx_internal(const char *object_dir,
      		}
      	}
      
    --	ctx.entries = get_sorted_entries(&ctx);
    -+	ctx.entries = get_sorted_entries(&ctx, start_pack);
    +-	compute_sorted_entries(&ctx);
    ++	compute_sorted_entries(&ctx, start_pack);
      
      	ctx.large_offsets_needed = 0;
      	for (i = 0; i < ctx.entries_nr; i++) {
4:  3908546ea8 ! 4:  9cd9257465 midx-write.c: extract `should_include_pack()`
    @@ Commit message
           - or, appear in the "to_include" list, if invoking the MIDX write
             machinery with the `--stdin-packs` command-line argument.
     
    -    In a future commit will want to call a slight variant of these checks
    -    from the code that reuses all packs from an existing MIDX, as well as
    -    the current location via add_pack_to_midx(). The latter will be
    -    modified in subsequent commits to only reuse packs which appear in the
    -    to_include list, if one was given.
    +    A future commit will want to call a slight variant of these checks from
    +    the code that reuses all packs from an existing MIDX, as well as the
    +    current location via add_pack_to_midx(). The latter will be modified in
    +    subsequent commits to only reuse packs which appear in the to_include
    +    list, if one was given.
     
         Prepare for that step by extracting these checks as a subroutine that
         may be called from both places.
5:  dc813ea1ca = 5:  1bb890e87c midx-write.c: extract `fill_packs_from_midx()`
6:  61268114c6 ! 6:  fe187b1939 midx-write.c: support reading an existing MIDX with `packs_to_include`
    @@ Commit message
         reject all of the packs it provided since they appear in an existing
         MIDX by definition.
     
    +    The sum total of this change is that we are now able to read and
    +    reference objects in an existing MIDX even when given a non-NULL
    +    `packs_to_include`. This is a prerequisite step for incremental MIDXs,
    +    which need to load any existing MIDX (if one is present) in order to
    +    determine whether or not an object already appears in an earlier portion
    +    of the MIDX to avoid duplicating it across multiple portions.
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## midx-write.c ##
7:  f4c0167f43 = 7:  f4f977c1c7 midx: replace `get_midx_rev_filename()` with a generic helper
8:  79e3f7f83f = 8:  bcadaf9278 pack-bitmap.c: reimplement `midx_bitmap_filename()` with helper

base-commit: 3a57aa566a21e7a510c64881bc6bdff7eb397988

Comments

Jeff King May 30, 2024, 7:14 a.m. UTC | #1
On Wed, May 29, 2024 at 06:55:15PM -0400, Taylor Blau wrote:

> This is a small reroll of my series which has a grab-bag of midx-write
> related cleanups that I pulled out of a larger series to implement
> incremental MIDX chains.

These all look pretty reasonable to me. Thanks for breaking them off of
the larger series. I think it's generally nice to get things in smaller
chunks.

Sometimes it is a little tough to evaluate refactorings without seeing
the larger context in which they'd be used. But all of these were either
immediate improvements, or didn't take much imagination to see where
they'd make later things easier.

-Peff
Taylor Blau May 30, 2024, 1:59 p.m. UTC | #2
On Thu, May 30, 2024 at 03:14:32AM -0400, Jeff King wrote:
> On Wed, May 29, 2024 at 06:55:15PM -0400, Taylor Blau wrote:
>
> > This is a small reroll of my series which has a grab-bag of midx-write
> > related cleanups that I pulled out of a larger series to implement
> > incremental MIDX chains.
>
> These all look pretty reasonable to me. Thanks for breaking them off of
> the larger series. I think it's generally nice to get things in smaller
> chunks.

Thanks for reviewing! It is much appreciated, as always :-).

> Sometimes it is a little tough to evaluate refactorings without seeing
> the larger context in which they'd be used. But all of these were either
> immediate improvements, or didn't take much imagination to see where
> they'd make later things easier.

Yeah, I feel like this is something that I'm still trying to find the
right balance with. I think with the pseudo-merge bitmaps series, I
erred on the side of making the series too large, which made it
difficult to easily review.

Seeing the review process on that series made me a little uneasy about
sending the incremental MIDX bitmaps series, which was growing similarly
large. I think that seeing these 8 patches or so get sent in preparation
makes the later series easier to review, but I agree that they are
somewhat unsatisfying on their own.

I think in this case it was a reasonable trade-off to make, but you
could probably make a compelling case either way there ;-).

On the bright side, what was once a ~30 patch series is now three series
(of which this is the first one). The main series is now 13 or so
patches, the end state of which is to have incremental MIDXs working
without support for reachability bitmaps. The final series I'm planning
in this area adds support for reachability bitmaps with incremental
MIDXs, but that series is only half a dozen or so patches.

I'm hoping that by sending it in chunks, the end-to-end review process
will be more straightforward and approachable. If you have any other
ideas to make it easier, definitely let me know!

Thanks,
Taylor
Patrick Steinhardt May 31, 2024, 8:28 a.m. UTC | #3
On Thu, May 30, 2024 at 03:14:32AM -0400, Jeff King wrote:
> On Wed, May 29, 2024 at 06:55:15PM -0400, Taylor Blau wrote:
> 
> > This is a small reroll of my series which has a grab-bag of midx-write
> > related cleanups that I pulled out of a larger series to implement
> > incremental MIDX chains.
> 
> These all look pretty reasonable to me. Thanks for breaking them off of
> the larger series. I think it's generally nice to get things in smaller
> chunks.
> 
> Sometimes it is a little tough to evaluate refactorings without seeing
> the larger context in which they'd be used. But all of these were either
> immediate improvements, or didn't take much imagination to see where
> they'd make later things easier.

I've read through the range-diff and didn't spot anything unexpected
there. So this version looks good from my point of view, thanks.

And I very much agree that it's nice to split out smaller topics like
this. I think it's okay to send patch series that have dozens of commits
when most of the commits are trivial. But the changes in the MIDX aren't
that and require the reviewer to dive deep. For me this has the result
that I need an hour or more on such large patch series to review them.
And because I often do not find the time for that I push them onto my
pile of shame of stuff that I do want to review.

I get though that this isn't always easy, and in any case it's a
tradeoff.

Patrick