mbox series

[v2,00/17] pack-objects: add --path-walk option for better deltas

Message ID pull.1813.v2.git.1729431810.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series pack-objects: add --path-walk option for better deltas | expand

Message

Philippe Blain via GitGitGadget Oct. 20, 2024, 1:43 p.m. UTC
This is a reviewable series introducing the path-walk API and its
application within the 'git pack-objects --path-walk' machinery. This API
was previously discussed in the path-walk RFC [1] and the patch series
around the --full-name-hash option (branch ds/pack-name-hash-tweak) [2].
This series conflicts with ds/pack-name-hash-tweak, but that was on hold
because it did not seem as critical based on community interest.

[1]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com

[2]
https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com

The primary motivation for this feature is its use to shrink the packfile
created by 'git push' when there are many name-hash collisions. This need
was discovered in several Javascript repositories that use the beachball
tool [3] to generate CHANGELOG.md and CHANGELOG.json files. When a batch of
these files are created at the same time and pushed to a release branch, the
'git pack-objects' process has many collisions among these files and delta
bases are selected poorly.

[3] https://github.com/microsoft/beachball

In some cases, 'git push' is pushing 60-100 MB when the new path-walk
algorithm will identify better delta bases and pack the same objects into a
thin pack less than 1 MB. This was the most extreme example we could find
and is present in a private monorepo. However, the microsoft/fluentui repo
[4] is a good example for demonstrating similar improvements. The patch
descriptions frequently refer to this repo and which commit should be
checked out to reproduce this behavior.

[4] https://github.com/microsoft/fluentui

The path-walk API is a new way to walk objects. Traditionally, the revision
API walks objects by visiting a commit, then visiting its root tree and
recursing through trees to visit reachable objects that were not previously
visited. The path-walk API visits objects in batches based on the path
walked from a root tree to that object. (Only the first discovered path is
chosen; this avoids certain kinds of Git bombs.)

This has an immediate application to 'git pack-objects'.

When using the traditional revision API to walk objects, each object is
emitted with an associated path. Since this path may appear for many objects
spread across the full list, a heuristic is used: the "name-hash" is stored
for that object instead of the full path name. This name-hash will group
objects at the same path together, but also has a form of "locality" to
group likely-similar objects together. When there are few collisions in the
name-hash function, this helps compress objects that appear at the same path
as well as help compress objects across different paths that have similar
suffixes. When there are many versions of the same path, then finding delta
bases across that family of objects is very important. When there are few
versions of the same path, then finding cross-name delta bases is also
important. The former is important for clones and repacks while the latter
is important for shallow clones. They can both be important for pushes. In
all cases, this approach is fraught when there are many name-hash collisions
as the window size becomes a limiting factor for finding quality delta
bases.

When using the path-walk API to walk objects, we group objects by the same
path from the start. We don't need to store the path name, since we have the
objects already in a group. We can compute deltas within that group and then
use the name-hash approach to resort the object list and look for
opportunistic cross-path deltas. Thus, the path-walk approach allows finding
delta bases at least as good as the traditional revision API approach.
(Caveat: if we assume delta reuse and the existing deltas were computed with
the revision API approach, then the path-walk API approach may result in
slightly worse delta compression. The performance tests in this series use
--no-reuse-delta for this reason.)

Once 'git pack-objects --path-walk' exists, we have a few ways to take
advantage of it in other places:

 * The new 'pack.usePathWalk' config option will assume the --path-walk
   option. This allows 'git push' to assume this value and get the effect we
   want. This is similar to the 'pack.useSparse' config option that uses a
   similar path-based walk to limit the set of boundary objects.

 * 'git repack' learns a '--path-walk' option to pass to its child 'git
   pack-objects' process. This is also implied by 'pack.usePathWalk' but
   allows for testing without modifying repository config.

I'll repeat the following table of repacking results when using 'git repack
-adf [--path-walk]' on a set of repositories with many name-hash collisions.
Only the microsoft/fluentui repository is publicly available for testing, so
the others are left as Repo B/C/D.

| Repo     | Standard Repack | With --path-walk |
|----------|-----------------|------------------|
| fluentui |         438 MB  |          148 MB  |
| Repo B   |       6,255 MB  |          778 MB  |
| Repo C   |      37,737 MB  |        6,158 MB  |
| Repo D   |     130,049 MB  |        4,432 MB  |


While this series is replacing ds/pack-name-hash-tweak and its introduction
of the --full-name-hash option, it is worth comparing that option to the
--path-walk option.

 * The --full-name-hash option is a much smaller code change, as it drops
   into the existing uses of the name-hash function.

 * The --full-name-hash option is more likely to integrate with server-side
   features such as delta islands and reachability bitmaps due to not
   changing the object walk. It was already noted that the .bitmap files
   store name-hash values, so there is some compatibility required to
   integrate with bitmaps. The --path-walk option will be more difficult to
   integrate with those options (at least during a repack), but maybe is not
   impossible; the --path-walk option will not work when reading
   reachability bitmaps, since we are avoiding walking trees entirely.

 * The --full-name-hash option is good when there are many name-hash
   collisions and many versions of the paths with those collisions. When
   creating a shallow clone or certain kinds of pushes, the --full-name-hash
   option is much worse at finding cross-path delta bases since it loses the
   locality of the standard name-hash function. The --path-walk option
   includes a second pass of delta computation using the standard name-hash
   function and thus finds good cross-path delta bases when they improve
   upon the same-path delta bases.

There are a few differences from the RFC version of this series:

 1. The last two patches refactor the approach to perform delta calculations
    by path after the object walk and then allows those delta calculations
    to happen in a threaded manner.

 2. Both 'git pack-objects' and 'git repack' are removed from the t0450
    exclusion list.

 3. The path-walk API has improved technical documentation that is extended
    as its functionality is expanded.

 4. Various bugs have been patched with matching tests. The new 'test-tool
    path-walk' helper allows for careful testing of the API separately from
    its use within other commands.


Updates in v2
=============

I'm sending this v2 to request some review feedback on the series. I'm sorry
it's so long.

There are two updates in this version:

 * Fixed a performance issue in the presence of many annotated tags. This is
   caught by p5313 when run on a repo with 10,000+ annotated tags.
 * The Scalar config was previously wrong and should be pack.usePathWalk,
   not push.usePathWalk.

Thanks, - Stolee

Derrick Stolee (17):
  path-walk: introduce an object walk by path
  t6601: add helper for testing path-walk API
  path-walk: allow consumer to specify object types
  path-walk: allow visiting tags
  revision: create mark_trees_uninteresting_dense()
  path-walk: add prune_all_uninteresting option
  pack-objects: extract should_attempt_deltas()
  pack-objects: add --path-walk option
  pack-objects: update usage to match docs
  p5313: add performance tests for --path-walk
  pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  repack: add --path-walk option
  repack: update usage to match docs
  pack-objects: enable --path-walk via config
  scalar: enable path-walk during push via config
  pack-objects: refactor path-walk delta phase
  pack-objects: thread the path-based compression

 Documentation/config/feature.txt          |   4 +
 Documentation/config/pack.txt             |   8 +
 Documentation/git-pack-objects.txt        |  23 +-
 Documentation/git-repack.txt              |  17 +-
 Documentation/technical/api-path-walk.txt |  73 ++++
 Makefile                                  |   2 +
 builtin/pack-objects.c                    | 410 ++++++++++++++++++++--
 builtin/repack.c                          |   9 +-
 ci/run-build-and-tests.sh                 |   1 +
 pack-objects.h                            |  12 +
 path-walk.c                               | 406 +++++++++++++++++++++
 path-walk.h                               |  64 ++++
 repo-settings.c                           |   3 +
 repo-settings.h                           |   1 +
 revision.c                                |  15 +
 revision.h                                |   1 +
 scalar.c                                  |   1 +
 t/README                                  |   4 +
 t/helper/test-path-walk.c                 | 114 ++++++
 t/helper/test-tool.c                      |   1 +
 t/helper/test-tool.h                      |   1 +
 t/perf/p5313-pack-objects.sh              |  77 ++++
 t/t0411-clone-from-partial.sh             |   6 +
 t/t0450/txt-help-mismatches               |   2 -
 t/t5300-pack-object.sh                    |  21 ++
 t/t5306-pack-nobase.sh                    |   5 +
 t/t5310-pack-bitmaps.sh                   |  13 +-
 t/t5316-pack-delta-depth.sh               |   9 +-
 t/t5332-multi-pack-reuse.sh               |   7 +
 t/t6601-path-walk.sh                      | 301 ++++++++++++++++
 t/t7406-submodule-update.sh               |   4 +
 31 files changed, 1565 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/technical/api-path-walk.txt
 create mode 100644 path-walk.c
 create mode 100644 path-walk.h
 create mode 100644 t/helper/test-path-walk.c
 create mode 100755 t/perf/p5313-pack-objects.sh
 create mode 100755 t/t6601-path-walk.sh


base-commit: e9356ba3ea2a6754281ff7697b3e5a1697b21e24
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1813%2Fderrickstolee%2Fpath-walk-upstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1813/derrickstolee/path-walk-upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1813

Range-diff vs v1:

  1:  98bdc94a773 =  1:  98bdc94a773 path-walk: introduce an object walk by path
  2:  a00ab0c62c9 =  2:  a00ab0c62c9 t6601: add helper for testing path-walk API
  3:  14375d19392 =  3:  14375d19392 path-walk: allow consumer to specify object types
  4:  6f48cddadc0 !  4:  c321f58c62d path-walk: allow visiting tags
     @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
      +			if (obj->type == OBJ_COMMIT || obj->flags & SEEN)
      +				continue;
      +
     -+			obj->flags |= SEEN;
     -+
      +			while (obj->type == OBJ_TAG) {
      +				struct tag *tag = lookup_tag(info->revs->repo,
      +							     &obj->oid);
     -+				if (oid_array_lookup(&tags, &obj->oid) < 0)
     ++				if (!(obj->flags & SEEN)) {
     ++					obj->flags |= SEEN;
      +					oid_array_append(&tags, &obj->oid);
     ++				}
      +				obj = tag->tagged;
      +			}
      +
     ++			if ((obj->flags & SEEN))
     ++				continue;
     ++			obj->flags |= SEEN;
     ++
      +			switch (obj->type) {
      +			case OBJ_TREE:
     -+				if (info->trees &&
     -+				    oid_array_lookup(&root_tree_list->oids, &obj->oid) < 0)
     ++				if (info->trees)
      +					oid_array_append(&root_tree_list->oids, &obj->oid);
      +				break;
      +
      +			case OBJ_BLOB:
     -+				if (info->blobs &&
     -+				    oid_array_lookup(&tagged_blob_list, &obj->oid) < 0)
     ++				if (info->blobs)
      +					oid_array_append(&tagged_blob_list, &obj->oid);
      +				break;
      +
  5:  cd98447f7c8 =  5:  6e89fb219b5 revision: create mark_trees_uninteresting_dense()
  6:  214e10a9984 =  6:  238d7d95715 path-walk: add prune_all_uninteresting option
  7:  cd360ad1040 =  7:  3fdb57edbc5 pack-objects: extract should_attempt_deltas()
  8:  f8ee11d3003 =  8:  a0475c7cba8 pack-objects: add --path-walk option
  9:  eaeb40980f4 =  9:  73c8b61e87b pack-objects: update usage to match docs
 10:  3113ead1e01 = 10:  21dc3723c36 p5313: add performance tests for --path-walk
 11:  211a16ae889 = 11:  6f96b1c227a pack-objects: introduce GIT_TEST_PACK_PATH_WALK
 12:  507ed0f6f90 = 12:  834c9ea2709 repack: add --path-walk option
 13:  eae96e8214f = 13:  6ef8d67af4b repack: update usage to match docs
 14:  2a9ae02217f = 14:  1db90e361ba pack-objects: enable --path-walk via config
 15:  adcb3167809 ! 15:  0f3040b4b90 scalar: enable path-walk during push via config
     @@ scalar.c: static int set_recommended_config(int reconfigure)
       		{ "core.autoCRLF", "false" },
       		{ "core.safeCRLF", "false" },
       		{ "fetch.showForcedUpdates", "false" },
     -+		{ "push.usePathWalk", "true" },
     ++		{ "pack.usePathWalk", "true" },
       		{ NULL, NULL },
       	};
       	int i;
 16:  c3f24dc3647 = 16:  030d8ec238e pack-objects: refactor path-walk delta phase
 17:  264affbf058 = 17:  fddc320eb0b pack-objects: thread the path-based compression

Comments

Taylor Blau Oct. 21, 2024, 9:43 p.m. UTC | #1
On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> Updates in v2
> =============
>
> I'm sending this v2 to request some review feedback on the series. I'm sorry
> it's so long.
>
> There are two updates in this version:
>
>  * Fixed a performance issue in the presence of many annotated tags. This is
>    caught by p5313 when run on a repo with 10,000+ annotated tags.
>  * The Scalar config was previously wrong and should be pack.usePathWalk,
>    not push.usePathWalk.

Thanks. I queued the new round. As an aside, I would like to find the
time to review this series in depth, but haven't been able to do so
while also trying to keep up with the volume of the rest of the list.

I know that this topic was split out of a larger one. It may be worth
seeing if there is a way to split this topic out into multiple series
that are more easily review-able, but still sensible on their own.

I haven't looked in enough depth to know myself whether such a cut
exists, but it is worth thinking about if you haven't done so already.

Thanks,
Taylor
Derrick Stolee Oct. 24, 2024, 1:29 p.m. UTC | #2
On 10/21/24 5:43 PM, Taylor Blau wrote:
> On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote:
>> Updates in v2
>> =============
>>
>> I'm sending this v2 to request some review feedback on the series. I'm sorry
>> it's so long.
>>
>> There are two updates in this version:
>>
>>   * Fixed a performance issue in the presence of many annotated tags. This is
>>     caught by p5313 when run on a repo with 10,000+ annotated tags.
>>   * The Scalar config was previously wrong and should be pack.usePathWalk,
>>     not push.usePathWalk.
> 
> Thanks. I queued the new round. As an aside, I would like to find the
> time to review this series in depth, but haven't been able to do so
> while also trying to keep up with the volume of the rest of the list.
> 
> I know that this topic was split out of a larger one. It may be worth
> seeing if there is a way to split this topic out into multiple series
> that are more easily review-able, but still sensible on their own.

I'll see what I can do. I needed to re-roll after discovering an issue
when trying to integrate the algorithm with shallow clones. The solution
ends up simplifying the code, which is nice.

> I haven't looked in enough depth to know myself whether such a cut
> exists, but it is worth thinking about if you haven't done so already.

In the current series, there's a natural cut between patches 1-4
and the rest, if we want to put the API in without a non-test consumer.

I could also split out the 'git repack' changes into a third series.

Finally, the threading implementation could be done separately, but I
think it's not complicated enough to leave out from the first version
of the --path-walk option in 'git pack-objects'.

Thanks,
-Stolee
Taylor Blau Oct. 24, 2024, 3:52 p.m. UTC | #3
On Thu, Oct 24, 2024 at 09:29:02AM -0400, Derrick Stolee wrote:
> > I know that this topic was split out of a larger one. It may be worth
> > seeing if there is a way to split this topic out into multiple series
> > that are more easily review-able, but still sensible on their own.
>
> I'll see what I can do. I needed to re-roll after discovering an issue
> when trying to integrate the algorithm with shallow clones. The solution
> ends up simplifying the code, which is nice.

It's always nice when that happens :-).

Should we avoid reviewing the current round in anticipation of a
somewhat restructured series, or would you like us to review the current
round as well?

> > I haven't looked in enough depth to know myself whether such a cut
> > exists, but it is worth thinking about if you haven't done so already.
>
> In the current series, there's a natural cut between patches 1-4
> and the rest, if we want to put the API in without a non-test consumer.
>
> I could also split out the 'git repack' changes into a third series.
>
> Finally, the threading implementation could be done separately, but I
> think it's not complicated enough to leave out from the first version
> of the --path-walk option in 'git pack-objects'.

I'd suggest erring on the side of more smaller series rather than a
single large one. If you feel like there are cut points where we can
review them in isolation and still see some benefit, or at least
clearly how they each fit into the larger puzzle, I think that is worth
doing.

But I trust your judgement here, so if you think that the series is best
reviewed as a whole, then that's fine too. Just my $.02 :-).

Thanks,
Taylor
Patrick Steinhardt Oct. 28, 2024, 5:46 a.m. UTC | #4
On Mon, Oct 21, 2024 at 05:43:48PM -0400, Taylor Blau wrote:
> On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> > Updates in v2
> > =============
> >
> > I'm sending this v2 to request some review feedback on the series. I'm sorry
> > it's so long.
> >
> > There are two updates in this version:
> >
> >  * Fixed a performance issue in the presence of many annotated tags. This is
> >    caught by p5313 when run on a repo with 10,000+ annotated tags.
> >  * The Scalar config was previously wrong and should be pack.usePathWalk,
> >    not push.usePathWalk.
> 
> Thanks. I queued the new round. As an aside, I would like to find the
> time to review this series in depth, but haven't been able to do so
> while also trying to keep up with the volume of the rest of the list.
> 
> I know that this topic was split out of a larger one. It may be worth
> seeing if there is a way to split this topic out into multiple series
> that are more easily review-able, but still sensible on their own.
> 
> I haven't looked in enough depth to know myself whether such a cut
> exists, but it is worth thinking about if you haven't done so already.

I'm in the same boat -- I want to review this, but somehow never find
the time to sit down and do it. I definitely won't get to it this week
as I'll be out-of-office for most of the part.

I've flagged this internally now at GitLab so that we can provide some
more data with some of the repos that are on the bigger side to check
whether we can confirm the findings and to prioritize its review.

Patrick
Taylor Blau Oct. 28, 2024, 4:47 p.m. UTC | #5
On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote:
> I've flagged this internally now at GitLab so that we can provide some
> more data with some of the repos that are on the bigger side to check
> whether we can confirm the findings and to prioritize its review.

I suspect that you'll end up measuring no change assuming that you
(AFAIK) use bitmaps and (I imagine) delta islands in your production
configuration? This series is not compatible with either of those
features to my knowledge.

Thanks,
Taylor
Derrick Stolee Oct. 28, 2024, 5:13 p.m. UTC | #6
On 10/28/24 12:47 PM, Taylor Blau wrote:
> On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote:
>> I've flagged this internally now at GitLab so that we can provide some
>> more data with some of the repos that are on the bigger side to check
>> whether we can confirm the findings and to prioritize its review.
> 
> I suspect that you'll end up measuring no change assuming that you
> (AFAIK) use bitmaps and (I imagine) delta islands in your production
> configuration? This series is not compatible with either of those
> features to my knowledge.
You are correct that this is not compatible with those features as-is.
_Maybe_ there is potential to integrate them in the future, but that
would require better understanding of whether the new compression
mechanism valuable in enough cases (final storage size or maybe even
in repacking time).

At the very least, it would be helpful if some other large repos were
tested to see how commonly this could help client-side users. Are
there other aspects to a repo's structure that could be important to
how effective this approach is?

Thanks,
-Stolee
Taylor Blau Oct. 28, 2024, 5:25 p.m. UTC | #7
On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote:
> On 10/28/24 12:47 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote:
> > > I've flagged this internally now at GitLab so that we can provide some
> > > more data with some of the repos that are on the bigger side to check
> > > whether we can confirm the findings and to prioritize its review.
> >
> > I suspect that you'll end up measuring no change assuming that you
> > (AFAIK) use bitmaps and (I imagine) delta islands in your production
> > configuration? This series is not compatible with either of those
> > features to my knowledge.
> You are correct that this is not compatible with those features as-is.
> _Maybe_ there is potential to integrate them in the future, but that
> would require better understanding of whether the new compression
> mechanism valuable in enough cases (final storage size or maybe even
> in repacking time).

I think the bitmap thing is not too big of a hurdle. The .bitmap file is
the only spot we store name-hash values on-disk in the "hashcache"
extension.

Unfortunately, there is no easy way to reuse the format of the existing
hashcache extension as-is to indicate to the reader whether they are
recording traditional name-hash values, or the new --path-walk hash
values.

I suspect that you could either add a new extension for --path-walk hash
values, or add a new variant of the hashcache extension that has a flag
to indicate what kind of hash value it's recording.

Of the two, I think the latter is preferred, since it would allow us to
grow new hash functions on paths in the future without needing to add an
additional extension (only a new bit in the existing one).

> At the very least, it would be helpful if some other large repos were
> tested to see how commonly this could help client-side users. Are
> there other aspects to a repo's structure that could be important to
> how effective this approach is?

What measurements are you looking for here? I thought that you had
already done an extensive job of measuring the client-side impact of
pushing smaller packs and faster local repacks, no?

Thanks,
Taylor
Derrick Stolee Oct. 28, 2024, 7:46 p.m. UTC | #8
On 10/28/24 1:25 PM, Taylor Blau wrote:
> On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote:

>> You are correct that this is not compatible with those features as-is.
>> _Maybe_ there is potential to integrate them in the future, but that
>> would require better understanding of whether the new compression
>> mechanism valuable in enough cases (final storage size or maybe even
>> in repacking time).
> 
> I think the bitmap thing is not too big of a hurdle. The .bitmap file is
> the only spot we store name-hash values on-disk in the "hashcache"
> extension.
> 
> Unfortunately, there is no easy way to reuse the format of the existing
> hashcache extension as-is to indicate to the reader whether they are
> recording traditional name-hash values, or the new --path-walk hash
> values.

The --path-walk option does not mess with the name-hash. You're thinking
of the --full-name-hash feature [1] that was pulled out due to a lack of
interest (and better results with --path-walk).

[1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/

>> At the very least, it would be helpful if some other large repos were
>> tested to see how commonly this could help client-side users. Are
>> there other aspects to a repo's structure that could be important to
>> how effective this approach is?
> 
> What measurements are you looking for here? I thought that you had
> already done an extensive job of measuring the client-side impact of
> pushing smaller packs and faster local repacks, no?
I've done what I can with the repos I know about, but perhaps other
folks have other repos they like to test that might present new
aspects to the problem.

For example, a colleague was testing this in a variety of Javascript
repos and found that the node repo [2] was slightly worse with the
--path-walk option. I've since discovered that this is only true when
using a checked-out copy and the .git/index file is iterated, as some
large source files with few versions become split across the boundary
of "in the index" or "in commit history". (I am fixing this aspect as
well in the next iteration, hence some reason for its delay.)

[2] https://github.com/nodejs/node

Thanks,
-Stolee
Taylor Blau Oct. 29, 2024, 6:02 p.m. UTC | #9
On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote:
> On 10/28/24 1:25 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote:
>
> > > You are correct that this is not compatible with those features as-is.
> > > _Maybe_ there is potential to integrate them in the future, but that
> > > would require better understanding of whether the new compression
> > > mechanism valuable in enough cases (final storage size or maybe even
> > > in repacking time).
> >
> > I think the bitmap thing is not too big of a hurdle. The .bitmap file is
> > the only spot we store name-hash values on-disk in the "hashcache"
> > extension.
> >
> > Unfortunately, there is no easy way to reuse the format of the existing
> > hashcache extension as-is to indicate to the reader whether they are
> > recording traditional name-hash values, or the new --path-walk hash
> > values.
>
> The --path-walk option does not mess with the name-hash. You're thinking
> of the --full-name-hash feature [1] that was pulled out due to a lack of
> interest (and better results with --path-walk).
>
> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/

Ah, gotcha. Thanks for clarifying.

What is the incompatibility between the two, then? Is it just that
bitmaps give us the objects in pack- or pseudo-pack order, and we don't
have a way to permute that back into the order that --path-walk would
give us?

If so, a couple of thoughts:

  - You could consider storing the path information for each blob and
    tree object in the bitmap using a trie-like structure. This would
    give you enough information to reconstruct the path-walk order (I
    think) at read-time, but at significant cost in terms of the on-disk
    size of the .bitmap.

  - Alternatively, if you construct the bitmap from a pack or packs that
    were generated in path-walk order, then you could store a
    permutation between pack order and path-walk order in the bitmap
    itself.

  - Alternatively still: if the actual pack *order* were dictated solely
    by path-walk, then neither of these would be necessary.

That all said, I'm still not sure that there is a compatibility blocker
here. Is the goal is to ensure that packs generated with
--use-bitmap-index are still compact in the same way that they would be
with your new --path-walk option?

If so, I think matching the object order in a pack to the path walk
order would achieve that goal, since the chunks that you end up reusing
verbatim as a result of pack-reuse driven by the bitmap would already be
delta-ified according to the --path-walk rules, so the resulting pack
would appear similarly.

OTOH, the order in which we pack objects is extremely important to
performance as you no doubt are aware of. So changing that order to more
closely match the --path-walk option should be done with great care.

Anyway. All of that is to say that I want to better understand what does
and doesn't work together between bitmaps and path-walk. Given my
current understanding, it seems there are a couple of approaches to
unifying these two things together, so it would be nice to be able to
do so if possible.

Thanks,
Taylor
Derrick Stolee Oct. 31, 2024, 2:28 a.m. UTC | #10
On 10/29/24 2:02 PM, Taylor Blau wrote:
 > On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote:
 >> On 10/28/24 1:25 PM, Taylor Blau wrote:
 >>> Unfortunately, there is no easy way to reuse the format of the existing
 >>> hashcache extension as-is to indicate to the reader whether they are
 >>> recording traditional name-hash values, or the new --path-walk hash
 >>> values.
 >>
 >> The --path-walk option does not mess with the name-hash. You're thinking
 >> of the --full-name-hash feature [1] that was pulled out due to a lack of
 >> interest (and better results with --path-walk).
 >>
 >> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/
 >
 > Ah, gotcha. Thanks for clarifying.
 >
 > What is the incompatibility between the two, then? Is it just that
 > bitmaps give us the objects in pack- or pseudo-pack order, and we don't
 > have a way to permute that back into the order that --path-walk would
 > give us?

The incompatibility of reading bitmaps and using the path-walk API is
that the path-walk API does not check a bitmap to see if an object is
already discovered. Thus, it does not use the reachability information
from the bitmap at all and would parse commits and trees to find the
objects that should be in the pack-file.

It should also be worth noting that using something like 'git repack
--path-walk' does not mean that future 'git pack-objects' executions
from that packfile data need to use the --path-walk option. I expect
that it should be painless to write bitmaps on top of a packfile created
with 'git repack -adf --path-walk', but since most places doing so also
likely want delta islands, I have not explored this option thoroughly.

(Delta islands are their own challenge, since the path-walk API is not
spreading the reachability information across the objects it walks.
However, this could be remedied by doing a separate walk to identify
islands using the normal method. I believe Peff had an idea in that
direction in another thread. This requires some integration and testing
that I don't have the expertise to provide.)

 > If so, a couple of thoughts:
 > ...

Since the incompatibility is in a different direction, I don't think
these thoughts were relevant to the problem.

 > OTOH, the order in which we pack objects is extremely important to
 > performance as you no doubt are aware of. So changing that order to more
 > closely match the --path-walk option should be done with great care.

This is a place where I'm unsure about how the --path-walk option adjusts
the object order within the pack. The packing list gets resorted to match
the typical method, at least for how the delta compression window works.

This would be another good reason to consider the --path-walk option in
server environments very carefully. My patch series puts up guard rails
specifically because it makes no claim to be effective in all of the
dimensions that matter for those scenarios. Hopefully, others will be
motivated enough to determine if the compression that's possible with
this algorithm could be achieved in a way that is compatible with server
needs.

 > Anyway. All of that is to say that I want to better understand what does
 > and doesn't work together between bitmaps and path-walk. Given my
 > current understanding, it seems there are a couple of approaches to
 > unifying these two things together, so it would be nice to be able to
 > do so if possible.

I think this is an excellent opportunity for testing and debugging to
build up more intuition with how the path-walk API works. When I submit
the next version later tonight, the path-walk algorithm will be better
documented.

That said, I don't have any personal motivation to integrate the two
together, so I don't expect to be contributing that integration point
myself. I think that the results speak for themselves in the very
common environment of a Git client without bitmaps.

Thanks,
-Stolee
Taylor Blau Oct. 31, 2024, 9:07 p.m. UTC | #11
On Wed, Oct 30, 2024 at 10:28:22PM -0400, Derrick Stolee wrote:
> On 10/29/24 2:02 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote:
> >> On 10/28/24 1:25 PM, Taylor Blau wrote:
> >>> Unfortunately, there is no easy way to reuse the format of the existing
> >>> hashcache extension as-is to indicate to the reader whether they are
> >>> recording traditional name-hash values, or the new --path-walk hash
> >>> values.
> >>
> >> The --path-walk option does not mess with the name-hash. You're thinking
> >> of the --full-name-hash feature [1] that was pulled out due to a lack of
> >> interest (and better results with --path-walk).
> >>
> >> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/
> >
> > Ah, gotcha. Thanks for clarifying.
> >
> > What is the incompatibility between the two, then? Is it just that
> > bitmaps give us the objects in pack- or pseudo-pack order, and we don't
> > have a way to permute that back into the order that --path-walk would
> > give us?
>
> The incompatibility of reading bitmaps and using the path-walk API is
> that the path-walk API does not check a bitmap to see if an object is
> already discovered. Thus, it does not use the reachability information
> from the bitmap at all and would parse commits and trees to find the
> objects that should be in the pack-file.

Sure, I think what I'm trying to understand here is whether this
"incapability" is a fundamental one, or just that we haven't implemented
checking bitmaps in the path-walk API yet.

Thanks,
Taylor