Message ID | pull.1823.v2.git.1733181682.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | pack-objects: Create an alternative name hash algorithm (recreated) | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > This series creates a mechanism to select alternative name hashes using a > new --name-hash-version=<n> option. The versions are: > > 1. Version 1 is the default name hash that already exists. This option > focuses on the final bytes of the path to maximize locality for > cross-path deltas. > > 2. Version 2 is the new path-component hash function suggested by Jonathan > Tan in the previous version (with some modifications). This hash > function essentially computes the v1 name hash of each path component > and then overlays those hashes with a shift to make the parent > directories contribute less to the final hash, but enough to break many > collisions that exist in v1. > > 3. Version 3 is the hash function that I submitted under the > --full-name-hash feature in the previous versions. This uses a > pseudorandom hash procedure to minimize collisions but at the expense of > losing on locality. This version is implemented in the final patch of > the series mostly for comparison purposes, as it is unlikely to be > selected as a valuable hash function over v2. The final patch could be > omitted from the merged version. > > See the patches themselves for detailed results in the p5313-pack-objects.sh > performance test and the p5314-name-hash.sh test that demonstrates how many > collisions occur with each hash function. These do not sound like versions but more like variants to me, especially if one is expected to perform better than another in some cases and worse in some other cases. Is it expected that JTan's hash to perform better than the original and current hash in almost all cases (I would not be surprised at all if that were the case)? > In general, the v2 name hash function gets very close to the compression > results of v3 in the full repack case, even in the repositories that feature > many name hash collisions. These benefits come as well without downsides to > other kinds of packfiles, including small pushed packs, larger incremental > fetch packs, and shallow clones. Nice. > As we can see, v2 nearly reaches the effectiveness of v3 (and outperforms it > once!) but there is still a significant change between the > --name-hash-version feature and the --path-walk feature. > > The main reason we are considering this --name-hash-version feature is that > it has the least amount of stretch required in order for it to be integrated > with reachability bitmaps, required for server environments. In fact, the > change in this version to use a numerical version makes it more obvious how > to connect the version number to a value in the .bitmap file format. Tests > are added to guarantee that the hash functions preserve their behavior over > time, since data files depend on that. Yeah, that aspect certainly is an attractive one. We should be able to teach the bitmap file to say which name-hash function is in use and all the pack data reuse logic we have should be reusable.
On 12/2/24 10:23 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> This series creates a mechanism to select alternative name hashes using a >> new --name-hash-version=<n> option. The versions are: >> >> 1. Version 1 is the default name hash that already exists. This option >> focuses on the final bytes of the path to maximize locality for >> cross-path deltas. >> >> 2. Version 2 is the new path-component hash function suggested by Jonathan >> Tan in the previous version (with some modifications). This hash >> function essentially computes the v1 name hash of each path component >> and then overlays those hashes with a shift to make the parent >> directories contribute less to the final hash, but enough to break many >> collisions that exist in v1. >> >> 3. Version 3 is the hash function that I submitted under the >> --full-name-hash feature in the previous versions. This uses a >> pseudorandom hash procedure to minimize collisions but at the expense of >> losing on locality. This version is implemented in the final patch of >> the series mostly for comparison purposes, as it is unlikely to be >> selected as a valuable hash function over v2. The final patch could be >> omitted from the merged version. >> >> See the patches themselves for detailed results in the p5313-pack-objects.sh >> performance test and the p5314-name-hash.sh test that demonstrates how many >> collisions occur with each hash function. > > These do not sound like versions but more like variants to me, > especially if one is expected to perform better than another in some > cases and worse in some other cases. Is it expected that JTan's hash > to perform better than the original and current hash in almost all > cases (I would not be surprised at all if that were the case)? There are some cases, such as the Linux kernel repo, that have slightly worse compression using JTan's hash. But the naming conventions in that repo are such that the v1 name hash was already pretty effective for that repo. The Git repository has similar issues. See Patch 5 for detailed analysis of these scenarios using the p5313-pack-objects.sh test script. It may be possible to adapt some of the collision rate analysis from the test helper in Patch 6 to create a tool that recommends or dynamically selects the hash function that works best for the repo. Such a feature should be delayed until this code is exercised in more places. Thanks, -Stolee
Derrick Stolee <stolee@gmail.com> writes: >> These do not sound like versions but more like variants to me, >> especially if one is expected to perform better than another in some >> cases and worse in some other cases. Is it expected that JTan's hash >> to perform better than the original and current hash in almost all >> cases (I would not be surprised at all if that were the case)? > > There are some cases, such as the Linux kernel repo, that have slightly > worse compression using JTan's hash. But the naming conventions in that > repo are such that the v1 name hash was already pretty effective for > that repo. The Git repository has similar issues. See Patch 5 for > detailed analysis of these scenarios using the p5313-pack-objects.sh > test script. So it indeed is more "variant" than "version" where v(N+1) is almost always an implementation of a better idea than vN. > It may be possible to adapt some of the collision rate analysis from > the test helper in Patch 6 to create a tool that recommends or > dynamically selects the hash function that works best for the repo. > Such a feature should be delayed until this code is exercised in more > places. Surely. But that is more advanced feature that can wait. It certainly has to wait until we have a foundation to use more than one variant safely, which this series lays a good foundation for. Thanks.