Message ID | pull.1813.v2.git.1729431810.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | pack-objects: add --path-walk option for better deltas | expand |
On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote: > Updates in v2 > ============= > > I'm sending this v2 to request some review feedback on the series. I'm sorry > it's so long. > > There are two updates in this version: > > * Fixed a performance issue in the presence of many annotated tags. This is > caught by p5313 when run on a repo with 10,000+ annotated tags. > * The Scalar config was previously wrong and should be pack.usePathWalk, > not push.usePathWalk. Thanks. I queued the new round. As an aside, I would like to find the time to review this series in depth, but haven't been able to do so while also trying to keep up with the volume of the rest of the list. I know that this topic was split out of a larger one. It may be worth seeing if there is a way to split this topic out into multiple series that are more easily review-able, but still sensible on their own. I haven't looked in enough depth to know myself whether such a cut exists, but it is worth thinking about if you haven't done so already. Thanks, Taylor
On 10/21/24 5:43 PM, Taylor Blau wrote: > On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote: >> Updates in v2 >> ============= >> >> I'm sending this v2 to request some review feedback on the series. I'm sorry >> it's so long. >> >> There are two updates in this version: >> >> * Fixed a performance issue in the presence of many annotated tags. This is >> caught by p5313 when run on a repo with 10,000+ annotated tags. >> * The Scalar config was previously wrong and should be pack.usePathWalk, >> not push.usePathWalk. > > Thanks. I queued the new round. As an aside, I would like to find the > time to review this series in depth, but haven't been able to do so > while also trying to keep up with the volume of the rest of the list. > > I know that this topic was split out of a larger one. It may be worth > seeing if there is a way to split this topic out into multiple series > that are more easily review-able, but still sensible on their own. I'll see what I can do. I needed to re-roll after discovering an issue when trying to integrate the algorithm with shallow clones. The solution ends up simplifying the code, which is nice. > I haven't looked in enough depth to know myself whether such a cut > exists, but it is worth thinking about if you haven't done so already. In the current series, there's a natural cut between patches 1-4 and the rest, if we want to put the API in without a non-test consumer. I could also split out the 'git repack' changes into a third series. Finally, the threading implementation could be done separately, but I think it's not complicated enough to leave out from the first version of the --path-walk option in 'git pack-objects'. Thanks, -Stolee
On Thu, Oct 24, 2024 at 09:29:02AM -0400, Derrick Stolee wrote: > > I know that this topic was split out of a larger one. It may be worth > > seeing if there is a way to split this topic out into multiple series > > that are more easily review-able, but still sensible on their own. > > I'll see what I can do. I needed to re-roll after discovering an issue > when trying to integrate the algorithm with shallow clones. The solution > ends up simplifying the code, which is nice. It's always nice when that happens :-). Should we avoid reviewing the current round in anticipation of a somewhat restructured series, or would you like us to review the current round as well? > > I haven't looked in enough depth to know myself whether such a cut > > exists, but it is worth thinking about if you haven't done so already. > > In the current series, there's a natural cut between patches 1-4 > and the rest, if we want to put the API in without a non-test consumer. > > I could also split out the 'git repack' changes into a third series. > > Finally, the threading implementation could be done separately, but I > think it's not complicated enough to leave out from the first version > of the --path-walk option in 'git pack-objects'. I'd suggest erring on the side of more smaller series rather than a single large one. If you feel like there are cut points where we can review them in isolation and still see some benefit, or at least clearly how they each fit into the larger puzzle, I think that is worth doing. But I trust your judgement here, so if you think that the series is best reviewed as a whole, then that's fine too. Just my $.02 :-). Thanks, Taylor
On Mon, Oct 21, 2024 at 05:43:48PM -0400, Taylor Blau wrote: > On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote: > > Updates in v2 > > ============= > > > > I'm sending this v2 to request some review feedback on the series. I'm sorry > > it's so long. > > > > There are two updates in this version: > > > > * Fixed a performance issue in the presence of many annotated tags. This is > > caught by p5313 when run on a repo with 10,000+ annotated tags. > > * The Scalar config was previously wrong and should be pack.usePathWalk, > > not push.usePathWalk. > > Thanks. I queued the new round. As an aside, I would like to find the > time to review this series in depth, but haven't been able to do so > while also trying to keep up with the volume of the rest of the list. > > I know that this topic was split out of a larger one. It may be worth > seeing if there is a way to split this topic out into multiple series > that are more easily review-able, but still sensible on their own. > > I haven't looked in enough depth to know myself whether such a cut > exists, but it is worth thinking about if you haven't done so already. I'm in the same boat -- I want to review this, but somehow never find the time to sit down and do it. I definitely won't get to it this week as I'll be out-of-office for most of the part. I've flagged this internally now at GitLab so that we can provide some more data with some of the repos that are on the bigger side to check whether we can confirm the findings and to prioritize its review. Patrick
On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote: > I've flagged this internally now at GitLab so that we can provide some > more data with some of the repos that are on the bigger side to check > whether we can confirm the findings and to prioritize its review. I suspect that you'll end up measuring no change assuming that you (AFAIK) use bitmaps and (I imagine) delta islands in your production configuration? This series is not compatible with either of those features to my knowledge. Thanks, Taylor
On 10/28/24 12:47 PM, Taylor Blau wrote: > On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote: >> I've flagged this internally now at GitLab so that we can provide some >> more data with some of the repos that are on the bigger side to check >> whether we can confirm the findings and to prioritize its review. > > I suspect that you'll end up measuring no change assuming that you > (AFAIK) use bitmaps and (I imagine) delta islands in your production > configuration? This series is not compatible with either of those > features to my knowledge. You are correct that this is not compatible with those features as-is. _Maybe_ there is potential to integrate them in the future, but that would require better understanding of whether the new compression mechanism valuable in enough cases (final storage size or maybe even in repacking time). At the very least, it would be helpful if some other large repos were tested to see how commonly this could help client-side users. Are there other aspects to a repo's structure that could be important to how effective this approach is? Thanks, -Stolee
On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote: > On 10/28/24 12:47 PM, Taylor Blau wrote: > > On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote: > > > I've flagged this internally now at GitLab so that we can provide some > > > more data with some of the repos that are on the bigger side to check > > > whether we can confirm the findings and to prioritize its review. > > > > I suspect that you'll end up measuring no change assuming that you > > (AFAIK) use bitmaps and (I imagine) delta islands in your production > > configuration? This series is not compatible with either of those > > features to my knowledge. > You are correct that this is not compatible with those features as-is. > _Maybe_ there is potential to integrate them in the future, but that > would require better understanding of whether the new compression > mechanism valuable in enough cases (final storage size or maybe even > in repacking time). I think the bitmap thing is not too big of a hurdle. The .bitmap file is the only spot we store name-hash values on-disk in the "hashcache" extension. Unfortunately, there is no easy way to reuse the format of the existing hashcache extension as-is to indicate to the reader whether they are recording traditional name-hash values, or the new --path-walk hash values. I suspect that you could either add a new extension for --path-walk hash values, or add a new variant of the hashcache extension that has a flag to indicate what kind of hash value it's recording. Of the two, I think the latter is preferred, since it would allow us to grow new hash functions on paths in the future without needing to add an additional extension (only a new bit in the existing one). > At the very least, it would be helpful if some other large repos were > tested to see how commonly this could help client-side users. Are > there other aspects to a repo's structure that could be important to > how effective this approach is? What measurements are you looking for here? I thought that you had already done an extensive job of measuring the client-side impact of pushing smaller packs and faster local repacks, no? Thanks, Taylor
On 10/28/24 1:25 PM, Taylor Blau wrote: > On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote: >> You are correct that this is not compatible with those features as-is. >> _Maybe_ there is potential to integrate them in the future, but that >> would require better understanding of whether the new compression >> mechanism valuable in enough cases (final storage size or maybe even >> in repacking time). > > I think the bitmap thing is not too big of a hurdle. The .bitmap file is > the only spot we store name-hash values on-disk in the "hashcache" > extension. > > Unfortunately, there is no easy way to reuse the format of the existing > hashcache extension as-is to indicate to the reader whether they are > recording traditional name-hash values, or the new --path-walk hash > values. The --path-walk option does not mess with the name-hash. You're thinking of the --full-name-hash feature [1] that was pulled out due to a lack of interest (and better results with --path-walk). [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/ >> At the very least, it would be helpful if some other large repos were >> tested to see how commonly this could help client-side users. Are >> there other aspects to a repo's structure that could be important to >> how effective this approach is? > > What measurements are you looking for here? I thought that you had > already done an extensive job of measuring the client-side impact of > pushing smaller packs and faster local repacks, no? I've done what I can with the repos I know about, but perhaps other folks have other repos they like to test that might present new aspects to the problem. For example, a colleague was testing this in a variety of Javascript repos and found that the node repo [2] was slightly worse with the --path-walk option. I've since discovered that this is only true when using a checked-out copy and the .git/index file is iterated, as some large source files with few versions become split across the boundary of "in the index" or "in commit history". (I am fixing this aspect as well in the next iteration, hence some reason for its delay.) [2] https://github.com/nodejs/node Thanks, -Stolee
On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote: > On 10/28/24 1:25 PM, Taylor Blau wrote: > > On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote: > > > > You are correct that this is not compatible with those features as-is. > > > _Maybe_ there is potential to integrate them in the future, but that > > > would require better understanding of whether the new compression > > > mechanism valuable in enough cases (final storage size or maybe even > > > in repacking time). > > > > I think the bitmap thing is not too big of a hurdle. The .bitmap file is > > the only spot we store name-hash values on-disk in the "hashcache" > > extension. > > > > Unfortunately, there is no easy way to reuse the format of the existing > > hashcache extension as-is to indicate to the reader whether they are > > recording traditional name-hash values, or the new --path-walk hash > > values. > > The --path-walk option does not mess with the name-hash. You're thinking > of the --full-name-hash feature [1] that was pulled out due to a lack of > interest (and better results with --path-walk). > > [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/ Ah, gotcha. Thanks for clarifying. What is the incompatibility between the two, then? Is it just that bitmaps give us the objects in pack- or pseudo-pack order, and we don't have a way to permute that back into the order that --path-walk would give us? If so, a couple of thoughts: - You could consider storing the path information for each blob and tree object in the bitmap using a trie-like structure. This would give you enough information to reconstruct the path-walk order (I think) at read-time, but at significant cost in terms of the on-disk size of the .bitmap. - Alternatively, if you construct the bitmap from a pack or packs that were generated in path-walk order, then you could store a permutation between pack order and path-walk order in the bitmap itself. - Alternatively still: if the actual pack *order* were dictated solely by path-walk, then neither of these would be necessary. That all said, I'm still not sure that there is a compatibility blocker here. Is the goal is to ensure that packs generated with --use-bitmap-index are still compact in the same way that they would be with your new --path-walk option? If so, I think matching the object order in a pack to the path walk order would achieve that goal, since the chunks that you end up reusing verbatim as a result of pack-reuse driven by the bitmap would already be delta-ified according to the --path-walk rules, so the resulting pack would appear similarly. OTOH, the order in which we pack objects is extremely important to performance as you no doubt are aware of. So changing that order to more closely match the --path-walk option should be done with great care. Anyway. All of that is to say that I want to better understand what does and doesn't work together between bitmaps and path-walk. Given my current understanding, it seems there are a couple of approaches to unifying these two things together, so it would be nice to be able to do so if possible. Thanks, Taylor
On 10/29/24 2:02 PM, Taylor Blau wrote: > On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote: >> On 10/28/24 1:25 PM, Taylor Blau wrote: >>> Unfortunately, there is no easy way to reuse the format of the existing >>> hashcache extension as-is to indicate to the reader whether they are >>> recording traditional name-hash values, or the new --path-walk hash >>> values. >> >> The --path-walk option does not mess with the name-hash. You're thinking >> of the --full-name-hash feature [1] that was pulled out due to a lack of >> interest (and better results with --path-walk). >> >> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/ > > Ah, gotcha. Thanks for clarifying. > > What is the incompatibility between the two, then? Is it just that > bitmaps give us the objects in pack- or pseudo-pack order, and we don't > have a way to permute that back into the order that --path-walk would > give us? The incompatibility of reading bitmaps and using the path-walk API is that the path-walk API does not check a bitmap to see if an object is already discovered. Thus, it does not use the reachability information from the bitmap at all and would parse commits and trees to find the objects that should be in the pack-file. It should also be worth noting that using something like 'git repack --path-walk' does not mean that future 'git pack-objects' executions from that packfile data need to use the --path-walk option. I expect that it should be painless to write bitmaps on top of a packfile created with 'git repack -adf --path-walk', but since most places doing so also likely want delta islands, I have not explored this option thoroughly. (Delta islands are their own challenge, since the path-walk API is not spreading the reachability information across the objects it walks. However, this could be remedied by doing a separate walk to identify islands using the normal method. I believe Peff had an idea in that direction in another thread. This requires some integration and testing that I don't have the expertise to provide.) > If so, a couple of thoughts: > ... Since the incompatibility is in a different direction, I don't think these thoughts were relevant to the problem. > OTOH, the order in which we pack objects is extremely important to > performance as you no doubt are aware of. So changing that order to more > closely match the --path-walk option should be done with great care. This is a place where I'm unsure about how the --path-walk option adjusts the object order within the pack. The packing list gets resorted to match the typical method, at least for how the delta compression window works. This would be another good reason to consider the --path-walk option in server environments very carefully. My patch series puts up guard rails specifically because it makes no claim to be effective in all of the dimensions that matter for those scenarios. Hopefully, others will be motivated enough to determine if the compression that's possible with this algorithm could be achieved in a way that is compatible with server needs. > Anyway. All of that is to say that I want to better understand what does > and doesn't work together between bitmaps and path-walk. Given my > current understanding, it seems there are a couple of approaches to > unifying these two things together, so it would be nice to be able to > do so if possible. I think this is an excellent opportunity for testing and debugging to build up more intuition with how the path-walk API works. When I submit the next version later tonight, the path-walk algorithm will be better documented. That said, I don't have any personal motivation to integrate the two together, so I don't expect to be contributing that integration point myself. I think that the results speak for themselves in the very common environment of a Git client without bitmaps. Thanks, -Stolee
On Wed, Oct 30, 2024 at 10:28:22PM -0400, Derrick Stolee wrote: > On 10/29/24 2:02 PM, Taylor Blau wrote: > > On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote: > >> On 10/28/24 1:25 PM, Taylor Blau wrote: > >>> Unfortunately, there is no easy way to reuse the format of the existing > >>> hashcache extension as-is to indicate to the reader whether they are > >>> recording traditional name-hash values, or the new --path-walk hash > >>> values. > >> > >> The --path-walk option does not mess with the name-hash. You're thinking > >> of the --full-name-hash feature [1] that was pulled out due to a lack of > >> interest (and better results with --path-walk). > >> > >> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/ > > > > Ah, gotcha. Thanks for clarifying. > > > > What is the incompatibility between the two, then? Is it just that > > bitmaps give us the objects in pack- or pseudo-pack order, and we don't > > have a way to permute that back into the order that --path-walk would > > give us? > > The incompatibility of reading bitmaps and using the path-walk API is > that the path-walk API does not check a bitmap to see if an object is > already discovered. Thus, it does not use the reachability information > from the bitmap at all and would parse commits and trees to find the > objects that should be in the pack-file. Sure, I think what I'm trying to understand here is whether this "incapability" is a fundamental one, or just that we haven't implemented checking bitmaps in the path-walk API yet. Thanks, Taylor