Message ID | cover.1689889382.git.jonathantanmy@google.com (mailing list archive) |
---|---|
Headers | show |
Series | Changed path filter hash fix and version bump | expand |
Jonathan Tan <jonathantanmy@google.com> writes: > Thanks, Junio and Taylor, for your reviews. I have included Taylor's > patches in this patch set. > > There seemed to be some merge conflicts when I tried to apply the > patches Taylor provided on the base that I built my patches on (that is, > the base of jt/path-filter-fix, namely, maint-2.40), so I have rebased > all my patches onto latest master. > > Jonathan Tan (4): > gitformat-commit-graph: describe version 2 of BDAT > t4216: test changed path filters with high bit paths > repo-settings: introduce commitgraph.changedPathsVersion > commit-graph: new filter ver. that fixes murmur3 > > Taylor Blau (3): > t/helper/test-read-graph.c: extract `dump_graph_info()` > bloom.h: make `load_bloom_filter_from_graph()` public > t/helper/test-read-graph: implement `bloom-filters` mode Thanks, I seem to have missed this one. Let's queue this version and merge it down to 'next', unless there is no other blocking comments in a few days. Thanks.
Jonathan Tan <jonathantanmy@google.com> writes: > Jonathan Tan (4): > gitformat-commit-graph: describe version 2 of BDAT > t4216: test changed path filters with high bit paths > repo-settings: introduce commitgraph.changedPathsVersion > commit-graph: new filter ver. that fixes murmur3 > > Taylor Blau (3): > t/helper/test-read-graph.c: extract `dump_graph_info()` > bloom.h: make `load_bloom_filter_from_graph()` public > t/helper/test-read-graph: implement `bloom-filters` mode After a week, nobody seems to have found anything worth saying about these patches. Does the design, especially the migration story, now look sensible to everybody? I was contemplating to mark the topic for 'next' after reading them myself once more. Thanks.
On Wed, Jul 26, 2023 at 01:39:14PM -0700, Junio C Hamano wrote: > Jonathan Tan <jonathantanmy@google.com> writes: > > > Jonathan Tan (4): > > gitformat-commit-graph: describe version 2 of BDAT > > t4216: test changed path filters with high bit paths > > repo-settings: introduce commitgraph.changedPathsVersion > > commit-graph: new filter ver. that fixes murmur3 > > > > Taylor Blau (3): > > t/helper/test-read-graph.c: extract `dump_graph_info()` > > bloom.h: make `load_bloom_filter_from_graph()` public > > t/helper/test-read-graph: implement `bloom-filters` mode > > After a week, nobody seems to have found anything worth saying about > these patches. Does the design, especially the migration story, now > look sensible to everybody? I was contemplating to mark the topic > for 'next' after reading them myself once more. Sorry for not getting to this sooner. I didn't notice anything during my review, but I think there may be a bug here. Suppose that we have an existing commit-graph with v1 Bloom filters. If we then try to rewrite that commit-graph using v2 Bloom filters, we *should* attempt to recompute the filter from scratch. But AFAICT, that isn't what happens. Here's my test setup: test_expect_success 'test' ' test_commit base && git repack -d && git -c commitGraph.changedPathVersion=1 commit-graph write --changed-paths && debug git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths ' if you attach a debugger to the second process, and break inside of get_or_compute_bloom_filter() when compute_if_not_present is set, you'll see that Git will pass along the existing *v1* Bloom filter, and then write its contents to the new commit-graph: (gdb) b get_or_compute_bloom_filter if compute_if_not_present Breakpoint 1 at 0x14340f: file bloom.c, line 260. (gdb) r Starting program: /home/ttaylorr/src/git/git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Breakpoint 1, get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0, compute_if_not_present=1, settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:260 260 if (computed) (gdb) until 271 get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0, compute_if_not_present=1, settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:271 271 load_bloom_filter_from_graph(r->objects->commit_graph, (gdb) p *filter $2 = {data = 0x0, len = 0} (gdb) n 275 if (filter->data && filter->len) (gdb) p *filter $3 = {data = 0x7ffff7fc24a8 "\210\210\322a\267\234\214s}\004J\265\313\201\241\032e\312\034", len = 2} If I'm parsing this all correctly, Git used the v1 filter corresponding to that commit, and did not recompute a new one. I think that this could lead to incorrect results if you use this to masquerade a v1 Bloom filter as a v2 one. Since they use different implementations (one correct, one not) of murmur3, that opens us up to false negatives, at which point all bets are off. So I think we want to be more careful about when we load the existing Bloom data or not. I think we probably want to load it in all cases, but only read it when we have compatible Bloom settings. That should stop us from exhibiting this kind of bug, but it also gives us a handle on existing Bloom data if we wanted to copy forward existing results where we can. If all of this tracks, I think that there is a gap in our test coverage that didn't catch this earlier. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > Sorry for not getting to this sooner. I didn't notice anything during my > review, but I think there may be a bug here. > > Suppose that we have an existing commit-graph with v1 Bloom filters. If > we then try to rewrite that commit-graph using v2 Bloom filters, we > *should* attempt to recompute the filter from scratch. But AFAICT, that > isn't what happens. > ... > If I'm parsing this all correctly, Git used the v1 filter corresponding > to that commit, and did not recompute a new one. > > I think that this could lead to incorrect results if you use this to > masquerade a v1 Bloom filter as a v2 one. That indeed is very bad. How hard it would be to construct a test case that fails if filter computed as v1 is marketed as v2? A test may be far easier to construct if it does not have to be end-to-end (e.g. instrument the codepath you followed through with the debugger with trace2 annotations and see the control takes the right branches by reading the log). > So I think we want to be more careful about when we load the existing > Bloom data or not. I think we probably want to load it in all cases, but > only read it when we have compatible Bloom settings. That should stop us > from exhibiting this kind of bug, but it also gives us a handle on > existing Bloom data if we wanted to copy forward existing results where > we can. > > If all of this tracks, I think that there is a gap in our test coverage > that didn't catch this earlier. Yeah, thanks for raising a concern. Jonathan?
Junio C Hamano <gitster@pobox.com> writes: > Taylor Blau <me@ttaylorr.com> writes: > > Suppose that we have an existing commit-graph with v1 Bloom filters. If > > we then try to rewrite that commit-graph using v2 Bloom filters, we > > *should* attempt to recompute the filter from scratch. But AFAICT, that > > isn't what happens. > > ... > > If I'm parsing this all correctly, Git used the v1 filter corresponding > > to that commit, and did not recompute a new one. > > > > I think that this could lead to incorrect results if you use this to > > masquerade a v1 Bloom filter as a v2 one. > > That indeed is very bad. How hard it would be to construct a test > case that fails if filter computed as v1 is marketed as v2? A test > may be far easier to construct if it does not have to be end-to-end > (e.g. instrument the codepath you followed through with the debugger > with trace2 annotations and see the control takes the right branches > by reading the log). Ah, thanks, Taylor, for so meticulously investigating this. I'll take a look. A test should be doable - we already have tests (the ones that use "get_first_changed_path_filter") that check the bytes of the filter generated, so we should be able to write a test that writes one version, writes the other, then checks the bytes. > > So I think we want to be more careful about when we load the existing > > Bloom data or not. I think we probably want to load it in all cases, but > > only read it when we have compatible Bloom settings. That should stop us > > from exhibiting this kind of bug, but it also gives us a handle on > > existing Bloom data if we wanted to copy forward existing results where > > we can. The intention in the current patch set was to not load it at all when we have incompatible Bloom settings because it appeared quite troublesome to notate which Bloom filter in memory is of which version. If we want to copy forward existing results, we can change that, but I don't know whether it's worth doing that (and if we think it's worth doing, this should probably go in another patch set). > > If all of this tracks, I think that there is a gap in our test coverage > > that didn't catch this earlier. > > Yeah, thanks for raising a concern. > > Jonathan? I'll take a look. Yes this does seem like a gap in test coverage - I thought the existing test that checks that Bloom filters are not used when a different version is requested would be sufficient, but apparently that's not the case.
On Thu, Jul 27, 2023 at 10:39:05AM -0700, Jonathan Tan wrote: > A test should be doable - we already have tests (the ones that use > "get_first_changed_path_filter") that check the bytes of the filter > generated, so we should be able to write a test that writes one version, > writes the other, then checks the bytes. Thanks for looking into it! > > > So I think we want to be more careful about when we load the existing > > > Bloom data or not. I think we probably want to load it in all cases, but > > > only read it when we have compatible Bloom settings. That should stop us > > > from exhibiting this kind of bug, but it also gives us a handle on > > > existing Bloom data if we wanted to copy forward existing results where > > > we can. > > The intention in the current patch set was to not load it at all when we > have incompatible Bloom settings because it appeared quite troublesome > to notate which Bloom filter in memory is of which version. If we want > to copy forward existing results, we can change that, but I don't know > whether it's worth doing that (and if we think it's worth doing, this > should probably go in another patch set). Yeah, I think having Bloom filters accessible from a commit-graph regardless of whether or not it matches our Bloom filter version is prerequisite to being able to implement something like this. I feel like this is important enough to do in the same patch set, or the same release to avoid surprising operators when their commit-graph write suddenly recomputes all of its Bloom filters. Since we already store the Bloom version that we're using in the commit-graph file itself, shouldn't it be something along the lines of sticking that value onto the bloom_filter when we read its contents? Although I don't think that you'd even need to annotate each individual filter, since you know that every pre-existing Bloom filter you are able to find has the version given by: the_repository->objects->commit_graph->bloom_filter_settings->hash_version right? Thanks, Taylor
Junio C Hamano <gitster@pobox.com> writes: > ... How hard it would be to construct a test > case that fails if filter computed as v1 is marketed as v2? There may be a more effective way for future-proofing, besides ensuring the test coverage. Although this series added a way to see which version the on-disk data is using using the version field, I do not think it touched the "struct bloom_filter" and "struct bloom_key" that represent the in-core data. If we had a member in these structures that get_or_compute_bloom_filter() can fill in from the on-disk structure or the version it used when it computed the filter anew, would it become easier to catch the case where we try to add a version 2 computed key to a filter that was read from version 1 on-disk structure, presumably at add_key_to_filter()? Thanks.
Taylor Blau <me@ttaylorr.com> writes: > > The intention in the current patch set was to not load it at all when we > > have incompatible Bloom settings because it appeared quite troublesome > > to notate which Bloom filter in memory is of which version. If we want > > to copy forward existing results, we can change that, but I don't know > > whether it's worth doing that (and if we think it's worth doing, this > > should probably go in another patch set). > > Yeah, I think having Bloom filters accessible from a commit-graph > regardless of whether or not it matches our Bloom filter version is > prerequisite to being able to implement something like this. > > I feel like this is important enough to do in the same patch set, or the > same release to avoid surprising operators when their commit-graph write > suddenly recomputes all of its Bloom filters. Suddenly reading many (or most) of the repo's trees would be a similar surprise, right? Also this would happen only if the server operator explicitly sets a changed path filter version. If they leave it as-is, commit graphs will still be written with the same version as the one on disk. > Since we already store the Bloom version that we're using in the > commit-graph file itself, shouldn't it be something along the lines of > sticking that value onto the bloom_filter when we read its contents? > > Although I don't think that you'd even need to annotate each individual > filter, since you know that every pre-existing Bloom filter you are able > to find has the version given by: > > the_repository->objects->commit_graph->bloom_filter_settings->hash_version > > right? > > Thanks, > Taylor Regarding consulting commit_graph->bloom_filter_settings->hash_version, the issue I ran into was that firstly, I didn't know what to do about commit_graph->base_graph (which also has its own bloom_filter_settings) and what to do if it had a contradictory hash_version. And even if we found a way to unify those, it is not true that every Bloom filter in memory is of that version, since we may have generated some that correspond to the version we're writing (not the version on disk). In particular, the Bloom filters we write come from a commit slab (bloom_filters in bloom.c) and in that slab, both Bloom filters from disk and Bloom filters that were generated in-process coexist. I also thought of your other proposal of augmenting struct bloom_filter to also include the version. The issue I ran into there is if, for a given commit, there already exists a Bloom filter read from disk with the wrong version, what should we do? Overwrite it, or store both versions in memory? (We can't just immediately output the Bloom filter to disk and forget about the new version, only storing its size so that we can generate the BIDX, because in the current code, generation and writing to disk are separate. We could try to refactor it, but I didn't want to make such a large change to reduce the possibility of bugs.) Both storing the version number and storing an additional pointer for a second version would increase memory consumption too, even when supporting two versions isn't needed, but maybe this isn't a big deal.
On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > The intention in the current patch set was to not load it at all when we > > > have incompatible Bloom settings because it appeared quite troublesome > > > to notate which Bloom filter in memory is of which version. If we want > > > to copy forward existing results, we can change that, but I don't know > > > whether it's worth doing that (and if we think it's worth doing, this > > > should probably go in another patch set). > > > > Yeah, I think having Bloom filters accessible from a commit-graph > > regardless of whether or not it matches our Bloom filter version is > > prerequisite to being able to implement something like this. > > > > I feel like this is important enough to do in the same patch set, or the > > same release to avoid surprising operators when their commit-graph write > > suddenly recomputes all of its Bloom filters. > > Suddenly reading many (or most) of the repo's trees would be a similar > surprise, right? That's a good point. I think in general I'd expect Git to avoid recomputing Bloom filters where that work can be avoided, if the work in order to detect whether or not we need to recompute a filter is cheap enough to carry out. > Also this would happen only if the server operator explicitly sets a > changed path filter version. If they leave it as-is, commit graphs will > still be written with the same version as the one on disk. I think that I could live with that if the default is to leave things as-is. I still think that it's worth it to have this functionality to propagate Bloom filters forward should ship in Git, but we can work on that outside of this series. > Regarding consulting commit_graph->bloom_filter_settings->hash_version, > the issue I ran into was that firstly, I didn't know what to do about > commit_graph->base_graph (which also has its own bloom_filter_settings) > and what to do if it had a contradictory hash_version. And even if > we found a way to unify those, it is not true that every Bloom filter > in memory is of that version, since we may have generated some that > correspond to the version we're writing (not the version on disk). > In particular, the Bloom filters we write come from a commit slab > (bloom_filters in bloom.c) and in that slab, both Bloom filters from > disk and Bloom filters that were generated in-process coexist. Would we ever want to have a commit-graph chain with mixed Bloom filter versions? We avoid mixing generation number schemes across multiple layers of a commit-graph chain. But I don't see any reason that we should or need to have a similar restriction in place for the Bloom filter version. Both are readable, as long as the user-provided configuration allows them to be. We just have to interpret them differently depending on what layer of the graph (and therefore, what Bloom filter version they are) they come from. Sorry for thinking aloud a little there. I think that this means that we at minimum have to keep in context the commit-graph layer we found the Bloom filter in so that we can tie that back to its Bloom filter version. That might just mean that we have to tag each Bloom filter with the version it was computed under, or maybe we already have the commit-graph layer in context, in which case we shouldn't need an additional field. My gut is telling me that we probably *do* need such a field, since we don't often have a reference to the particular layer that we found a Bloom filter in, just the tip of the commit-graph chain that it came from. > I also thought of your other proposal of augmenting struct bloom_filter > to also include the version. The issue I ran into there is if, for a > given commit, there already exists a Bloom filter read from disk with > the wrong version, what should we do? Overwrite it, or store both > versions in memory? (We can't just immediately output the Bloom filter > to disk and forget about the new version, only storing its size so that > we can generate the BIDX, because in the current code, generation and > writing to disk are separate. We could try to refactor it, but I didn't > want to make such a large change to reduce the possibility of bugs.) > Both storing the version number and storing an additional pointer for a > second version would increase memory consumption too, even when > supporting two versions isn't needed, but maybe this isn't a big deal. It's likely that I'm missing something here, but what is stopping us from discarding the old Bloom filter as soon as we generate the new one? We shouldn't need to load the old filter again out of the commit slab, right? Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote: > > Suddenly reading many (or most) of the repo's trees would be a similar > > surprise, right? > > That's a good point. I think in general I'd expect Git to avoid > recomputing Bloom filters where that work can be avoided, if the work in > order to detect whether or not we need to recompute a filter is cheap > enough to carry out. Makes sense. I just don't think that there is a cheap way to detect if a filter does not need to be recomputed (the closest way I think we have is something that will require reading a lot of trees in a repo). > > Also this would happen only if the server operator explicitly sets a > > changed path filter version. If they leave it as-is, commit graphs will > > still be written with the same version as the one on disk. > > I think that I could live with that if the default is to leave things > as-is. Ah, thanks. > I still think that it's worth it to have this functionality to propagate > Bloom filters forward should ship in Git, but we can work on that > outside of this series. Makes sense. > > Regarding consulting commit_graph->bloom_filter_settings->hash_version, > > the issue I ran into was that firstly, I didn't know what to do about > > commit_graph->base_graph (which also has its own bloom_filter_settings) > > and what to do if it had a contradictory hash_version. And even if > > we found a way to unify those, it is not true that every Bloom filter > > in memory is of that version, since we may have generated some that > > correspond to the version we're writing (not the version on disk). > > In particular, the Bloom filters we write come from a commit slab > > (bloom_filters in bloom.c) and in that slab, both Bloom filters from > > disk and Bloom filters that were generated in-process coexist. > > Would we ever want to have a commit-graph chain with mixed Bloom filter > versions? Probably not, but I wanted to be robust in case a third-party tool wrote a chain with mixed versions. > We avoid mixing generation number schemes across multiple layers of a > commit-graph chain. But I don't see any reason that we should or need to > have a similar restriction in place for the Bloom filter version. Both > are readable, as long as the user-provided configuration allows them to > be. Yes, that's true - there is no inherent reason why we can't mix them, unlike with generation numbers. > We just have to interpret them differently depending on what layer of > the graph (and therefore, what Bloom filter version they are) they come > from. > > Sorry for thinking aloud a little there. I think that this means that we > at minimum have to keep in context the commit-graph layer we found the > Bloom filter in so that we can tie that back to its Bloom filter > version. That might just mean that we have to tag each Bloom filter with > the version it was computed under, or maybe we already have the > commit-graph layer in context, in which case we shouldn't need an > additional field. > > My gut is telling me that we probably *do* need such a field, since we > don't often have a reference to the particular layer that we found a > Bloom filter in, just the tip of the commit-graph chain that it came > from. We'll need the additional field because we don't know which commit graph layer it comes from. In fact, we don't even know which *repo* the commit comes from, since the commit slab is global. (Moving the slab to being under a repo or under a commit graph layer would fix this.) But I think there still remains the question of whether we really need to support multiple versions in one Git invocation. > > I also thought of your other proposal of augmenting struct bloom_filter > > to also include the version. The issue I ran into there is if, for a > > given commit, there already exists a Bloom filter read from disk with > > the wrong version, what should we do? Overwrite it, or store both > > versions in memory? (We can't just immediately output the Bloom filter > > to disk and forget about the new version, only storing its size so that > > we can generate the BIDX, because in the current code, generation and > > writing to disk are separate. We could try to refactor it, but I didn't > > want to make such a large change to reduce the possibility of bugs.) > > Both storing the version number and storing an additional pointer for a > > second version would increase memory consumption too, even when > > supporting two versions isn't needed, but maybe this isn't a big deal. > > It's likely that I'm missing something here, but what is stopping us > from discarding the old Bloom filter as soon as we generate the new > one? We shouldn't need to load the old filter again out of the commit > slab, right? > > Thanks, > Taylor I did not look at the code closely enough except to see that there was a gap between generating the new-version Bloom filters and writing them to disk, and I was concerned that now, or in the future, there would be some code in that gap that reads the Bloom filter for a commit and expects the old-version Bloom filter there.
On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote: > On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote: > > Taylor Blau <me@ttaylorr.com> writes: > > > > The intention in the current patch set was to not load it at all when we > > > > have incompatible Bloom settings because it appeared quite troublesome > > > > to notate which Bloom filter in memory is of which version. If we want > > > > to copy forward existing results, we can change that, but I don't know > > > > whether it's worth doing that (and if we think it's worth doing, this > > > > should probably go in another patch set). > > > > > > Yeah, I think having Bloom filters accessible from a commit-graph > > > regardless of whether or not it matches our Bloom filter version is > > > prerequisite to being able to implement something like this. > > > > > > I feel like this is important enough to do in the same patch set, or the > > > same release to avoid surprising operators when their commit-graph write > > > suddenly recomputes all of its Bloom filters. > > > > Suddenly reading many (or most) of the repo's trees would be a similar > > surprise, right? > > That's a good point. I think in general I'd expect Git to avoid > recomputing Bloom filters where that work can be avoided, if the work in > order to detect whether or not we need to recompute a filter is cheap > enough to carry out. I spent some time implementing this (patches are available in the branch 'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom filter versions is kind of tricky, but do-able without having to implement too much on top of what's already there. But I don't think that it's enough to say that we can reuse a commit's Bloom filter iff that commit's tree has no paths with characters >= 0x80. Suppose we have such a tree, whose Bloom filter we believe to be reusable. If its first parent *does* have such a path, then that path would appear as a deletion relative to its first parent. So that path *would* be in the filter, meaning that it isn't reusable. So I think the revised condition is something like: a commit's Bloom filter is reusable when there are no paths with characters >= 0x80 in a tree-diff against its first parent. I think that ensuring that there are no such paths in both a commit's root tree, as well as its first parent's root tree is equivalent, since that removes the possibility of such a path showing up in its tree-diff. As long as we aren't generating a commit-graph with --stdin-packs, then we process commits in generation order, so we will see parents before their children. I think we could reuse existing filters in that case, but the condition is slightly more complex than I originally thought. Thanks, Taylor
On 8/2/2023 8:01 PM, Taylor Blau wrote: > On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote: >> That's a good point. I think in general I'd expect Git to avoid >> recomputing Bloom filters where that work can be avoided, if the work in >> order to detect whether or not we need to recompute a filter is cheap >> enough to carry out. > > I spent some time implementing this (patches are available in the branch > 'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom > filter versions is kind of tricky, but do-able without having to > implement too much on top of what's already there. > > But I don't think that it's enough to say that we can reuse a commit's > Bloom filter iff that commit's tree has no paths with characters >= > 0x80. Suppose we have such a tree, whose Bloom filter we believe to be > reusable. If its first parent *does* have such a path, then that path > would appear as a deletion relative to its first parent. So that path > *would* be in the filter, meaning that it isn't reusable. > > So I think the revised condition is something like: a commit's Bloom > filter is reusable when there are no paths with characters >= 0x80 in > a tree-diff against its first parent. I think that ensuring that there > are no such paths in both a commit's root tree, as well as its first > parent's root tree is equivalent, since that removes the possibility of > such a path showing up in its tree-diff. This condition is exactly "we computed the diff to know which paths were input to the filter" which is as difficult as recomputing the Bloom filter from scratch. I don't think there is much room to gain a performance improvement here. Thanks, -Stolee
On Thu, Aug 03, 2023 at 09:18:11AM -0400, Derrick Stolee wrote: > > So I think the revised condition is something like: a commit's Bloom > > filter is reusable when there are no paths with characters >= 0x80 in > > a tree-diff against its first parent. I think that ensuring that there > > are no such paths in both a commit's root tree, as well as its first > > parent's root tree is equivalent, since that removes the possibility of > > such a path showing up in its tree-diff. > > This condition is exactly "we computed the diff to know which paths were > input to the filter" which is as difficult as recomputing the Bloom filter > from scratch. I don't think there is much room to gain a performance > improvement here. I think that's true in the worst case, and certainly for repositories with many tree entries which have characters >= 0x80. But note that it's a heuristic in one direction only. If we know that a commit's root tree (and that of it's first parent, if it has one) is free of any such paths, then it's impossible for the first parent tree-diff to contain such an entry, and therefore we can reuse any existing filter. Of course, a commit's root tree (and its parent) may both have a path whose characters >= 0x80 while still not seeing a corresponding entry show up in the tree-diff if that path is unchanged between a commit and its first parent. I think if we were looking at every tree every time only to realize that we have to go back and compute its changed-path Bloom filter, this would be a non-starter. But since we "cache" the results of our walk via the tree object's flags bits, we can skip looking at many trees. In my testing, this showed a significant improvement on linux.git and git.git. My setup for testing is something like: $ git clone git@github.com:torvalds/linux.git $ cd linux $ git commit-graph write --reachable --changed-paths $ graph=".git/objects/info/commit-graph" $ mv $graph{,.bak} so that .git/objects/info/commit-graph.bak is a commit-graph with v1 changed-path Bloom filters for all commits in generation order. With that in place, I can do: $ git config commitGraph.changedPathsVersion 2 $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \ 'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths' , producing the following results on linux.git: Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 124.873 s ± 0.316 s [User: 124.081 s, System: 0.643 s] Range (min … max): 124.621 s … 125.227 s 3 runs Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 79.271 s ± 0.163 s [User: 74.611 s, System: 4.521 s] Range (min … max): 79.112 s … 79.437 s 3 runs Summary 'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran 1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths' On git.git, writing a new commit-graph after upgrading from v1 to v2 filters went from taking 4.163 seconds to 3.348 seconds, for a more modest 1.24x speed-up. Of course, all of this depends on how much of the tree meets the above conditions, so we'd expect worse results on repositories with paths that contain characters >= 0x80. I think we'd want some kind of mechanism (probably via config, not a GIT_TEST environment variable) to control whether or not to upgrade existing filters. Thanks, Taylor