[v5,0/4] Changed path filter hash fix and version bump

Message ID: cover.1689283789.git.jonathantanmy@google.com

Message

Jonathan Tan July 13, 2023, 9:42 p.m. UTC
Sorry it took me a while to get back to this. Looking at the existing
code, Bloom filters are passed around a lot without context, especially
when writing: they are generated into a commit slab, and when it is
time to write them to disk, they are taken from that commit slab.
Rather than annotating every place where they are passed around, I
thought it better to stick to the single-version approach of version 4
of this series (per Git invocation and per repo, only one version).
That also sidesteps two questions: what happens if there happen to be
multiple commit graphs, each with its own Bloom filter version (Git
cannot generate this, but a hex editor can), and what happens if we
want to write a version different from the one currently stored in the
commit slab. With auto-detection of that version, I think we have what
we need: in regular operation, Git will run with whatever version is on
disk, and when it is time to migrate, the user can explicitly specify
the version.
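
As a rough illustration of the read-side rule described above (a sketch,
not the patch itself): the helper name `accept_on_disk_version` is
hypothetical, and the value conventions (-1 for auto-detect, 0 for
disabled) are taken from the range-diff below.

```c
/*
 * Hypothetical helper, for illustration only.
 * `configured` holds the per-invocation setting; `on_disk` is the
 * version number read from the on-disk Bloom filter data.
 * Returns 1 if the on-disk filters should be used, 0 if ignored.
 */
static int accept_on_disk_version(int *configured, int on_disk)
{
	if (*configured == 0)
		return 0;		/* filters disabled: read nothing */
	if (*configured == -1) {
		*configured = on_disk;	/* auto-detect: adopt what is on disk */
		return 1;
	}
	return *configured == on_disk;	/* explicit version: must match */
}
```

Once auto-detection has adopted a version, later filters in the same
invocation are held to it, which is what keeps the "one version per
invocation" property.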

I did not implement the mitigation of not using the Bloom filters when
a high-bit path is sought because, as Stolee says, this is useful only
when mixing Git implementations and will slow down operations (without
any increase in correctness) in the absence of such a mix [1]. But I can
implement this if need be.

[1] https://lore.kernel.org/git/e57b2272-b269-b705-3d42-d32e0b410f03@github.com/
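
For readers new to the underlying problem, here is a small sketch of why
bytes >= 0x80 matter. It assumes the version 1 discrepancy comes from a
path byte being sign-extended when `char` is signed; the function names
are hypothetical and this is not the murmur3 code from bloom.c.

```c
#include <stdint.h>

/* Widen one path byte the way a platform with signed char would. */
static uint32_t widen_as_signed(unsigned char byte)
{
	return (uint32_t)(int32_t)(signed char)byte;
}

/* Widen one path byte the way a platform with unsigned char would. */
static uint32_t widen_as_unsigned(unsigned char byte)
{
	return (uint32_t)byte;
}

/* A "high-bit path" is one containing any byte >= 0x80. */
static int path_has_high_bit(const char *path)
{
	for (; *path; path++)
		if ((unsigned char)*path >= 0x80)
			return 1;
	return 0;
}
```

For bytes below 0x80 the two widenings agree, so only high-bit paths can
hash differently between the two kinds of platform. The mitigation
discussed above would amount to skipping the filters whenever
`path_has_high_bit()` is true for the sought path.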

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt     |  19 +++-
 Documentation/gitformat-commit-graph.txt |   9 +-
 bloom.c                                  |  65 ++++++++++++-
 bloom.h                                  |   8 +-
 commit-graph.c                           |  33 +++++--
 oss-fuzz/fuzz-commit-graph.c             |   2 +-
 repo-settings.c                          |   6 +-
 repository.h                             |   2 +-
 t/helper/test-bloom.c                    |   9 +-
 t/t0095-bloom.sh                         |   8 ++
 t/t4216-log-bloom.sh                     | 117 +++++++++++++++++++++++
 11 files changed, 256 insertions(+), 22 deletions(-)

Range-diff against v4:
1:  a5955cda3d ! 1:  52e281eef0 gitformat-commit-graph: describe version 2 of BDAT
    @@ Documentation/gitformat-commit-graph.txt: All multi-byte numbers are in network
      	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
      	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
     -	in Probabilistic Verification"
    -+	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
    ++	in Probabilistic Verification". Version 1 Bloom filters have a bug that appears
     +	when char is signed and the repository has path names that have characters >=
     +	0x80; Git supports reading and writing them, but this ability will be removed
     +	in a future version of Git.
2:  68732120f9 ! 2:  94a4c7af38 t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +test_expect_success 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
     +		printf "52a9" >expect &&
    -+		get_first_changed_path_filter >actual)
    ++		get_first_changed_path_filter >actual &&
    ++		test_cmp expect actual)
     +'
     +
     +# expect will not match actual if char is unsigned by default. Write the test
3:  44cbcc6a69 ! 3:  131095666d repo-settings: introduce commitgraph.changedPathsVersion
    @@ Commit message
         repo-settings: introduce commitgraph.changedPathsVersion
     
         A subsequent commit will introduce another version of the changed-path
    -    filter in the commit graph file. In order to control which version is
    -    to be accepted when read (and which version to write), a config variable
    -    is needed.
    +    filter in the commit graph file. In order to control which version to
    +    write (and read), a config variable is needed.
     
         Therefore, introduce this config variable. For forwards compatibility,
         teach Git to not read commit graphs when the config variable
    @@ Commit message
         This commit does not change the behavior of writing (Git writes changed
         path filters when explicitly instructed regardless of any config
         variable), but a subsequent commit will restrict Git such that it will
    -    only write when commitgraph.changedPathsVersion is 0, 1, or 2.
    +    only write when commitgraph.changedPathsVersion is a recognized value.
     
         Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
         Signed-off-by: Junio C Hamano <gitster@pobox.com>
    @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
     -	If true, then git will use the changed-path Bloom filters in the
     -	commit-graph file (if it exists, and they are present). Defaults to
     -	true. See linkgit:git-commit-graph[1] for more information.
    -+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
    ++	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
     +	changedPathsVersion=0 if false.
     +
     +commitGraph.changedPathsVersion::
     +	Specifies the version of the changed-path Bloom filters that Git will read and
    -+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
    ++	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
     +	match the version set in this config variable will be ignored.
     ++
    -+Defaults to 1.
    ++Defaults to -1.
    +++
    ++If -1, Git will use the version of the changed-path Bloom filters in the
    ++repository, defaulting to 1 if there are none.
     ++
     +If 0, git will write version 1 Bloom filters when instructed to write.
     ++
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
     -	if (s->commit_graph_read_changed_paths) {
    -+	if (s->commit_graph_changed_paths_version == 1) {
    ++	if (s->commit_graph_changed_paths_version != 0) {
      		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
      			   &graph->chunk_bloom_indexes);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
    @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
     +	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
     +	repo_cfg_int(r, "commitgraph.changedpathsversion",
     +		     &r->settings.commit_graph_changed_paths_version,
    -+		     readChangedPaths ? 1 : 0);
    ++		     readChangedPaths ? -1 : 0);
      	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
      	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
      
4:  6dee3bfa70 ! 4:  47ba89c565 commit-graph: new filter ver. that fixes murmur3
    @@ Commit message
         So this patch does not include any mechanism to "salvage" changed path
         filters from repositories. There is also no "mixed" mode - for each
         invocation of Git, reading and writing changed path filters are done
    -    with the same version number.
    +    with the same version number; this version number may be explicitly
    +    stated (typically if the user knows which version they need) or
    +    automatically determined from the version of the existing changed path
    +    filters in the repository.
     
         There is a change in write_commit_graph(). graph_read_bloom_data()
         makes it possible for chunk_bloom_data to be non-NULL but
    @@ Documentation/config/commitgraph.txt: commitGraph.readChangedPaths::
      
      commitGraph.changedPathsVersion::
      	Specifies the version of the changed-path Bloom filters that Git will read and
    --	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
    -+	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
    +-	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
    ++	write. May be -1, 0, 1, or 2. Any changed-path Bloom filters on disk that do not
      	match the version set in this config variable will be ignored.
      +
    - Defaults to 1.
    + Defaults to -1.
     
      ## bloom.c ##
     @@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g,
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      
     +struct graph_read_bloom_data_data {
     +	struct commit_graph *g;
    -+	int commit_graph_changed_paths_version;
    ++	int *commit_graph_changed_paths_version;
     +};
     +
      static int graph_read_bloom_data(const unsigned char *chunk_start,
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      	hash_version = get_be32(chunk_start);
      
     -	if (hash_version != 1)
    -+	if (hash_version != d->commit_graph_changed_paths_version)
    - 		return 0;
    +-		return 0;
    ++	if (*d->commit_graph_changed_paths_version == -1) {
    ++		*d->commit_graph_changed_paths_version = hash_version;
    ++	} else if (hash_version != *d->commit_graph_changed_paths_version) {
    ++ 		return 0;
    ++	}
      
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
    + 	g->bloom_filter_settings->hash_version = hash_version;
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
    - 			graph->read_generation_data = 1;
      	}
      
    --	if (s->commit_graph_changed_paths_version == 1) {
    -+	if (s->commit_graph_changed_paths_version == 1
    -+	    || s->commit_graph_changed_paths_version == 2) {
    + 	if (s->commit_graph_changed_paths_version != 0) {
     +		struct graph_read_bloom_data_data data = {
     +			.g = graph,
    -+			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
    ++			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
     +		};
      		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
      			   &graph->chunk_bloom_indexes);
    @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
      	ctx->num_generation_data_overflows = 0;
      
    -+	if (r->settings.commit_graph_changed_paths_version < 0
    ++	if (r->settings.commit_graph_changed_paths_version < -1
     +	    || r->settings.commit_graph_changed_paths_version > 2) {
     +		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
     +			r->settings.commit_graph_changed_paths_version);
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    +@@ t/t4216-log-bloom.sh: get_bdat_offset () {
    + 		.git/objects/info/commit-graph
    + }
    + 
    ++get_changed_path_filter_version () {
    ++	BDAT_OFFSET=$(get_bdat_offset) &&
    ++	perl -0777 -ne \
    ++		'print unpack("H*", substr($_, '$BDAT_OFFSET', 4))' \
    ++		.git/objects/info/commit-graph
    ++}
    ++
    + get_first_changed_path_filter () {
    + 	BDAT_OFFSET=$(get_bdat_offset) &&
    + 	perl -0777 -ne \
    +@@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, version 1 changed-path' '
    + 	git -C highbit1 commit-graph write --reachable --changed-paths
    + '
    + 
    +-test_expect_success 'setup check value of version 1 changed-path' '
    ++test_expect_success 'check value of version 1 changed-path' '
    + 	(cd highbit1 &&
    + 		printf "52a9" >expect &&
    + 		get_first_changed_path_filter >actual &&
     @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
      		test_bloom_filters_used "-- $CENT")
      '
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +		test_bloom_filters_not_used "-- $CENT")
     +'
     +
    ++test_expect_success 'version 1 changed-path used when autodetect requested' '
    ++	(cd highbit1 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		test_bloom_filters_used "-- $CENT")
    ++'
    ++
    ++test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' '
    ++	test_commit -C highbit1 c1double "$CENT$CENT" &&
    ++	git -C highbit1 commit-graph write --reachable --changed-paths &&
    ++	(cd highbit1 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		printf "00000001" >expect &&
    ++		get_changed_path_filter_version >actual &&
    ++		test_cmp expect actual)
    ++'
    ++
     +test_expect_success 'set up repo with high bit path, version 2 changed-path' '
     +	git init highbit2 &&
     +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +		git config --add commitgraph.changedPathsVersion 1 &&
     +		test_bloom_filters_not_used "-- $CENT")
     +'
    ++
    ++test_expect_success 'version 2 changed-path used when autodetect requested' '
    ++	(cd highbit2 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		test_bloom_filters_used "-- $CENT")
    ++'
    ++
    ++test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' '
    ++	test_commit -C highbit2 c2double "$CENT$CENT" &&
    ++	git -C highbit2 commit-graph write --reachable --changed-paths &&
    ++	(cd highbit2 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		printf "00000002" >expect &&
    ++		get_changed_path_filter_version >actual &&
    ++		test_cmp expect actual)
    ++'
     +
      test_done

Comments

Junio C Hamano July 13, 2023, 10:16 p.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> I did not implement the mitigation of not using the Bloom filters when
> a high-bit path is sought because, as Stolee says, this is useful only
> when mixing Git implementations and will slow down operations (without
> any increase in correctness) in the absence of such a mix [1].

Sensible, I guess.

>     @@ Commit message
>          This commit does not change the behavior of writing (Git writes changed
>          path filters when explicitly instructed regardless of any config
>          variable), but a subsequent commit will restrict Git such that it will
>     -    only write when commitgraph.changedPathsVersion is 0, 1, or 2.
>     +    only write when commitgraph.changedPathsVersion is a recognized value.

This is nicer.

>          Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>          Signed-off-by: Junio C Hamano <gitster@pobox.com>
>     @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
>      -	If true, then git will use the changed-path Bloom filters in the
>      -	commit-graph file (if it exists, and they are present). Defaults to
>      -	true. See linkgit:git-commit-graph[1] for more information.
>     -+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
>     ++	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
>      +	changedPathsVersion=0 if false.

I forgot to comment on this part earlier, but does the context make
it clear enough that these `changedPathsVersion` references are
about the `commitGraph.changedPathsVersion` configuration variable
without it being fully spelled out?  They sit next to each other right
now, so it may not be too bad.  If they appeared across more distance,
I would be worried, though.

>      +commitGraph.changedPathsVersion::
>      +	Specifies the version of the changed-path Bloom filters that Git will read and
>     -+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
>     ++	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
>      +	match the version set in this config variable will be ignored.

So, any time the user configures this to a different value, we will
start to ignore the existing changed-path-filters data in the
repository, and when we are told to write commit-graph, we will
construct changed-path-filters data using the new version?

>      ++
>     -+Defaults to 1.
>     ++Defaults to -1.
>     +++
>     ++If -1, Git will use the version of the changed-path Bloom filters in the
>     ++repository, defaulting to 1 if there are none.

OK, that was misleading.  The configuration can say "-1" and it does
not mean "I'll ignore anything other than version -1"---it means
"I'll read anything".  The earlier statement should be toned down so
that we do not surprise readers, perhaps

    When set to a positive integer value, any changed-path Bloom
    filters on disk whose version is different from the value are
    ignored.

to signal that 0 and negative are special.  Then the readers can
anticipate that special cases are described next.

    When set to -1, then ...
    When set to 0, then ...
    Defaults to -1.
    
When set to the special value -1, what version will we write?

>      +If 0, git will write version 1 Bloom filters when instructed to write.

And we will only read 0 and refuse to read 1?  Or we will read both
0 and 1?

Thanks.
Junio C Hamano July 13, 2023, 10:59 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

>>      +If 0, git will write version 1 Bloom filters when instructed to write.
>
> And we will only read 0 and refuse to read 1?  Or we will read both
> 0 and 1?

Answering myself (only this part): since setting the "version"
variable to 0 is equivalent to setting the "read" variable to "false",
we will refuse to read anything.
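
A minimal sketch of that equivalence, assuming the defaulting shown in
the repo-settings.c hunk of the range-diff (`readChangedPaths ? -1 : 0`).
The function and parameter names here are hypothetical:

```c
/* Default for the version setting, derived from the deprecated boolean. */
static int default_changed_paths_version(int read_changed_paths)
{
	return read_changed_paths ? -1 : 0;
}

/*
 * Effective setting: an explicitly configured version wins; otherwise
 * fall back to the default derived from the deprecated boolean.
 */
static int effective_changed_paths_version(int has_version, int version,
					   int read_changed_paths)
{
	if (has_version)
		return version;
	return default_changed_paths_version(read_changed_paths);
}

/* With an effective value of 0, no changed-path filters are read. */
static int reads_any_filters(int effective_version)
{
	return effective_version != 0;
}
```

So "read" set to false with no explicit "version" yields 0, and 0 means
nothing is read, whatever version is on disk.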
Jonathan Tan July 14, 2023, 6:48 p.m. UTC | #3
Junio C Hamano <gitster@pobox.com> writes:
> >          Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> >          Signed-off-by: Junio C Hamano <gitster@pobox.com>
> >     @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
> >      -	If true, then git will use the changed-path Bloom filters in the
> >      -	commit-graph file (if it exists, and they are present). Defaults to
> >      -	true. See linkgit:git-commit-graph[1] for more information.
> >     -+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
> >     ++	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
> >      +	changedPathsVersion=0 if false.
> 
> I forgot to comment on this part earlier, but does the context make
> it clear enough that these `changedPathsVersion` references are
> about the `commitGraph.changedPathsVersion` configuration variable
> without it being fully spelled out?  They sit next to each other right
> now, so it may not be too bad.  If they appeared across more distance,
> I would be worried, though.

Ah, probably better to spell it out. I'll change it.

> >      +commitGraph.changedPathsVersion::
> >      +	Specifies the version of the changed-path Bloom filters that Git will read and
> >     -+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
> >     ++	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
> >      +	match the version set in this config variable will be ignored.
> 
> So, any time the user configures this to a different value, we will
> start to ignore the existing changed-path-filters data in the
> repository, and when we are told to write commit-graph, we will
> construct changed-path-filters data using the new version?

Yes.

> >     -+Defaults to 1.
> >     ++Defaults to -1.
> >     +++
> >     ++If -1, Git will use the version of the changed-path Bloom filters in the
> >     ++repository, defaulting to 1 if there are none.
> 
> OK, that was misleading.  The configuration can say "-1" and it does
> not mean "I'll ignore anything other than version -1"---it means
> "I'll read anything".  The earlier statement should be toned down so
> that we do not surprise readers, perhaps

Ah, good point. Will do.

>     When set to a positive integer value, any changed-path Bloom
>     filters on disk whose version is different from the value are
>     ignored.
> 
> to signal that 0 and negative are special.  Then the readers can
> anticipate that special cases are described next.
> 
>     When set to -1, then ...
>     When set to 0, then ...
>     Defaults to -1.
>     
> When set to the special value -1, what version will we write?
> 
> >      +If 0, git will write version 1 Bloom filters when instructed to write.
> 
> And we will only read 0 and refuse to read 1?  Or we will read both
> 0 and 1?
> 
> Thanks.

Currently, there is only version 1 (no version 0) and after all the
patches in this patch set are applied, there will be version 1 and
version 2. I think that with your suggestions above, it will be clearer
to the reader.
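
To make the write side concrete, here is a sketch of how I read the
semantics quoted above; it is an illustration under those assumptions,
not code from the patches. The name `version_to_write` is hypothetical,
as is the convention that a return value of 0 means "unsupported
setting, write no filters".

```c
/*
 * configured: value of commitgraph.changedPathsVersion
 * on_disk:    version of the existing filters, or 0 if there are none
 * Returns the filter version to write, or 0 if the setting is not
 * supported (the patch warns in that case).
 */
static int version_to_write(int configured, int on_disk)
{
	if (configured < -1 || configured > 2)
		return 0;			/* unrecognized value */
	if (configured == -1)
		return on_disk ? on_disk : 1;	/* keep existing, else 1 */
	if (configured == 0)
		return 1;			/* reads nothing, writes v1 */
	return configured;			/* explicit 1 or 2 */
}
```

With -1 an existing version 2 repository stays at version 2, while an
explicit 2 in a version 1 repository migrates it on the next write.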