mbox series

[v6,00/16] bloom: changed-path Bloom filters v2 (& sundries)

Message ID cover.1706741516.git.me@ttaylorr.com (mailing list archive)
Headers show
Series bloom: changed-path Bloom filters v2 (& sundries) | expand

Message

Taylor Blau Jan. 31, 2024, 10:52 p.m. UTC
(Rebased onto the tip of 'master', which is now bc7ee2e5e1 (The twelfth
batch, 2024-01-30), at the time of writing).

This series is a minor reroll of the combined efforts of [1] and [2] to
introduce the v2 changed-path Bloom filters, which fixes a bug in our
existing implementation of murmur3 paths with non-ASCII characters (when
the "char" type is signed).

This version is nearly identical to the previous round, modulo some
patch reorganization prompted by a test failure noticed by SZEDER [3].
I'm still not sure how that failure got through, but this round fixes
it, and I've double checked that 'git rebase -ix "make test"' passes on
across all revisions.

Thanks again to Jonathan, Peff, and SZEDER who have helped a great deal
in assembling these patches. As usual, a range-diff is included below.

Thanks in advance for your review!

[1]: https://lore.kernel.org/git/cover.1684790529.git.jonathantanmy@google.com/
[2]: https://lore.kernel.org/git/cover.1691426160.git.me@ttaylorr.com/
[3]: https://lore.kernel.org/git/20240129212614.GB9612@szeder.dev/

Jonathan Tan (1):
  gitformat-commit-graph: describe version 2 of BDAT

Taylor Blau (15):
  t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
  revision.c: consult Bloom filters for root commits
  commit-graph: ensure Bloom filters are read with consistent settings
  t/helper/test-read-graph.c: extract `dump_graph_info()`
  bloom.h: make `load_bloom_filter_from_graph()` public
  t/helper/test-read-graph: implement `bloom-filters` mode
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  bloom: annotate filters with hash version
  bloom: prepare to discard incompatible Bloom filters
  commit-graph: unconditionally load Bloom filters
  commit-graph: new Bloom filter version that fixes murmur3
  object.h: fix mis-aligned flag bits table
  commit-graph: reuse existing Bloom filters where possible
  bloom: introduce `deinit_bloom_filters()`

 Documentation/config/commitgraph.txt     |  29 ++-
 Documentation/gitformat-commit-graph.txt |   9 +-
 bloom.c                                  | 208 ++++++++++++++-
 bloom.h                                  |  38 ++-
 commit-graph.c                           |  64 ++++-
 object.h                                 |   3 +-
 oss-fuzz/fuzz-commit-graph.c             |   2 +-
 repo-settings.c                          |   6 +-
 repository.h                             |   2 +-
 revision.c                               |  26 +-
 t/helper/test-bloom.c                    |   9 +-
 t/helper/test-read-graph.c               |  67 ++++-
 t/t0095-bloom.sh                         |   8 +
 t/t4216-log-bloom.sh                     | 310 ++++++++++++++++++++++-
 14 files changed, 723 insertions(+), 58 deletions(-)

Range-diff against v5:
 1:  6292e974c5 =  1:  9df34a2f4f t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
 2:  6437cf5888 =  2:  a6dc377f1b revision.c: consult Bloom filters for root commits
 3:  d5a73b9fa6 !  3:  a77ab941bc commit-graph: ensure Bloom filters are read with consistent settings
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +	done
     +'
     +
    -+test_expect_success 'ensure incompatible Bloom filters are ignored' '
    ++test_expect_success 'ensure Bloom filters with incompatible settings are ignored' '
     +	# Compute Bloom filters with "unusual" settings.
     +	git -C $repo rev-parse one >in &&
     +	GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=3 git -C $repo commit-graph write \
 4:  85af5e80ef =  4:  56a9fdaff0 gitformat-commit-graph: describe version 2 of BDAT
 5:  b5ff3a9b34 =  5:  7484a82f7f t/helper/test-read-graph.c: extract `dump_graph_info()`
 6:  1b45dc8d85 =  6:  48343f93a2 bloom.h: make `load_bloom_filter_from_graph()` public
 7:  42b5126016 =  7:  286fd7dcdb t/helper/test-read-graph: implement `bloom-filters` mode
 8:  2029a2e30c =  8:  7de7b89da0 t4216: test changed path filters with high bit paths
 9:  79c6c3025a !  9:  b13c9b8ff9 repo-settings: introduce commitgraph.changedPathsVersion
    @@ repository.h: struct repo_settings {
      	int gc_write_commit_graph;
      	int fetch_write_commit_graph;
      	int command_requires_full_index;
    -
    - ## t/t4216-log-bloom.sh ##
    -@@ t/t4216-log-bloom.sh: test_expect_success 'setup for mixed Bloom setting tests' '
    - 	done
    - '
    - 
    --test_expect_success 'ensure incompatible Bloom filters are ignored' '
    -+test_expect_success 'ensure Bloom filters with incompatible settings are ignored' '
    - 	# Compute Bloom filters with "unusual" settings.
    - 	git -C $repo rev-parse one >in &&
    - 	GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=3 git -C $repo commit-graph write \
    -@@ t/t4216-log-bloom.sh: test_expect_success 'merge graph layers with incompatible Bloom settings' '
    - 	test_must_be_empty err
    - '
    - 
    -+test_expect_success 'ensure Bloom filter with incompatible versions are ignored' '
    -+	rm "$repo/$graph" &&
    -+
    -+	git -C $repo log --oneline --no-decorate -- $CENT >expect &&
    -+
    -+	# Compute v1 Bloom filters for commits at the bottom.
    -+	git -C $repo rev-parse HEAD^ >in &&
    -+	git -C $repo commit-graph write --stdin-commits --changed-paths \
    -+		--split <in &&
    -+
    -+	# Compute v2 Bloomfilters for the rest of the commits at the top.
    -+	git -C $repo rev-parse HEAD >in &&
    -+	git -C $repo -c commitGraph.changedPathsVersion=2 commit-graph write \
    -+		--stdin-commits --changed-paths --split=no-merge <in &&
    -+
    -+	test_line_count = 2 $repo/$chain &&
    -+
    -+	git -C $repo log --oneline --no-decorate -- $CENT >actual 2>err &&
    -+	test_cmp expect actual &&
    -+
    -+	layer="$(head -n 1 $repo/$chain)" &&
    -+	cat >expect.err <<-EOF &&
    -+	warning: disabling Bloom filters for commit-graph layer $SQ$layer$SQ due to incompatible settings
    -+	EOF
    -+	test_cmp expect.err err
    -+'
    -+
    - get_first_changed_path_filter () {
    - 	test-tool read-graph bloom-filters >filters.dat &&
    - 	head -n 1 filters.dat
11:  a73c77a5ba = 10:  09c44c51a5 bloom: annotate filters with hash version
12:  0c62b36206 = 11:  d4995ef600 bloom: prepare to discard incompatible Bloom filters
13:  ff348fc49d ! 12:  c8e9bb7c88 commit-graph.c: unconditionally load Bloom filters
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    commit-graph.c: unconditionally load Bloom filters
    +    commit-graph: unconditionally load Bloom filters
     
         In an earlier commit, we began ignoring the Bloom data ("BDAT") chunk
         for commit-graphs whose Bloom filters were computed using a hash version
    -      incompatible with the value of `commitGraph.changedPathVersion`.
    +    incompatible with the value of `commitGraph.changedPathVersion`.
     
         Now that the Bloom API has been hardened to discard these incompatible
         filters (with the exception of low-level APIs), we can safely load these
    @@ Commit message
     
      ## commit-graph.c ##
     @@ commit-graph.c: static int graph_read_bloom_data(const unsigned char *chunk_start,
    - 
    + 	g->chunk_bloom_data_size = chunk_size;
      	hash_version = get_be32(chunk_start);
      
    --	if (*c->commit_graph_changed_paths_version == -1)
    --		*c->commit_graph_changed_paths_version = hash_version;
    --	else if (hash_version != *c->commit_graph_changed_paths_version)
    +-	if (hash_version != 1)
     -		return 0;
     -
    - 	g->chunk_bloom_data = chunk_start;
    - 	g->chunk_bloom_data_size = chunk_size;
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
    + 	g->bloom_filter_settings->hash_version = hash_version;
    + 	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
      	ctx->num_generation_data_overflows = 0;
      
    --	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
    --		? 2 : 1;
     +	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version;
      	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
      						      bloom_settings.bits_per_entry);
      	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      		/* We have changed-paths already. Keep them in the next graph */
    - 		if (g && g->bloom_filter_settings) {
    + 		if (g && g->chunk_bloom_data) {
      			ctx->changed_paths = 1;
     -			ctx->bloom_settings = g->bloom_filter_settings;
     +
10:  cfa5c358f8 ! 13:  d2f11c082d commit-graph: new Bloom filter version that fixes murmur3
    @@ bloom.h: int load_bloom_filter_from_graph(struct commit_graph *g,
      		    size_t len,
     
      ## commit-graph.c ##
    -@@ commit-graph.c: static int graph_read_bloom_index(const unsigned char *chunk_start,
    - 	return 0;
    - }
    - 
    -+struct graph_read_bloom_data_context {
    -+	struct commit_graph *g;
    -+	int *commit_graph_changed_paths_version;
    -+};
    -+
    - static int graph_read_bloom_data(const unsigned char *chunk_start,
    +@@ commit-graph.c: static int graph_read_bloom_data(const unsigned char *chunk_start,
      				  size_t chunk_size, void *data)
      {
    --	struct commit_graph *g = data;
    -+	struct graph_read_bloom_data_context *c = data;
    -+	struct commit_graph *g = c->g;
    - 	uint32_t hash_version;
    + 	struct commit_graph *g = data;
    +-	uint32_t hash_version;
      
      	if (chunk_size < BLOOMDATA_CHUNK_HEADER_SIZE) {
    + 		warning(_("ignoring too-small changed-path chunk"
     @@ commit-graph.c: static int graph_read_bloom_data(const unsigned char *chunk_start,
    - 		return -1;
    - 	}
      
    -+	hash_version = get_be32(chunk_start);
    -+
    -+	if (*c->commit_graph_changed_paths_version == -1)
    -+		*c->commit_graph_changed_paths_version = hash_version;
    -+	else if (hash_version != *c->commit_graph_changed_paths_version)
    -+		return 0;
    -+
      	g->chunk_bloom_data = chunk_start;
      	g->chunk_bloom_data_size = chunk_size;
     -	hash_version = get_be32(chunk_start);
    --
    --	if (hash_version != 1)
    --		return 0;
    --
    + 
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
    - 	g->bloom_filter_settings->hash_version = hash_version;
    +-	g->bloom_filter_settings->hash_version = hash_version;
    ++	g->bloom_filter_settings->hash_version = get_be32(chunk_start);
      	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
    -@@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
    - 	}
    - 
    - 	if (s->commit_graph_changed_paths_version) {
    -+		struct graph_read_bloom_data_context context = {
    -+			.g = graph,
    -+			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
    -+		};
    - 		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
    - 			   graph_read_bloom_index, graph);
    - 		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
    --			   graph_read_bloom_data, graph);
    -+			   graph_read_bloom_data, &context);
    - 	}
    - 
    - 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
    + 	g->bloom_filter_settings->bits_per_entry = get_be32(chunk_start + 8);
    + 	g->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      	}
      	if (!commit_graph_compatible(r))
    @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      
      	CALLOC_ARRAY(ctx, 1);
      	ctx->r = r;
    -@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
    - 	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
    - 	ctx->num_generation_data_overflows = 0;
    - 
    -+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
    -+		? 2 : 1;
    - 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
    - 						      bloom_settings.bits_per_entry);
    - 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      		g = ctx->r->objects->commit_graph;
      
    @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     -		if (g && g->chunk_bloom_data) {
     +		if (g && g->bloom_filter_settings) {
      			ctx->changed_paths = 1;
    - 			ctx->bloom_settings = g->bloom_filter_settings;
    - 		}
    + 
    + 			/* don't propagate the hash_version unless unspecified */
     
      ## t/helper/test-bloom.c ##
     @@ t/helper/test-bloom.c: static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    +@@ t/t4216-log-bloom.sh: test_expect_success 'merge graph layers with incompatible Bloom settings' '
    + 	test_must_be_empty err
    + '
    + 
    ++test_expect_success 'ensure Bloom filter with incompatible versions are ignored' '
    ++	rm "$repo/$graph" &&
    ++
    ++	git -C $repo log --oneline --no-decorate -- $CENT >expect &&
    ++
    ++	# Compute v1 Bloom filters for commits at the bottom.
    ++	git -C $repo rev-parse HEAD^ >in &&
    ++	git -C $repo commit-graph write --stdin-commits --changed-paths \
    ++		--split <in &&
    ++
    ++	# Compute v2 Bloomfilters for the rest of the commits at the top.
    ++	git -C $repo rev-parse HEAD >in &&
    ++	git -C $repo -c commitGraph.changedPathsVersion=2 commit-graph write \
    ++		--stdin-commits --changed-paths --split=no-merge <in &&
    ++
    ++	test_line_count = 2 $repo/$chain &&
    ++
    ++	git -C $repo log --oneline --no-decorate -- $CENT >actual 2>err &&
    ++	test_cmp expect actual &&
    ++
    ++	layer="$(head -n 1 $repo/$chain)" &&
    ++	cat >expect.err <<-EOF &&
    ++	warning: disabling Bloom filters for commit-graph layer $SQ$layer$SQ due to incompatible settings
    ++	EOF
    ++	test_cmp expect.err err
    ++'
    ++
    + get_first_changed_path_filter () {
    + 	test-tool read-graph bloom-filters >filters.dat &&
    + 	head -n 1 filters.dat
     @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
      	)
      '
14:  acc0a097b3 <  -:  ---------- commit-graph: drop unnecessary `graph_read_bloom_data_context`
15:  da401c8853 = 14:  9f54a376fb object.h: fix mis-aligned flag bits table
16:  2674967309 = 15:  67991dea7c commit-graph: reuse existing Bloom filters where possible
17:  bea7ec7b3f ! 16:  12058a074d bloom: introduce `deinit_bloom_filters()`
    @@ bloom.h: void add_key_to_filter(const struct bloom_key *key,
      	BLOOM_NOT_COMPUTED = (1 << 0),
     
      ## commit-graph.c ##
    -@@ commit-graph.c: struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
    - void close_commit_graph(struct raw_object_store *o)
    - {
    +@@ commit-graph.c: void close_commit_graph(struct raw_object_store *o)
    + 		return;
    + 
      	clear_commit_graph_data_slab(&commit_graph_data_slab);
     +	deinit_bloom_filters();
      	free_commit_graph(o->commit_graph);

base-commit: bc7ee2e5e16f0d1e710ef8fab3db59ab11f2bbe7