diff mbox series

[RFC,4/6] commit-graph.c: unconditionally load Bloom filters

Message ID 6676afe56db74828ba2532f3460e9b73a3ff20fd.1691426160.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series bloom: reuse existing Bloom filters when possible during upgrade | expand

Commit Message

Taylor Blau Aug. 7, 2023, 4:37 p.m. UTC
In 9e4df4da07 (commit-graph: new filter ver. that fixes murmur3,
2023-08-01), we began ignoring the Bloom data ("BDAT") chunk for
commit-graphs whose Bloom filters were computed using a hash version
incompatible with the value of `commitGraph.changedPathVersion`.

Now that the Bloom API has been hardened to discard these incompatible
filters (with the exception of low-level APIs), we can safely load these
Bloom filters unconditionally.

We no longer want to return early from `graph_read_bloom_data()`, and
similarly do not want to set the bloom_settings' `hash_version` field as
a side-effect. The latter is because we want to wait until we know which
Bloom settings we're using (either the defaults, from the GIT_TEST
variables, or from the previous commit-graph layer) before deciding what
hash_version to use.

If we detect an existing BDAT chunk, we'll infer the rest of the
settings (e.g., number of hashes, bits per entry, and maximum number of
changed paths) from the earlier graph layer. The hash_version will be
inferred from the previous layer as well, unless one has already been
specified via configuration.

Once all of that is done, we normalize the value of the hash_version to
either "1" or "2".

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Comments

Jonathan Tan Aug. 11, 2023, 10 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:
> diff --git a/commit-graph.c b/commit-graph.c
> index 38edb12669..60e5f9ada7 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -317,12 +317,6 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
>  	uint32_t hash_version;
>  	hash_version = get_be32(chunk_start);
>  
> -	if (*c->commit_graph_changed_paths_version == -1) {
> -		*c->commit_graph_changed_paths_version = hash_version;
> -	} else if (hash_version != *c->commit_graph_changed_paths_version) {
> -		return 0;
> -	}

Lots of things to notice in this patch, but the summary is that this
patch looks correct.

Here, we remove (1) the autodetection of changed paths version and (2)
the storage of the result of that autodetection. (1) is restored in a
subsequent code hunk, but (2) is never restored.

Prior to this patch set, we needed (2) because we only wanted to load
Bloom filters that match a given version, so if we autodetected a
version, that version must be in effect for all future loads (which is
why we needed to store that result immediately). But with this patch
set, we support the loading of Bloom filters of varying versions, so
storing it immediately is no longer needed.

(Also, I don't think we need the commit_graph_changed_paths_version
variable anymore.)

> @@ -2390,8 +2384,7 @@ int write_commit_graph(struct object_directory *odb,
>  			r->settings.commit_graph_changed_paths_version);
>  		return 0;
>  	}
> -	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
> -		? 2 : 1;
> +	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version;

As stated in the commit message, normalization of the hash version is
performed later.

> @@ -2423,10 +2416,18 @@ int write_commit_graph(struct object_directory *odb,
>  		/* We have changed-paths already. Keep them in the next graph */
>  		if (g && g->bloom_filter_settings) {
>  			ctx->changed_paths = 1;
> -			ctx->bloom_settings = g->bloom_filter_settings;
> +
> +			/* don't propagate the hash_version unless unspecified */
> +			if (bloom_settings.hash_version == -1)
> +				bloom_settings.hash_version = g->bloom_filter_settings->hash_version;
> +			bloom_settings.bits_per_entry = g->bloom_filter_settings->bits_per_entry;
> +			bloom_settings.num_hashes = g->bloom_filter_settings->num_hashes;
> +			bloom_settings.max_changed_paths = g->bloom_filter_settings->max_changed_paths;

Here is where the autodetection is restored.

Prior to this patch set, this part of the code does not perform
autodetection - every hash_version in memory matches the version in
commit_graph_changed_paths_version, so we're just copying over the value
in that variable. But the nature of this part of the code has changed
due to this patch set: all the g->bloom_filter_settings in memory
may not have the same hash version, and may not match what the user
specified in config. To compensate, we are more selective in what we
propagate from g->bloom_filter_settings.

> +	bloom_settings.hash_version = bloom_settings.hash_version == 2 ? 2 : 1;

And here we restore the normalization.

Thanks - up to here looks good.
Taylor Blau Aug. 21, 2023, 8:40 p.m. UTC | #2
On Fri, Aug 11, 2023 at 03:00:30PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 38edb12669..60e5f9ada7 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -317,12 +317,6 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
> >  	uint32_t hash_version;
> >  	hash_version = get_be32(chunk_start);
> >
> > -	if (*c->commit_graph_changed_paths_version == -1) {
> > -		*c->commit_graph_changed_paths_version = hash_version;
> > -	} else if (hash_version != *c->commit_graph_changed_paths_version) {
> > -		return 0;
> > -	}
>
> Lots of things to notice in this patch, but the summary is that this
> patch looks correct.

Thanks for the detailed analysis here. This is definitely the patch that
I was most nervous about while writing, but I appreciate the second set
of eyes.

> (Also, I don't think we need the commit_graph_changed_paths_version
> variable anymore.)

Great catch, thanks. I cleaned that up in a separate commit after this
one, since I think that this patch is already intricate enough.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index 38edb12669..60e5f9ada7 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -317,12 +317,6 @@  static int graph_read_bloom_data(const unsigned char *chunk_start,
 	uint32_t hash_version;
 	hash_version = get_be32(chunk_start);
 
-	if (*c->commit_graph_changed_paths_version == -1) {
-		*c->commit_graph_changed_paths_version = hash_version;
-	} else if (hash_version != *c->commit_graph_changed_paths_version) {
-		return 0;
-	}
-
 	g->chunk_bloom_data = chunk_start;
 	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
 	g->bloom_filter_settings->hash_version = hash_version;
@@ -2390,8 +2384,7 @@  int write_commit_graph(struct object_directory *odb,
 			r->settings.commit_graph_changed_paths_version);
 		return 0;
 	}
-	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
-		? 2 : 1;
+	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version;
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
 						      bloom_settings.bits_per_entry);
 	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
@@ -2423,10 +2416,18 @@  int write_commit_graph(struct object_directory *odb,
 		/* We have changed-paths already. Keep them in the next graph */
 		if (g && g->bloom_filter_settings) {
 			ctx->changed_paths = 1;
-			ctx->bloom_settings = g->bloom_filter_settings;
+
+			/* don't propagate the hash_version unless unspecified */
+			if (bloom_settings.hash_version == -1)
+				bloom_settings.hash_version = g->bloom_filter_settings->hash_version;
+			bloom_settings.bits_per_entry = g->bloom_filter_settings->bits_per_entry;
+			bloom_settings.num_hashes = g->bloom_filter_settings->num_hashes;
+			bloom_settings.max_changed_paths = g->bloom_filter_settings->max_changed_paths;
 		}
 	}
 
+	bloom_settings.hash_version = bloom_settings.hash_version == 2 ? 2 : 1;
+
 	if (ctx->split) {
 		struct commit_graph *g = ctx->r->objects->commit_graph;