Message ID | 6676afe56db74828ba2532f3460e9b73a3ff20fd.1691426160.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | bloom: reuse existing Bloom filters when possible during upgrade | expand |
Taylor Blau <me@ttaylorr.com> writes: > diff --git a/commit-graph.c b/commit-graph.c > index 38edb12669..60e5f9ada7 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -317,12 +317,6 @@ static int graph_read_bloom_data(const unsigned char *chunk_start, > uint32_t hash_version; > hash_version = get_be32(chunk_start); > > - if (*c->commit_graph_changed_paths_version == -1) { > - *c->commit_graph_changed_paths_version = hash_version; > - } else if (hash_version != *c->commit_graph_changed_paths_version) { > - return 0; > - } Lots of things to notice in this patch, but the summary is that this patch looks correct. Here, we remove (1) the autodetection of changed paths version and (2) the storage of the result of that autodetection. (1) is restored in a subsequent code hunk, but (2) is never restored. Prior to this patch set, we needed (2) because we only wanted to load Bloom filters that match a given version, so if we autodetected a version, that version must be in effect for all future loads (which is why we needed to store that result immediately). But with this patch set, we support the loading of Bloom filters of varying versions, so storing it immediately is no longer needed. (Also, I don't think we need the commit_graph_changed_paths_version variable anymore.) > @@ -2390,8 +2384,7 @@ int write_commit_graph(struct object_directory *odb, > r->settings.commit_graph_changed_paths_version); > return 0; > } > - bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2 > - ? 2 : 1; > + bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version; As stated in the commit message, normalization of the hash version is performed later. > @@ -2423,10 +2416,18 @@ int write_commit_graph(struct object_directory *odb, > /* We have changed-paths already. Keep them in the next graph */ > if (g && g->bloom_filter_settings) { > ctx->changed_paths = 1; > - ctx->bloom_settings = g->bloom_filter_settings; > + > + /* don't propagate the hash_version unless unspecified */ > + if (bloom_settings.hash_version == -1) > + bloom_settings.hash_version = g->bloom_filter_settings->hash_version; > + bloom_settings.bits_per_entry = g->bloom_filter_settings->bits_per_entry; > + bloom_settings.num_hashes = g->bloom_filter_settings->num_hashes; > + bloom_settings.max_changed_paths = g->bloom_filter_settings->max_changed_paths; Here is where the autodetection is restored. Prior to this patch set, this part of the code does not perform autodetection - every hash_version in memory matches the version in commit_graph_changed_paths_version, so we're just copying over the value in that variable. But the nature of this part of the code has changed due to this patch set: all the g->bloom_filter_settings in memory may not have the same hash version, and may not match what the user specified in config. To compensate, we are more selective in what we propagate from g->bloom_filter_settings. > + bloom_settings.hash_version = bloom_settings.hash_version == 2 ? 2 : 1; And here we restore the normalization. Thanks - up to here looks good.
On Fri, Aug 11, 2023 at 03:00:30PM -0700, Jonathan Tan wrote: > Taylor Blau <me@ttaylorr.com> writes: > > diff --git a/commit-graph.c b/commit-graph.c > > index 38edb12669..60e5f9ada7 100644 > > --- a/commit-graph.c > > +++ b/commit-graph.c > > @@ -317,12 +317,6 @@ static int graph_read_bloom_data(const unsigned char *chunk_start, > > uint32_t hash_version; > > hash_version = get_be32(chunk_start); > > > > - if (*c->commit_graph_changed_paths_version == -1) { > > - *c->commit_graph_changed_paths_version = hash_version; > > - } else if (hash_version != *c->commit_graph_changed_paths_version) { > > - return 0; > > - } > > Lots of things to notice in this patch, but the summary is that this > patch looks correct. Thanks for the detailed analysis here. This is definitely the patch that I was most nervous about while writing, but I appreciate the second set of eyes. > (Also, I don't think we need the commit_graph_changed_paths_version > variable anymore.) Great catch, thanks. I cleaned that up in a separate commit after this one, since I think that this patch is already intricate enough. Thanks, Taylor
diff --git a/commit-graph.c b/commit-graph.c index 38edb12669..60e5f9ada7 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -317,12 +317,6 @@ static int graph_read_bloom_data(const unsigned char *chunk_start, uint32_t hash_version; hash_version = get_be32(chunk_start); - if (*c->commit_graph_changed_paths_version == -1) { - *c->commit_graph_changed_paths_version = hash_version; - } else if (hash_version != *c->commit_graph_changed_paths_version) { - return 0; - } - g->chunk_bloom_data = chunk_start; g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings)); g->bloom_filter_settings->hash_version = hash_version; @@ -2390,8 +2384,7 @@ int write_commit_graph(struct object_directory *odb, r->settings.commit_graph_changed_paths_version); return 0; } - bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2 - ? 2 : 1; + bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version; bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY", bloom_settings.bits_per_entry); bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES", @@ -2423,10 +2416,18 @@ int write_commit_graph(struct object_directory *odb, /* We have changed-paths already. Keep them in the next graph */ if (g && g->bloom_filter_settings) { ctx->changed_paths = 1; - ctx->bloom_settings = g->bloom_filter_settings; + + /* don't propagate the hash_version unless unspecified */ + if (bloom_settings.hash_version == -1) + bloom_settings.hash_version = g->bloom_filter_settings->hash_version; + bloom_settings.bits_per_entry = g->bloom_filter_settings->bits_per_entry; + bloom_settings.num_hashes = g->bloom_filter_settings->num_hashes; + bloom_settings.max_changed_paths = g->bloom_filter_settings->max_changed_paths; } } + bloom_settings.hash_version = bloom_settings.hash_version == 2 ? 2 : 1; + if (ctx->split) { struct commit_graph *g = ctx->r->objects->commit_graph;
In 9e4df4da07 (commit-graph: new filter ver. that fixes murmur3, 2023-08-01), we began ignoring the Bloom data ("BDAT") chunk for commit-graphs whose Bloom filters were computed using a hash version incompatible with the value of `commitGraph.changedPathVersion`. Now that the Bloom API has been hardened to discard these incompatible filters (with the exception of low-level APIs), we can safely load these Bloom filters unconditionally. We no longer want to return early from `graph_read_bloom_data()`, and similarly do not want to set the bloom_settings' `hash_version` field as a side-effect. The latter is because we want to wait until we know which Bloom settings we're using (either the defaults, from the GIT_TEST variables, or from the previous commit-graph layer) before deciding what hash_version to use. If we detect an existing BDAT chunk, we'll infer the rest of the settings (e.g., number of hashes, bits per entry, and maximum number of changed paths) from the earlier graph layer. The hash_version will be inferred from the previous layer as well, unless one has already been specified via configuration. Once all of that is done, we normalize the value of the hash_version to either "1" or "2". Signed-off-by: Taylor Blau <me@ttaylorr.com> --- commit-graph.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)