Message ID | a4cb5fe69247ba737a8373948c1f4ff8a150d283.1691426160.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | bloom: reuse existing Bloom filters when possible during upgrade | expand |
Taylor Blau <me@ttaylorr.com> writes: > diff --git a/bloom.h b/bloom.h > index 330a140520..2b1c124bb5 100644 > --- a/bloom.h > +++ b/bloom.h > @@ -110,8 +110,24 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, > const struct bloom_filter_settings *settings, > enum bloom_filter_computed *computed); > > -#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \ > - (r), (c), 0, NULL, NULL) > +/* > + * Find the Bloom filter associated with the given commit "c". > + * > + * If any of the following are true > + * > + * - the repository does not have a commit-graph > + * - it has a commit-graph, but reading the commit-graph is disabled > + * - the given commit does not have a Bloom filter computed > + * - there is a Bloom filter for commit "c", but it cannot be read because > + * disabled s/disabled/version incompatibility/, I think. > + * > + * , then `get_bloom_filter()` will return NULL. Otherwise, the corresponding > + * Bloom filter will be returned. > + * > + * For callers who wish to inspect Bloom filters with incompatible hash > + * versions, use get_or_compute_bloom_filter(). > + */ > +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c);
On Fri, Aug 11, 2023 at 02:48:06PM -0700, Jonathan Tan wrote: > Taylor Blau <me@ttaylorr.com> writes: > > diff --git a/bloom.h b/bloom.h > > index 330a140520..2b1c124bb5 100644 > > --- a/bloom.h > > +++ b/bloom.h > > @@ -110,8 +110,24 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, > > const struct bloom_filter_settings *settings, > > enum bloom_filter_computed *computed); > > > > -#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \ > > - (r), (c), 0, NULL, NULL) > > +/* > > + * Find the Bloom filter associated with the given commit "c". > > + * > > + * If any of the following are true > > + * > > + * - the repository does not have a commit-graph > > + * - it has a commit-graph, but reading the commit-graph is disabled > > + * - the given commit does not have a Bloom filter computed > > + * - there is a Bloom filter for commit "c", but it cannot be read because > > + * disabled > > s/disabled/version incompatibility/, I think. Well spotted, thanks. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > diff --git a/bloom.c b/bloom.c > index 9b6a30f6f6..739fa093ba 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -250,6 +250,23 @@ static void init_truncated_large_filter(struct bloom_filter *filter, > filter->version = version; > } > > +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c) > +{ > + struct bloom_filter *filter; > + int hash_version; > + > + filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL); > + if (!filter) > + return NULL; > + > + prepare_repo_settings(r); > + hash_version = r->settings.commit_graph_changed_paths_version; > + > + if (!(hash_version == -1 || hash_version == filter->version)) > + return NULL; /* unusable filter */ > + return filter; > +} I missed this the last time, but this should match what fill_bloom_key() does. Use get_bloom_filter_settings(), then compare filter->version to version 2 if hash_version is 2, and to version 1 otherwise. Everything else looks good.
On Thu, Aug 24, 2023 at 03:20:51PM -0700, Jonathan Tan wrote: > Taylor Blau <me@ttaylorr.com> writes: > > diff --git a/bloom.c b/bloom.c > > index 9b6a30f6f6..739fa093ba 100644 > > --- a/bloom.c > > +++ b/bloom.c > > @@ -250,6 +250,23 @@ static void init_truncated_large_filter(struct bloom_filter *filter, > > filter->version = version; > > } > > > > +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c) > > +{ > > + struct bloom_filter *filter; > > + int hash_version; > > + > > + filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL); > > + if (!filter) > > + return NULL; > > + > > + prepare_repo_settings(r); > > + hash_version = r->settings.commit_graph_changed_paths_version; > > + > > + if (!(hash_version == -1 || hash_version == filter->version)) > > + return NULL; /* unusable filter */ > > + return filter; > > +} > > I missed this the last time, but this should match what fill_bloom_key() > does. Use get_bloom_filter_settings(), I'm not sure I'm following. Are you saying that we should use get_bloom_filter_settings() instead of reading the value from r->settings directly? > then compare filter->version to version 2 if hash_version is 2, and to > version 1 otherwise. Hmm. I think we're already doing the right thing here, no? IIUC, you're saying to do something like: struct bloom_filter_settings *s = get_bloom_filter_settings(r); struct bloom_filter *f = get_or_compute_bloom_filter(r, c, ...); int hash_version; if (!f) return NULL; prepare_repo_settings(r); hash_version = r->settings.commit_graph_changed_paths_version; if (!(hash_version == -1 || hash_version == s->hash_version)) return NULL; /* incompatible */ return f; ? If so, I think that we're already OK here, since s->hash_version is always going to take the same value as f->version, since f->version is assigned in bloom.c::load_bloom_filter_from_graph(), which does: filter->version = g->bloom_filter_settings->hash_version; Or are we talking about two different things entirely? ;-) Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > I'm not sure I'm following. Are you saying that we should use > get_bloom_filter_settings() instead of reading the value from > r->settings directly? > > > then compare filter->version to version 2 if hash_version is 2, and to > > version 1 otherwise. > > Hmm. I think we're already doing the right thing here, no? IIUC, you're > saying to do something like: > > struct bloom_filter_settings *s = get_bloom_filter_settings(r); > struct bloom_filter *f = get_or_compute_bloom_filter(r, c, ...); > int hash_version; > > if (!f) > return NULL; > > prepare_repo_settings(r); > hash_version = r->settings.commit_graph_changed_paths_version; > > if (!(hash_version == -1 || hash_version == s->hash_version)) > return NULL; /* incompatible */ > return f; > > ? > > If so, I think that we're already OK here, since s->hash_version is > always going to take the same value as f->version, since f->version is > assigned in bloom.c::load_bloom_filter_from_graph(), which does: > > filter->version = g->bloom_filter_settings->hash_version; > > Or are we talking about two different things entirely? ;-) > > Thanks, > Taylor Ah, sorry for not being clear. What I meant is in how you compute hash_version after we have found that filter is not NULL. > +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c) > +{ > + struct bloom_filter *filter; > + int hash_version; > + > + filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL); > + if (!filter) > + return NULL; > + > + prepare_repo_settings(r); Up to here is fine. > + hash_version = r->settings.commit_graph_changed_paths_version; Instead of doing this, do this (well, move the struct declaration to the top): struct bloom_filter_settings *s = get_bloom_filter_settings(r); hash_version = s->hash_version == 2 ? 2 : 1; > + if (!(hash_version == -1 || hash_version == filter->version)) No need for the comparison to -1 here. > + return NULL; /* unusable filter */ > + return filter; > +} This is fine.
On Thu, Aug 24, 2023 at 04:05:27PM -0700, Jonathan Tan wrote: > Up to here is fine. > > > + hash_version = r->settings.commit_graph_changed_paths_version; > > Instead of doing this, do this (well, move the struct declaration to > the top): > > struct bloom_filter_settings *s = get_bloom_filter_settings(r); > hash_version = s->hash_version == 2 ? 2 : 1; Do we need this normalization? We assign s->hash_version in commit-graph.c::graph_read_bloom_data() by reading it from the start of the BDAT chunk, so this should only ever be 1 or 2. > > + if (!(hash_version == -1 || hash_version == filter->version)) > > No need for the comparison to -1 here. I'm not sure I understand your suggestion. When we fetch the filter from get_or_compute_bloom_filter(), we have filter->version set to the hash_version from the containing graph's Bloom filter settings. So (besides the normalization), I would think that: struct bloom_filter_settings *s = get_bloom_filter_settings(r); struct bloom_filter *f = get_bloom_filter(...); assert(s->hash_version == f->version); would hold. I think the check that we want to make is instead: is this Bloom filter's version (or equivalently, the hash version indicated by that graph's BDAT chunk) something that we can read? And I think "what we can read" here is dictated by the commit_graph_changed_paths_version member of our repository_settings, no? Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > On Thu, Aug 24, 2023 at 04:05:27PM -0700, Jonathan Tan wrote: > > Up to here is fine. > > > > > + hash_version = r->settings.commit_graph_changed_paths_version; > > > > Instead of doing this, do this (well, move the struct declaration to > > the top): > > > > struct bloom_filter_settings *s = get_bloom_filter_settings(r); > > hash_version = s->hash_version == 2 ? 2 : 1; > > Do we need this normalization? We assign s->hash_version in > commit-graph.c::graph_read_bloom_data() by reading it from the start of > the BDAT chunk, so this should only ever be 1 or 2. I'm not sure offhand if we do...I wrote it this way to match fill_bloom_key(), but fill_bloom_key() was written in that way because it was the clearest, not specifically because it needed to normalize. > > > + if (!(hash_version == -1 || hash_version == filter->version)) > > > > No need for the comparison to -1 here. > > I'm not sure I understand your suggestion. When we fetch the filter from > get_or_compute_bloom_filter(), we have filter->version set to the > hash_version from the containing graph's Bloom filter settings. > > So (besides the normalization), I would think that: > > struct bloom_filter_settings *s = get_bloom_filter_settings(r); > struct bloom_filter *f = get_bloom_filter(...); > > assert(s->hash_version == f->version); > > would hold. My mention to avoid the comparison to -1 was just for completeness - since we're normalizing the value of hash_version to 1 or 2, we no longer need to compare it to -1. As for whether s->hash_version is always equal to f->version, I think that it may not be true if for some reason, there are multiple commit graph files on disk, not all with the same Bloom filter version. > I think the check that we want to make is instead: is this Bloom > filter's version (or equivalently, the hash version indicated by that > graph's BDAT chunk) something that we can read? I think it's not "something that we can read" (eventually, we can read all versions, we just treat them differently) but "the version that fill_bloom_key" will use. We don't want this function to produce a Bloom filter of version X and then have the calling code subsequently use it with a Bloom key of version Y. > And I think "what we can > read" here is dictated by the commit_graph_changed_paths_version member > of our repository_settings, no? I don't think commit_graph_changed_paths_version always dictates something - it could be -1 (which you have probably seen, since you check it against -1 in the current version of the patch). One of the points of my suggestion is to avoid this field completely.
On Tue, Aug 29, 2023 at 09:49:26AM -0700, Jonathan Tan wrote: > > > > + if (!(hash_version == -1 || hash_version == filter->version)) > > > > > > No need for the comparison to -1 here. > > > > I'm not sure I understand your suggestion. When we fetch the filter from > > get_or_compute_bloom_filter(), we have filter->version set to the > > hash_version from the containing graph's Bloom filter settings. > > > > So (besides the normalization), I would think that: > > > > struct bloom_filter_settings *s = get_bloom_filter_settings(r); > > struct bloom_filter *f = get_bloom_filter(...); > > > > assert(s->hash_version == f->version); > > > > would hold. > > My mention to avoid the comparison to -1 was just for completeness > - since we're normalizing the value of hash_version to 1 or 2, we no > longer need to compare it to -1. > > As for whether s->hash_version is always equal to f->version, I think > that it may not be true if for some reason, there are multiple commit > graph files on disk, not all with the same Bloom filter version. Apologies for not quite grokking this, but I am still somewhat confused. Suppose we applied something like your suggestion on top of this patch, like so: --- 8< --- diff --git a/bloom.c b/bloom.c index 739fa093ba..ee81976342 100644 --- a/bloom.c +++ b/bloom.c @@ -253,16 +253,16 @@ static void init_truncated_large_filter(struct bloom_filter *filter, struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c) { struct bloom_filter *filter; - int hash_version; + struct bloom_filter_settings *settings; filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL); if (!filter) return NULL; prepare_repo_settings(r); - hash_version = r->settings.commit_graph_changed_paths_version; + settings = get_bloom_filter_settings(r); - if (!(hash_version == -1 || hash_version == filter->version)) + if (filter->version != (settings->hash_version == 2 ? 2 : 1)) return NULL; /* unusable filter */ return filter; } --- >8 --- We have a handful of failing tests, notably including t4216.1, which tries to read settings->hash_version, but fails since settings is NULL. I believe this happens when we have a computed Bloom filter prepared for some commit, but have not yet written a commit graph containing that (or any) Bloom filter. But I think we're talking about different things here. The point of get_bloom_filter() as a wrapper around get_or_compute_bloom_filter() is to only return Bloom filters that are readable under the configuration commitGraph.changedPathsVersion. We have a handle on what version was used to compute Bloom filters in each layer of the graph by reading its "version" field, which is written via load_bloom_filter_from_graph(). So we want to return a non-NULL filter exactly when we (a) have a Bloom filter computed for the given commit, and (b) its version matches the version specified in commitGraph.chagnedPathsVersion. Are you saying that we need to do the normalization ahead of time so that we don't emit filters that have different hash versions when working in a commit-graph chain where each layer may use different Bloom filter settings? If so, then I think the change we'd need to make is limited to: --- 8< --- diff --git a/bloom.c b/bloom.c index 739fa093ba..d5cc23b0a8 100644 --- a/bloom.c +++ b/bloom.c @@ -260,9 +260,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c) return NULL; prepare_repo_settings(r); - hash_version = r->settings.commit_graph_changed_paths_version; + hash_version = r->settings.commit_graph_changed_paths_version == 2 ? 2 : 1; - if (!(hash_version == -1 || hash_version == filter->version)) + if (hash_version == filter->version) return NULL; /* unusable filter */ return filter; } --- >8 --- But that doesn't quite work, either, since it's breaking some assumption from the caller and causing us not to write out any Bloom filters (I haven't investigated further, but I'm not sure that this is the right path in the first place...). So I am not sure what you are suggesting, but I am sure that I'm missing something. Thanks, Taylor
Here are answers to your questions, but I'll also reply to the latest version stating that I'm OK with this patch set being merged (with my reasons). Taylor Blau <me@ttaylorr.com> writes: > Apologies for not quite grokking this, but I am still somewhat confused. > > Suppose we applied something like your suggestion on top of this patch, > like so: > > --- 8< --- > diff --git a/bloom.c b/bloom.c > index 739fa093ba..ee81976342 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -253,16 +253,16 @@ static void init_truncated_large_filter(struct bloom_filter *filter, > struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c) > { > struct bloom_filter *filter; > - int hash_version; > + struct bloom_filter_settings *settings; > > filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL); > if (!filter) > return NULL; > > prepare_repo_settings(r); > - hash_version = r->settings.commit_graph_changed_paths_version; > + settings = get_bloom_filter_settings(r); > > - if (!(hash_version == -1 || hash_version == filter->version)) > + if (filter->version != (settings->hash_version == 2 ? 2 : 1)) > return NULL; /* unusable filter */ > return filter; > } > --- >8 --- > > We have a handful of failing tests, notably including t4216.1, which > tries to read settings->hash_version, but fails since settings is NULL. > I believe this happens when we have a computed Bloom filter prepared for > some commit, but have not yet written a commit graph containing that (or > any) Bloom filter. Ah, I discovered this independently too. I think it happens when we write version 2 Bloom filters to a repo that has no Bloom filters currently. So, r->settings.commit_graph_changed_paths_version is set to 2, but settings is NULL. I think the solution has to be a combination of both (use commit_graph_changed_paths_version, but if it is -1, check get_bloom_filter_settings()). But having said that, as I said above, we can go with what you have currently. > But I think we're talking about different things here. The point of > get_bloom_filter() as a wrapper around get_or_compute_bloom_filter() is > to only return Bloom filters that are readable under the configuration > commitGraph.changedPathsVersion. But this is my contention - sometimes commitGraph.changedPathsVersion is -1, so we need to find out which version is readable from elsewhere. > We have a handle on what version was used to compute Bloom filters in > each layer of the graph by reading its "version" field, which is written > via load_bloom_filter_from_graph(). > > So we want to return a non-NULL filter exactly when we (a) have a Bloom > filter computed for the given commit, and (b) its version matches the > version specified in commitGraph.chagnedPathsVersion. Yes, except add "or autodetected from the first commit graph file that we read if nothing is specified in commitGraph.changedPathsVersion" to the end. > Are you saying that we need to do the normalization ahead of time so > that we don't emit filters that have different hash versions when > working in a commit-graph chain where each layer may use different Bloom > filter settings? By "emit" do you mean write filters to disk? If yes, I'm worried about reading them, not writing them - for example, when running "git log" with a path. I am worried precisely about the commit-graph chain in which different layers have different Bloom settings, yes.
diff --git a/bloom.c b/bloom.c index 9b6a30f6f6..739fa093ba 100644 --- a/bloom.c +++ b/bloom.c @@ -250,6 +250,23 @@ static void init_truncated_large_filter(struct bloom_filter *filter, filter->version = version; } +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c) +{ + struct bloom_filter *filter; + int hash_version; + + filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL); + if (!filter) + return NULL; + + prepare_repo_settings(r); + hash_version = r->settings.commit_graph_changed_paths_version; + + if (!(hash_version == -1 || hash_version == filter->version)) + return NULL; /* unusable filter */ + return filter; +} + struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, struct commit *c, int compute_if_not_present, @@ -275,7 +292,8 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, filter, graph_pos); } - if (filter->data && filter->len) + if ((filter->data && filter->len) && + (!settings || settings->hash_version == filter->version)) return filter; if (!compute_if_not_present) return NULL; diff --git a/bloom.h b/bloom.h index 330a140520..2b1c124bb5 100644 --- a/bloom.h +++ b/bloom.h @@ -110,8 +110,24 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, const struct bloom_filter_settings *settings, enum bloom_filter_computed *computed); -#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \ - (r), (c), 0, NULL, NULL) +/* + * Find the Bloom filter associated with the given commit "c". + * + * If any of the following are true + * + * - the repository does not have a commit-graph + * - it has a commit-graph, but reading the commit-graph is disabled + * - the given commit does not have a Bloom filter computed + * - there is a Bloom filter for commit "c", but it cannot be read because + * disabled + * + * , then `get_bloom_filter()` will return NULL. Otherwise, the corresponding + * Bloom filter will be returned. + * + * For callers who wish to inspect Bloom filters with incompatible hash + * versions, use get_or_compute_bloom_filter(). + */ +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c); int bloom_filter_contains(const struct bloom_filter *filter, const struct bloom_key *key,
Callers use the inline `get_bloom_filter()` implementation as a thin wrapper around `get_or_compute_bloom_filter()`. The former calls the latter with a value of "0" for `compute_if_not_present`, making `get_bloom_filter()` the default read-only path for fetching an existing Bloom filter. Callers expect the value returned from `get_bloom_filter()` is usable, that is that it's compatible with the configured value corresponding to `commitGraph.changedPathsVersion`. This is OK, since the commit-graph machinery only initializes its BDAT chunk (thereby enabling it to service Bloom filter queries) when the Bloom filter hash_version is compatible with our settings. So any value returned by `get_bloom_filter()` is trivially useable. However, subsequent commits will load the BDAT chunk even when the Bloom filters are built with incompatible hash versions. Prepare to handle this by teaching `get_bloom_filter()` to discard filters that are incompatible with the configured hash version. Callers who wish to read incompatible filters (e.g., for upgrading filters from v1 to v2) may use the lower level routine, `get_or_compute_bloom_filter()`. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- bloom.c | 20 +++++++++++++++++++- bloom.h | 20 ++++++++++++++++++-- 2 files changed, 37 insertions(+), 3 deletions(-)