diff mbox series

[RFC,2/6] bloom: prepare to discard incompatible Bloom filters

Message ID a4cb5fe69247ba737a8373948c1f4ff8a150d283.1691426160.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series bloom: reuse existing Bloom filters when possible during upgrade | expand

Commit Message

Taylor Blau Aug. 7, 2023, 4:37 p.m. UTC
Callers use the inline `get_bloom_filter()` implementation as a thin
wrapper around `get_or_compute_bloom_filter()`. The former calls the
latter with a value of "0" for `compute_if_not_present`, making
`get_bloom_filter()` the default read-only path for fetching an existing
Bloom filter.

Callers expect the value returned from `get_bloom_filter()` is usable,
that is that it's compatible with the configured value corresponding to
`commitGraph.changedPathsVersion`.

This is OK, since the commit-graph machinery only initializes its BDAT
chunk (thereby enabling it to service Bloom filter queries) when the
Bloom filter hash_version is compatible with our settings. So any value
returned by `get_bloom_filter()` is trivially useable.

However, subsequent commits will load the BDAT chunk even when the Bloom
filters are built with incompatible hash versions. Prepare to handle
this by teaching `get_bloom_filter()` to discard filters that are
incompatible with the configured hash version.

Callers who wish to read incompatible filters (e.g., for upgrading
filters from v1 to v2) may use the lower level routine,
`get_or_compute_bloom_filter()`.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c | 20 +++++++++++++++++++-
 bloom.h | 20 ++++++++++++++++++--
 2 files changed, 37 insertions(+), 3 deletions(-)

Comments

Jonathan Tan Aug. 11, 2023, 9:48 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:
> diff --git a/bloom.h b/bloom.h
> index 330a140520..2b1c124bb5 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -110,8 +110,24 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
>  						 const struct bloom_filter_settings *settings,
>  						 enum bloom_filter_computed *computed);
>  
> -#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
> -	(r), (c), 0, NULL, NULL)
> +/*
> + * Find the Bloom filter associated with the given commit "c".
> + *
> + * If any of the following are true
> + *
> + *   - the repository does not have a commit-graph
> + *   - it has a commit-graph, but reading the commit-graph is disabled
> + *   - the given commit does not have a Bloom filter computed
> + *   - there is a Bloom filter for commit "c", but it cannot be read because
> + *     disabled

s/disabled/version incompatibility/, I think.

> + *
> + * , then `get_bloom_filter()` will return NULL. Otherwise, the corresponding
> + * Bloom filter will be returned.
> + *
> + * For callers who wish to inspect Bloom filters with incompatible hash
> + * versions, use get_or_compute_bloom_filter().
> + */
> +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c);
Taylor Blau Aug. 21, 2023, 8:23 p.m. UTC | #2
On Fri, Aug 11, 2023 at 02:48:06PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > diff --git a/bloom.h b/bloom.h
> > index 330a140520..2b1c124bb5 100644
> > --- a/bloom.h
> > +++ b/bloom.h
> > @@ -110,8 +110,24 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
> >  						 const struct bloom_filter_settings *settings,
> >  						 enum bloom_filter_computed *computed);
> >
> > -#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
> > -	(r), (c), 0, NULL, NULL)
> > +/*
> > + * Find the Bloom filter associated with the given commit "c".
> > + *
> > + * If any of the following are true
> > + *
> > + *   - the repository does not have a commit-graph
> > + *   - it has a commit-graph, but reading the commit-graph is disabled
> > + *   - the given commit does not have a Bloom filter computed
> > + *   - there is a Bloom filter for commit "c", but it cannot be read because
> > + *     disabled
>
> s/disabled/version incompatibility/, I think.

Well spotted, thanks.

Thanks,
Taylor
Jonathan Tan Aug. 24, 2023, 10:20 p.m. UTC | #3
Taylor Blau <me@ttaylorr.com> writes:
> diff --git a/bloom.c b/bloom.c
> index 9b6a30f6f6..739fa093ba 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -250,6 +250,23 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
>  	filter->version = version;
>  }
>  
> +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
> +{
> +	struct bloom_filter *filter;
> +	int hash_version;
> +
> +	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
> +	if (!filter)
> +		return NULL;
> +
> +	prepare_repo_settings(r);
> +	hash_version = r->settings.commit_graph_changed_paths_version;
> +
> +	if (!(hash_version == -1 || hash_version == filter->version))
> +		return NULL; /* unusable filter */
> +	return filter;
> +}

I missed this the last time, but this should match what fill_bloom_key()
does. Use get_bloom_filter_settings(), then compare filter->version to
version 2 if hash_version is 2, and to version 1 otherwise.

Everything else looks good.
Taylor Blau Aug. 24, 2023, 10:47 p.m. UTC | #4
On Thu, Aug 24, 2023 at 03:20:51PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > diff --git a/bloom.c b/bloom.c
> > index 9b6a30f6f6..739fa093ba 100644
> > --- a/bloom.c
> > +++ b/bloom.c
> > @@ -250,6 +250,23 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
> >  	filter->version = version;
> >  }
> >
> > +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
> > +{
> > +	struct bloom_filter *filter;
> > +	int hash_version;
> > +
> > +	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
> > +	if (!filter)
> > +		return NULL;
> > +
> > +	prepare_repo_settings(r);
> > +	hash_version = r->settings.commit_graph_changed_paths_version;
> > +
> > +	if (!(hash_version == -1 || hash_version == filter->version))
> > +		return NULL; /* unusable filter */
> > +	return filter;
> > +}
>
> I missed this the last time, but this should match what fill_bloom_key()
> does. Use get_bloom_filter_settings(),

I'm not sure I'm following. Are you saying that we should use
get_bloom_filter_settings() instead of reading the value from
r->settings directly?

> then compare filter->version to version 2 if hash_version is 2, and to
> version 1 otherwise.

Hmm. I think we're already doing the right thing here, no? IIUC, you're
saying to do something like:

    struct bloom_filter_settings *s = get_bloom_filter_settings(r);
    struct bloom_filter *f = get_or_compute_bloom_filter(r, c, ...);
    int hash_version;

    if (!f)
      return NULL;

    prepare_repo_settings(r);
    hash_version = r->settings.commit_graph_changed_paths_version;

    if (!(hash_version == -1 || hash_version == s->hash_version))
      return NULL; /* incompatible */
    return f;

?

If so, I think that we're already OK here, since s->hash_version is
always going to take the same value as f->version, since f->version is
assigned in bloom.c::load_bloom_filter_from_graph(), which does:

    filter->version = g->bloom_filter_settings->hash_version;

Or are we talking about two different things entirely? ;-)

Thanks,
Taylor
Jonathan Tan Aug. 24, 2023, 11:05 p.m. UTC | #5
Taylor Blau <me@ttaylorr.com> writes:
> I'm not sure I'm following. Are you saying that we should use
> get_bloom_filter_settings() instead of reading the value from
> r->settings directly?
> 
> > then compare filter->version to version 2 if hash_version is 2, and to
> > version 1 otherwise.
> 
> Hmm. I think we're already doing the right thing here, no? IIUC, you're
> saying to do something like:
> 
>     struct bloom_filter_settings *s = get_bloom_filter_settings(r);
>     struct bloom_filter *f = get_or_compute_bloom_filter(r, c, ...);
>     int hash_version;
> 
>     if (!f)
>       return NULL;
> 
>     prepare_repo_settings(r);
>     hash_version = r->settings.commit_graph_changed_paths_version;
> 
>     if (!(hash_version == -1 || hash_version == s->hash_version))
>       return NULL; /* incompatible */
>     return f;
> 
> ?
> 
> If so, I think that we're already OK here, since s->hash_version is
> always going to take the same value as f->version, since f->version is
> assigned in bloom.c::load_bloom_filter_from_graph(), which does:
> 
>     filter->version = g->bloom_filter_settings->hash_version;
> 
> Or are we talking about two different things entirely? ;-)
> 
> Thanks,
> Taylor

Ah, sorry for not being clear. What I meant is in how you compute
hash_version after we have found that filter is not NULL.

> +struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
> +{
> +	struct bloom_filter *filter;
> +	int hash_version;
> +
> +	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
> +	if (!filter)
> +		return NULL;
> +
> +	prepare_repo_settings(r);

Up to here is fine.

> +	hash_version = r->settings.commit_graph_changed_paths_version;

Instead of doing this, do this (well, move the struct declaration to
the top):

  struct bloom_filter_settings *s = get_bloom_filter_settings(r);
  hash_version = s->hash_version == 2 ? 2 : 1;
  
> +	if (!(hash_version == -1 || hash_version == filter->version))

No need for the comparison to -1 here.

> +		return NULL; /* unusable filter */
> +	return filter;
> +}

This is fine.
Taylor Blau Aug. 25, 2023, 7 p.m. UTC | #6
On Thu, Aug 24, 2023 at 04:05:27PM -0700, Jonathan Tan wrote:
> Up to here is fine.
>
> > +	hash_version = r->settings.commit_graph_changed_paths_version;
>
> Instead of doing this, do this (well, move the struct declaration to
> the top):
>
>   struct bloom_filter_settings *s = get_bloom_filter_settings(r);
>   hash_version = s->hash_version == 2 ? 2 : 1;

Do we need this normalization? We assign s->hash_version in
commit-graph.c::graph_read_bloom_data() by reading it from the start of
the BDAT chunk, so this should only ever be 1 or 2.

> > +	if (!(hash_version == -1 || hash_version == filter->version))
>
> No need for the comparison to -1 here.

I'm not sure I understand your suggestion. When we fetch the filter from
get_or_compute_bloom_filter(), we have filter->version set to the
hash_version from the containing graph's Bloom filter settings.

So (besides the normalization), I would think that:

    struct bloom_filter_settings *s = get_bloom_filter_settings(r);
    struct bloom_filter *f = get_bloom_filter(...);

    assert(s->hash_version == f->version);

would hold.

I think the check that we want to make is instead: is this Bloom
filter's version (or equivalently, the hash version indicated by that
graph's BDAT chunk) something that we can read? And I think "what we can
read" here is dictated by the commit_graph_changed_paths_version member
of our repository_settings, no?

Thanks,
Taylor
Jonathan Tan Aug. 29, 2023, 4:49 p.m. UTC | #7
Taylor Blau <me@ttaylorr.com> writes:
> On Thu, Aug 24, 2023 at 04:05:27PM -0700, Jonathan Tan wrote:
> > Up to here is fine.
> >
> > > +	hash_version = r->settings.commit_graph_changed_paths_version;
> >
> > Instead of doing this, do this (well, move the struct declaration to
> > the top):
> >
> >   struct bloom_filter_settings *s = get_bloom_filter_settings(r);
> >   hash_version = s->hash_version == 2 ? 2 : 1;
> 
> Do we need this normalization? We assign s->hash_version in
> commit-graph.c::graph_read_bloom_data() by reading it from the start of
> the BDAT chunk, so this should only ever be 1 or 2.

I'm not sure offhand if we do...I wrote it this way to match
fill_bloom_key(), but fill_bloom_key() was written in that way because
it was the clearest, not specifically because it needed to normalize.

> > > +	if (!(hash_version == -1 || hash_version == filter->version))
> >
> > No need for the comparison to -1 here.
> 
> I'm not sure I understand your suggestion. When we fetch the filter from
> get_or_compute_bloom_filter(), we have filter->version set to the
> hash_version from the containing graph's Bloom filter settings.
> 
> So (besides the normalization), I would think that:
> 
>     struct bloom_filter_settings *s = get_bloom_filter_settings(r);
>     struct bloom_filter *f = get_bloom_filter(...);
> 
>     assert(s->hash_version == f->version);
> 
> would hold.

My mention to avoid the comparison to -1 was just for completeness
- since we're normalizing the value of hash_version to 1 or 2, we no
longer need to compare it to -1.

As for whether s->hash_version is always equal to f->version, I think
that it may not be true if for some reason, there are multiple commit
graph files on disk, not all with the same Bloom filter version.

> I think the check that we want to make is instead: is this Bloom
> filter's version (or equivalently, the hash version indicated by that
> graph's BDAT chunk) something that we can read? 

I think it's not "something that we can read" (eventually, we can read
all versions, we just treat them differently) but "the version that
fill_bloom_key" will use. We don't want this function to produce a Bloom
filter of version X and then have the calling code subsequently use it
with a Bloom key of version Y.

> And I think "what we can
> read" here is dictated by the commit_graph_changed_paths_version member
> of our repository_settings, no?

I don't think commit_graph_changed_paths_version always dictates
something - it could be -1 (which you have probably seen, since you
check it against -1 in the current version of the patch). One of the
points of my suggestion is to avoid this field completely.
Taylor Blau Aug. 29, 2023, 7:14 p.m. UTC | #8
On Tue, Aug 29, 2023 at 09:49:26AM -0700, Jonathan Tan wrote:
> > > > +	if (!(hash_version == -1 || hash_version == filter->version))
> > >
> > > No need for the comparison to -1 here.
> >
> > I'm not sure I understand your suggestion. When we fetch the filter from
> > get_or_compute_bloom_filter(), we have filter->version set to the
> > hash_version from the containing graph's Bloom filter settings.
> >
> > So (besides the normalization), I would think that:
> >
> >     struct bloom_filter_settings *s = get_bloom_filter_settings(r);
> >     struct bloom_filter *f = get_bloom_filter(...);
> >
> >     assert(s->hash_version == f->version);
> >
> > would hold.
>
> My mention to avoid the comparison to -1 was just for completeness
> - since we're normalizing the value of hash_version to 1 or 2, we no
> longer need to compare it to -1.
>
> As for whether s->hash_version is always equal to f->version, I think
> that it may not be true if for some reason, there are multiple commit
> graph files on disk, not all with the same Bloom filter version.

Apologies for not quite grokking this, but I am still somewhat confused.

Suppose we applied something like your suggestion on top of this patch,
like so:

--- 8< ---
diff --git a/bloom.c b/bloom.c
index 739fa093ba..ee81976342 100644
--- a/bloom.c
+++ b/bloom.c
@@ -253,16 +253,16 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
 struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
 {
 	struct bloom_filter *filter;
-	int hash_version;
+	struct bloom_filter_settings *settings;

 	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
 	if (!filter)
 		return NULL;

 	prepare_repo_settings(r);
-	hash_version = r->settings.commit_graph_changed_paths_version;
+	settings = get_bloom_filter_settings(r);

-	if (!(hash_version == -1 || hash_version == filter->version))
+	if (filter->version != (settings->hash_version == 2 ? 2 : 1))
 		return NULL; /* unusable filter */
 	return filter;
 }
--- >8 ---

We have a handful of failing tests, notably including t4216.1, which
tries to read settings->hash_version, but fails since settings is NULL.
I believe this happens when we have a computed Bloom filter prepared for
some commit, but have not yet written a commit graph containing that (or
any) Bloom filter.

But I think we're talking about different things here. The point of
get_bloom_filter() as a wrapper around get_or_compute_bloom_filter() is
to only return Bloom filters that are readable under the configuration
commitGraph.changedPathsVersion.

We have a handle on what version was used to compute Bloom filters in
each layer of the graph by reading its "version" field, which is written
via load_bloom_filter_from_graph().

So we want to return a non-NULL filter exactly when we (a) have a Bloom
filter computed for the given commit, and (b) its version matches the
version specified in commitGraph.chagnedPathsVersion.

Are you saying that we need to do the normalization ahead of time so
that we don't emit filters that have different hash versions when
working in a commit-graph chain where each layer may use different Bloom
filter settings? If so, then I think the change we'd need to make is
limited to:

--- 8< ---
diff --git a/bloom.c b/bloom.c
index 739fa093ba..d5cc23b0a8 100644
--- a/bloom.c
+++ b/bloom.c
@@ -260,9 +260,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
 		return NULL;

 	prepare_repo_settings(r);
-	hash_version = r->settings.commit_graph_changed_paths_version;
+	hash_version = r->settings.commit_graph_changed_paths_version == 2 ? 2 : 1;

-	if (!(hash_version == -1 || hash_version == filter->version))
+	if (hash_version == filter->version)
 		return NULL; /* unusable filter */
 	return filter;
 }
--- >8 ---

But that doesn't quite work, either, since it's breaking some assumption
from the caller and causing us not to write out any Bloom filters (I
haven't investigated further, but I'm not sure that this is the right
path in the first place...).

So I am not sure what you are suggesting, but I am sure that I'm missing
something.

Thanks,
Taylor
Jonathan Tan Aug. 29, 2023, 10:04 p.m. UTC | #9
Here are answers to your questions, but I'll also reply to the latest
version stating that I'm OK with this patch set being merged (with my
reasons).

Taylor Blau <me@ttaylorr.com> writes:
> Apologies for not quite grokking this, but I am still somewhat confused.
> 
> Suppose we applied something like your suggestion on top of this patch,
> like so:
> 
> --- 8< ---
> diff --git a/bloom.c b/bloom.c
> index 739fa093ba..ee81976342 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -253,16 +253,16 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
>  struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
>  {
>  	struct bloom_filter *filter;
> -	int hash_version;
> +	struct bloom_filter_settings *settings;
> 
>  	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
>  	if (!filter)
>  		return NULL;
> 
>  	prepare_repo_settings(r);
> -	hash_version = r->settings.commit_graph_changed_paths_version;
> +	settings = get_bloom_filter_settings(r);
> 
> -	if (!(hash_version == -1 || hash_version == filter->version))
> +	if (filter->version != (settings->hash_version == 2 ? 2 : 1))
>  		return NULL; /* unusable filter */
>  	return filter;
>  }
> --- >8 ---
> 
> We have a handful of failing tests, notably including t4216.1, which
> tries to read settings->hash_version, but fails since settings is NULL.
> I believe this happens when we have a computed Bloom filter prepared for
> some commit, but have not yet written a commit graph containing that (or
> any) Bloom filter.

Ah, I discovered this independently too. I think it happens when we
write version 2 Bloom filters to a repo that has no Bloom filters
currently. So, r->settings.commit_graph_changed_paths_version is set to
2, but settings is NULL. I think the solution has to be a combination
of both (use commit_graph_changed_paths_version, but if it is -1, check
get_bloom_filter_settings()). But having said that, as I said above, we
can go with what you have currently.

> But I think we're talking about different things here. The point of
> get_bloom_filter() as a wrapper around get_or_compute_bloom_filter() is
> to only return Bloom filters that are readable under the configuration
> commitGraph.changedPathsVersion.

But this is my contention - sometimes commitGraph.changedPathsVersion is
-1, so we need to find out which version is readable from elsewhere.

> We have a handle on what version was used to compute Bloom filters in
> each layer of the graph by reading its "version" field, which is written
> via load_bloom_filter_from_graph().
> 
> So we want to return a non-NULL filter exactly when we (a) have a Bloom
> filter computed for the given commit, and (b) its version matches the
> version specified in commitGraph.chagnedPathsVersion.

Yes, except add "or autodetected from the first commit graph file that
we read if nothing is specified in commitGraph.changedPathsVersion" to
the end.

> Are you saying that we need to do the normalization ahead of time so
> that we don't emit filters that have different hash versions when
> working in a commit-graph chain where each layer may use different Bloom
> filter settings? 

By "emit" do you mean write filters to disk? If yes, I'm worried about
reading them, not writing them - for example, when running "git log"
with a path. I am worried precisely about the commit-graph chain in
which different layers have different Bloom settings, yes.
diff mbox series

Patch

diff --git a/bloom.c b/bloom.c
index 9b6a30f6f6..739fa093ba 100644
--- a/bloom.c
+++ b/bloom.c
@@ -250,6 +250,23 @@  static void init_truncated_large_filter(struct bloom_filter *filter,
 	filter->version = version;
 }
 
+struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
+{
+	struct bloom_filter *filter;
+	int hash_version;
+
+	filter = get_or_compute_bloom_filter(r, c, 0, NULL, NULL);
+	if (!filter)
+		return NULL;
+
+	prepare_repo_settings(r);
+	hash_version = r->settings.commit_graph_changed_paths_version;
+
+	if (!(hash_version == -1 || hash_version == filter->version))
+		return NULL; /* unusable filter */
+	return filter;
+}
+
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
@@ -275,7 +292,8 @@  struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						     filter, graph_pos);
 	}
 
-	if (filter->data && filter->len)
+	if ((filter->data && filter->len) &&
+	    (!settings || settings->hash_version == filter->version))
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
diff --git a/bloom.h b/bloom.h
index 330a140520..2b1c124bb5 100644
--- a/bloom.h
+++ b/bloom.h
@@ -110,8 +110,24 @@  struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 const struct bloom_filter_settings *settings,
 						 enum bloom_filter_computed *computed);
 
-#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
-	(r), (c), 0, NULL, NULL)
+/*
+ * Find the Bloom filter associated with the given commit "c".
+ *
+ * If any of the following are true
+ *
+ *   - the repository does not have a commit-graph
+ *   - it has a commit-graph, but reading the commit-graph is disabled
+ *   - the given commit does not have a Bloom filter computed
+ *   - there is a Bloom filter for commit "c", but it cannot be read because
+ *     disabled
+ *
+ * , then `get_bloom_filter()` will return NULL. Otherwise, the corresponding
+ * Bloom filter will be returned.
+ *
+ * For callers who wish to inspect Bloom filters with incompatible hash
+ * versions, use get_or_compute_bloom_filter().
+ */
+struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c);
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,